MIT License

Cue: a deep learning framework for SV calling and genotyping

Table of Contents

Overview
Installation
Tutorial
User guide
Recommended workflow

Overview

Cue is a deep learning framework for structural variant (SV) calling and genotyping. At a high level, Cue operates in the stages illustrated in the figure below:

[Figure: overview of the Cue stages]

The current version of Cue can detect and genotype the following SV types larger than 5 kbp: deletions (DELs), tandem duplications (DUPs), inversions (INVs), deletion-flanked inversions (INVDELs), and inverted duplications (INVDUPs).

For more information, please see the following preprint and video.

Installation

Set up a Python virtual environment (recommended)

To deactivate the environment: $> deactivate

Download the latest pre-trained Cue model

The latest pre-trained Cue model can be downloaded from this link.

To download the latest model into the data/models directory:

wget --directory-prefix=data/models/ https://storage.googleapis.com/cue-models/latest/cue.v2.pt

Pre-trained models are stored in the following public Google Cloud Storage bucket.
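
Where wget is not available, the same download can be sketched in Python using only the standard library. The helper names model_destination and download_model are illustrative and not part of Cue; the URL and target directory are copied from the wget command above.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URL and directory taken from the wget command above.
MODEL_URL = "https://storage.googleapis.com/cue-models/latest/cue.v2.pt"
MODELS_DIR = Path("data/models")


def model_destination(url: str, directory: Path) -> Path:
    """Return the local path that the file at `url` would be saved to."""
    filename = PurePosixPath(urlparse(url).path).name
    return directory / filename


def download_model(url: str = MODEL_URL, directory: Path = MODELS_DIR) -> Path:
    """Download the model into `directory`, creating the directory if needed."""
    directory.mkdir(parents=True, exist_ok=True)
    destination = model_destination(url, directory)
    urlretrieve(url, str(destination))  # performs the actual network download
    return destination
```

Calling download_model() with no arguments saves the file under data/models, like the wget command above.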

Data

Synthetic training and benchmark data is available in the public Google Cloud Storage datasets bucket.

Tutorial

We recommend trying the provided demo Jupyter notebook to ensure that the software was properly installed and to experiment with running Cue. For convenience, Jupyter is already included in the installation requirements above; it can also be installed separately from here. In this demo, we use Cue to discover variants in a small BAM file. The YAML config files needed to execute this workflow are provided in the data/demo/config directory.

User guide

In addition to the functionality to call structural variants, the framework can be used to execute custom model training, evaluation, and image generation. The engine directory contains the following key high-level scripts to train/evaluate the model and generate image datasets:

Each script accepts as input one or more YAML config files, which encode a variety of parameters. Template config files with key parameters are provided in the config directory. The config/custom directory contains template config files with additional parameters that can be useful when generating custom models.

The key required and optional YAML parameters for each Cue command are listed below.

call.py (data YAML):

call.py (model YAML):

train.py:

generate.py:

view.py:

Recommended workflow

  1. Create a new directory.
  2. Place YAML config file(s) in this directory (see the provided templates).
  3. Populate the YAML config file(s) with the parameters specific to this experiment.
  4. Execute the appropriate engine script, providing the path to the newly configured YAML file(s). The engine scripts will automatically create auxiliary directories with results in the folder where the config YAML files are located.
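
Steps 1 and 2 above can be scripted. The sketch below uses only the Python standard library; prepare_experiment is a hypothetical helper and not part of Cue, and any directory or template names passed to it are placeholders chosen by the user.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def prepare_experiment(experiment_dir: Path, templates: Iterable[Path]) -> List[Path]:
    """Create a new experiment directory and copy template YAML configs into it.

    Refuses to reuse an existing directory, so results from separate
    experiments are not mixed together.
    """
    experiment_dir = Path(experiment_dir)
    # Step 1: create a new directory (raises FileExistsError if it exists).
    experiment_dir.mkdir(parents=True, exist_ok=False)
    copied = []
    for template in templates:
        target = experiment_dir / Path(template).name
        # Step 2: place a copy of each template config in the directory.
        shutil.copyfile(template, target)
        copied.append(target)
    # Step 3 (filling in experiment-specific parameters) is left to the user.
    return copied
```

The returned paths point to the copied config files, which can then be edited and passed to the appropriate engine script as described in step 4.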

Authors

Victoria Popic (vpopic@broadinstitute.org)

Feedback and technical support

For questions, suggestions, or technical assistance, please create an issue on the Cue GitHub issues page or reach out to Victoria Popic at vpopic@broadinstitute.org.