_ _ _
(_) (_)| |
___ _ _ _ __ ___ _ __ ______ _ __ ___ _ _ __ _ | |_ _ _ _ __ ___ _ __
/ __|| | | || '_ \ / _ \| '__||______|| '_ ` _ \ | || '_ \ | || __|| | | || '_ \ / _ \| '__|
\__ \| |_| || |_) || __/| | | | | | | || || | | || || |_ | |_| || |_) || __/| |
|___/ \__,_|| .__/ \___||_| |_| |_| |_||_||_| |_||_| \__| \__, || .__/ \___||_|
| | __/ || |
|_| |___/ |_|
super-minityper
is a set of cloud-based workflows for constructing SV graphs and mapping reads to them.
Structural variants frustrate read mapping because aligners often choose not to map read portions which map very distantly. Graphs allow incorporating known variants, including large ones, and then mapping directly to these. While this has been shown to reduce reference bias and improve read mappings when a sample contains variants in the graph, constructing graph genomes and operating on them has historically been difficult and time-consuming.
We present a set of cloud-based workflows — composed mostly of preexisting and optimized tools — to construct graphs containing structural variants and map reads to them. These workflows allow users to take arbitrary SV calls, construct a graph, and map reads to these graphs. This workflow prioritizes ease-of-use and speed, ingesting common input formats and returning results in minutes on commodity cloud VMs.
Relevant slides are here and in the docs directory.
super-minityper
is implemented as a DNAnexus cloud applet. There are analgous WDL files for performing all analyses as well.
The general workflow(s) are outlined in the below figure:
A graph can be constructed from either a FASTA reference genome and structural variant calls (in the VCF format) or a set of long reads, which are aligned and assembled using read-read pairwise alignment and the seqwish variation graph inducer.
The image includes the following tools:
Dockerfile is available at here
Build from source code
VERSION=0.1
docker build -t ncbicodeathons/superminityper:${VERSION}-cpu -f build/docker/Dockerfile --target super-minityper-cpu .
In addition to above tools, this image also includes NVIDIA's GPU-accelerated tools that are available from Clara Genomics Analysis SDK and racon-gpu
Note: the tools are compiled with CUDA v10.0 and nvidia-docker is required to run the image.
In Python, you can import claragenomics modules:
from claragenomics.io import fastaio
help(fastaio.write_fasta)
# write_fasta(seqs, filepath, gzip_compressed=False)
# Writes a fasta file for sequences.
# Args:
# seqs: list of 2-tuples containnig sequnces and their names, e.g [('seq1', 'ACGTC...'),
# ('seq2', 'TTGGC...'), ...]]
# filepath: path to file for writing out FASTA.
# gzip_compressed bool: If True then the read component of the sequence has been compres
# sed with gzip
# Returns:
# None.
Dockerfile is available at here
Build from source code
VERSION=0.1
docker build -t ncbicodeathons/superminityper:${VERSION} -f build/docker/Dockerfile --build-arg CUDA_VERSION=10.0 .
This image enables easy to use DNANexus's dxWDL compiler
Build from source code
DXWDL_VERSION=1.32
DXTOOLKIT_VERSION=0.288.0
DXTOOLKIT_PLATFORM=ubuntu-16.04-amd64
docker build -t ncbicodeathons/superminityper:dx-wdl-${DXWDL_VERSION}-${DXTOOLKIT_VERSION} \
-t ncbicodeathons/superminityper:dx-wdl \
--build-arg DXWDL_VERSION=${DXWDL_VERSION} \
--build-arg DXTOOLKIT_VERSION=${DXTOOLKIT_VERSION} \
--build-arg DXTOOLKIT_PLATFORM=${DXTOOLKIT_PLATFORM} \
-f build/docker/Dockerfile --target dx-wdl-builder .
Usage
docker run -it --name dx-builder -v $(pwd)/wdl:/workspace/wdl ncbicodeathons/superminityper:dx-wdl
# Inside the container:
source dx-toolkit/environment
dx login # login with your account
PROJECT_ID=<Project ID in DNANexus such as 'project-afQbBk8c5PvdQqQe6gqfX2gz'>
# Create Workflow for Pipeline1 (minimap2 | fpa | seqwish)
java -jar dxWDL.jar compile wdl/SuperMiniTyper_Pipeline1.wdl -project ${PROJECT_ID}
# Create Workflow for Pipeline1 with GPU (cudamapper | fpa | seqwish)
java -jar dxWDL.jar compile wdl/SuperMiniTyper_Pipeline1_GPU.wdl -project ${PROJECT_ID}
# Create applets for each tool.
java -jar dxWDL.jar compile wdl/SuperMiniTyper_minigraph.wdl -project ${PROJECT_ID}
java -jar dxWDL.jar compile wdl/SuperMiniTyper_minimap2.wdl -project ${PROJECT_ID}
java -jar dxWDL.jar compile wdl/SuperMiniTyper_seqwish.wdl -project ${PROJECT_ID}
java -jar dxWDL.jar compile wdl/SuperMiniTyper_svaha2.wdl -project ${PROJECT_ID}
java -jar dxWDL.jar compile wdl/SuperMiniTyper_cudamapper.wdl -project ${PROJECT_ID}
# After exit from the container's shell, the container is still there and you can attach the container by executing the following command
docker start dx-builder
docker attach dx-builder
# If you want to create new container, please execute `docker run` command after removing the existing container by the following command:
docker rm dx-builder
Dockerfile for erictdawson/base
: https://github.com/edawson/dawdl/blob/master/base/base.Dockerfile
minigraph (dockerhub): erictdawson/minigraph
svaha2 is a C++ tool for quickly generating simple structural variation graphs from a reference genome and a set of variants a VCF file. It does this using a construction algorithm that is designed to operate in time linear to the number of breakpoints. It also prioritizes efficient use of RAM.
source code: https://github.com/edawson/svaha2
The full algorithm is described in the svaha2 repository, but the main innovation is that pre-calculating the breakpoints (and sorting them) allows one to generate a graph with a coordinated ID space without requiring caching any of the graph in memory.
The output of svaha2 is a graph described in GFA v1.0.
minigraph provides an experimental read-to-graph mapper. We map reads to the GFA v1.0 graph output by svaha2. This enables fast, push-button read-to-variants alignment with just a FASTA reference, a VCF containing SVs, and a FASTA/Q readset.
source code: https://github.com/lh3/minigraph
GAF (output) format: https://github.com/lh3/gfatools/blob/master/doc/rGFA.md#the-graph-alignment-format-gaf
We use the minimap2 + seqwish flow to generate de novo graphs in the GFA1 format. This serves as the basis for aligning new reads to a reference graph.
Source code:
minimap2 - https://github.com/lh3/minimap2
seqwish - https://github.com/ekg/seqwish
fpa - https://github.com/natir/fpa
In our workflow, minimap2 generates an all-to-all mapping with alignments to generate a PAF file.
minimap2 -x ava-pb <.fa.gz> <.fa.gz> -t <threads> -c -X > out.paf
The set of overlaps generated by minimap2 is filtered using fpa to drop any overlaps where the alignment block is less 10kb. This reduces complexity of the graph by getting rid of edges corresponding to small overlaps.
fpa -i out.paf drop -l 10000 > filtered_out.paf
The filtered PAF, along with the reads, is then passed into seqwish
to generate a GFA.
seqwish -s <.fa.gz> -p filtered_out.paf -g out.gfa -b out.graph -t <threads>
The WDL files for this pipeline are in minimap2 WDL and seqwish WDL.
The WDL file for minimap2 | fpa | seqwish
is available in here.
Once we have the GAF output (alignments of reads to a graph), we do some basic analysis of the contents of the GAF
file using scripts under the analysis
folder.
We are starting with simple counts such as number of alignments, and a breakdown of how well the queries are aligned to the graph in each alignment.
The gaf.py
script can be extended to understand the GAF structure in more depth.
A GPU accelerated implementation of minimap2 is under development in the Clara Genomics Analysis SDK.
source: https://github.com/clara-genomics/ClaraGenomicsAnalysis
Once the tool is more stable, an alternative pipeline would be replace the minimap2
step with cudamapper
for faster overlap generation.
An experimental WDL file for using cudamapper
has been made avaialble in cudamapper WDL and Workflow WDL file for the pipeline (cudamapper | fpa | seqwish
) is available in here.
We don't report any intermediate statistics on our assembly graph from minimap2/cudamapper
and seqwish
.
However, this information would be useful. Adding a DNAnexus cloud applet for GFAKluge
would allow users to see the maximum graph length, contig N50, etc.
minigraph
is still an experimental mapper. Other mappers such as vg have
been used more extensively and may provide different results. Additional mappers could be added easily by incorporating
a new applet, assuming the mapper can consume GFA.
The end goal of super-minityper
is to be a fast, easy-to-use SV genotyper.
To accomlish this, we need to be able to translate between node IDs in the graph
and variants in the GFA graph. This will require extending svaha2
to incorporate variant
information into its output.
We also need to extend the output of svaha2
to include the necessary tags for full rGFA compatibility.
This enables stable, reference-relative mappings and annotations and better compatibility with minigraph.