This repository contains a GPU implementation of Darwin [1][2], a hardware-friendly DNA aligner. It consists of two parts, D-SOFT and GACT, which together follow the typical seed-and-extend approach. D-SOFT (Diagonal-band based Seed Overlapping based Filtration Technique) filters the search space by counting the non-overlapping bases covered by matching k-mers within a band of diagonals. GACT (Genomic Alignment using Constant Traceback memory) can align reads of arbitrary length while using constant memory for the compute-intensive step.
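The D-SOFT filtering idea described above can be illustrated with a minimal Python sketch. It is not the code in this repository: the function names, the parameters (`k`, `bin_size`, `threshold`, `stride`) and their default values are hypothetical and chosen only for illustration. The sketch indexes the reference k-mers, walks over the seeds of the query, and for each diagonal band counts how many distinct query bases are covered by seed hits. A band becomes a candidate for extension once that count reaches the threshold.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer to the list of its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def dsoft_candidates(reference, query, k=4, bin_size=8, threshold=8, stride=1):
    """Return {band: (ref_pos, query_pos)} for bands that reach the threshold.

    All parameter names and defaults are illustrative assumptions.
    """
    index = build_kmer_index(reference, k)
    bp_count = defaultdict(int)  # non-overlapping query bases counted per band
    last_hit_end = {}            # per band: query offset where the last hit ended
    candidates = {}

    for j in range(0, len(query) - k + 1, stride):
        for i in index.get(query[j:j + k], ()):
            band = (i - j) // bin_size  # neighbouring diagonals share one band
            # Bases of this seed already counted by the previous hit in the band.
            overlap = max(0, last_hit_end.get(band, j) - j)
            before = bp_count[band]
            bp_count[band] = before + (k - overlap)
            last_hit_end[band] = j + k
            # Report the band once, at the hit that crosses the threshold.
            if before < threshold <= bp_count[band]:
                candidates[band] = (i, j)
    return candidates


if __name__ == "__main__":
    ref = "TTGACCTAGGCATCGATTACGGAACTGCCA"
    read = ref[10:22]  # an error-free read taken from reference offset 10
    print(dsoft_candidates(ref, read))
```

Counting non-overlapping bases instead of raw seed hits prevents a few heavily overlapping seeds from inflating the score of a band. In a seed-and-extend pipeline, the reported positions would then be handed to the extension stage (GACT in Darwin).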
This implementation can run on the CPU only or use GPU acceleration. For a finer choice between individual optimizations, go back to commit e472745e. Compile the CPU version with './z_compile.sh', or the GPU version with './z_compile.sh GPU'. Two further compile options are 'TIME', which measures the CPU and GPU time spent in GACT for the GPU version, and 'NOSCORE', which removes the score calculation; in that case every overlap is reported with a score of 0.
To allow a more flexible substitution matrix, put back the 'gact_sub_mat' variable in darwin.cpp.
Usage: ./darwin
For 50 MB of PacBio human data, taken from the 54x dataset, 8 32 64 (the arguments passed to ./run.sh) was found to be the best run configuration. The included reads.fasta is a 10x E. coli dataset generated by PBSIM. Each read name contains the read's origin in the genome and its read length; these are used by the measure_sensitivity_PBSIM.py script.
The Makefile assumes Compute Capability 3.5.
Typical run:
    ./z_compile.sh GPU
    ./run.sh 8 32 64
    cat darwin.*.out | sort | uniq > out.darwin
    ./measure_sensitivity_PBSIM.py
[1] Darwin: A Hardware-acceleration Framework for Genomic Sequence Alignment https://www.biorxiv.org/content/early/2017/01/24/092171
[2] Darwin: A Genomics Co-processor Provides up to 15,000X Acceleration on Long Read Assembly https://dl.acm.org/citation.cfm?id=3173193