annalam / breakfast

Software for detecting genomic structural variants from DNA sequencing data
MIT License

Breakfast

Breakfast is software for detecting genomic structural variants from DNA sequencing data.

Installation

Install Rust (version 1.31 or later). Then run the following command:

cargo install --force --git https://github.com/annalam/breakfast

Running Breakfast

To run Breakfast, you need a BAM file containing sequenced reads (in this example, tumor.bam). You also need a Bowtie index, and the Bowtie1 executable must be in your PATH. A Breakfast analysis begins with the breakfast detect command, which searches the BAM file for unaligned reads that support a genomic breakpoint:

breakfast detect tumor.bam bowtie_indexes/hg38 > tumor.sv
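
When there are several BAM files, the same command can be repeated for each of them. The following is a minimal Python sketch of such a batch run, not part of Breakfast itself: the helper names (`detect_command`, `sv_path_for`, `run_detect`) are hypothetical, and it assumes that each output file should be named after its BAM file, as in the example above.

```python
import subprocess
from pathlib import Path


def detect_command(bam_path, bowtie_index):
    """Build the argument list for one `breakfast detect` run."""
    return ["breakfast", "detect", str(bam_path), str(bowtie_index)]


def sv_path_for(bam_path):
    """Name the output after the BAM file, e.g. tumor.bam -> tumor.sv (assumed convention)."""
    return Path(bam_path).with_suffix(".sv")


def run_detect(bam_paths, bowtie_index):
    """Run `breakfast detect` on each BAM file, redirecting stdout to a .sv file."""
    for bam in bam_paths:
        with open(sv_path_for(bam), "w") as out:
            # check=True raises CalledProcessError if breakfast exits with an error.
            subprocess.run(detect_command(bam, bowtie_index), stdout=out, check=True)
```

For example, `run_detect(["tumor.bam", "normal.bam"], "bowtie_indexes/hg38")` would write tumor.sv and normal.sv next to the input files, provided that breakfast and Bowtie1 are in your PATH. The file name normal.bam is only an illustration.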

Once a *.sv file has been produced for each of your BAM files, you can use the breakfast filter command to require a minimum number of supporting reads, discard short indels, and filter out germline rearrangements and technical artifacts. The command takes a sample sheet that describes which samples are tumor samples and which are germline or control samples:

breakfast filter --min-reads=4 --min-distance=10 --output-dir=filtered sample_sheet.sv

Finally, the breakfast annotate command can be used to annotate *.sv files with information about adjacency of rearrangement breakpoints to genes:

breakfast annotate tumor.sv genes.bed > tumor.annotated.sv

Detailed overview of the Breakfast algorithm

Unaligned reads are split into two anchors of customizable size: one anchor from the 5' end of the read, and one anchor from the 3' end of the read. These anchors are then aligned against the reference genome using a Bowtie index. If both anchors align to the reference genome (but the read as a whole did not), the read is considered to support the existence of a genomic rearrangement. Aligned reads in the input BAM file are omitted from analysis.
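
The anchor-splitting step can be illustrated with a short Python sketch. This is not Breakfast's actual implementation: the function names are hypothetical, the `aligns` callback stands in for a Bowtie alignment, and the rule that a read must be at least two anchors long (so that the anchors do not overlap) is an assumption.

```python
def split_into_anchors(read, anchor_len):
    """Return the (5' anchor, 3' anchor) of a read, or None if the read is too short.

    Assumption: anchors must not overlap, so the read must span two anchor lengths.
    """
    if anchor_len <= 0 or len(read) < 2 * anchor_len:
        return None
    return read[:anchor_len], read[-anchor_len:]


def supports_rearrangement(read, anchor_len, aligns):
    """Check whether both anchors of an unaligned read align to the reference.

    `aligns` is a hypothetical callback that takes an anchor sequence and
    returns True if it aligns; in Breakfast this role is played by Bowtie.
    """
    anchors = split_into_anchors(read, anchor_len)
    if anchors is None:
        return False
    five_prime, three_prime = anchors
    return aligns(five_prime) and aligns(three_prime)
```

With a toy reference such as `reference = "AAAACCCCGGGGTTTT"` and `aligns = lambda anchor: anchor in reference`, the read `"AAAATTTT"` with an anchor length of 4 is reported as supporting a rearrangement, because both of its anchors occur in the reference even though the read as a whole does not.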

Duplicate DNA fragments are identified based on "fragment signatures". For each unaligned read, a fragment signature is generated by taking the first 8 bases of the read, and the first 8 bases of its paired mate. This sequence identifies the boundaries of the DNA fragment. When reporting evidence for an identified genomic breakpoint, Breakfast only reports one read from each cluster of reads that shares the same fragment signature. In this situation, Breakfast preferentially picks the read that has the highest degree of overlap with the genomic breakpoint (i.e. longest flanks).
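
The duplicate handling described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than Breakfast's own code: the `SupportingRead` type and the function names are hypothetical, and "longest flanks" is modeled here as the length of the shorter of the two flanks around the breakpoint.

```python
from dataclasses import dataclass

SIGNATURE_BASES = 8  # bases taken from the read and from its mate


@dataclass
class SupportingRead:
    seq: str          # sequence of the unaligned read
    mate_seq: str     # sequence of its paired mate
    left_flank: int   # read bases on one side of the breakpoint
    right_flank: int  # read bases on the other side of the breakpoint


def fragment_signature(read):
    """Concatenate the first 8 bases of the read and the first 8 bases of its mate."""
    return read.seq[:SIGNATURE_BASES] + read.mate_seq[:SIGNATURE_BASES]


def flank_score(read):
    """Assumed measure of breakpoint overlap: the shorter of the two flanks."""
    return min(read.left_flank, read.right_flank)


def deduplicate(reads):
    """Keep one read per fragment signature, preferring the longest flanks."""
    best = {}
    for read in reads:
        signature = fragment_signature(read)
        if signature not in best or flank_score(read) > flank_score(best[signature]):
            best[signature] = read
    return list(best.values())
```

Two reads that begin with the same 8 bases and whose mates also begin with the same 8 bases fall into one cluster, and only the read whose breakpoint sits closest to the middle of the read is kept.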