DNA shearing is a crucial first step in most NGS protocols for Illumina. In recent years, enzymatic fragmentation has been shown to be a cost- and time-effective alternative to physical shearing (i.e. sonication). We discovered that enzymatic fragmentation leads to unexpected alteration of the original DNA source material. We provide fade as a method for identifying and removing enzymatic fragmentation artifacts.
Our documentation has information on installing fade and its prerequisites. Alternatively, you can install fade via conda.
conda install -c bioconda fade
fade annotate -b sam1.bam ref.fa > sam1.anno.bam
samtools sort -n sam1.anno.bam > sam1.anno.qsort.bam # recommended but not necessary
fade out -b sam1.anno.qsort.bam > sam1.filtered.bam
Note: Queryname sorting is suggested between running fade annotate and running fade out. This allows fade out to eject whole fragments containing an artifact on either R1 or R2. If this step is not performed, fade out will simply eject only the read with an artifact (orphaning its mate). The justification for ejecting whole fragments is that we assume the whole fragment is biased by the effects of enzymatic fragmentation.
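To illustrate why the sort order matters, here is a minimal Python sketch. It is not fade's implementation: the `(qname, is_artifact)` tuples and the `filter_fragments` helper are hypothetical stand-ins for annotated BAM records. The sketch assumes a filter that only looks at adjacent records, so mates must sit next to each other for the whole fragment to be dropped.

```python
from itertools import groupby


def filter_fragments(records):
    """Drop every run of adjacent same-name records that contains an artifact.

    `records` is an iterable of hypothetical (qname, is_artifact) tuples.
    groupby only merges neighbouring records, so mates are treated as one
    fragment only when the input is queryname sorted.
    """
    kept = []
    for _, group in groupby(records, key=lambda rec: rec[0]):
        group = list(group)
        if not any(is_artifact for _, is_artifact in group):
            kept.extend(group)
    return kept


# Queryname sorted: both reads of frag1 are removed together.
qsorted = [("frag1", False), ("frag1", True), ("frag2", False), ("frag2", False)]
print(filter_fragments(qsorted))

# Not queryname sorted: the clean mate of frag1 survives as an orphan.
unsorted = [("frag1", False), ("frag2", False), ("frag1", True), ("frag2", False)]
print(filter_fragments(unsorted))
```

In the second call the artifact read is removed but its mate is kept, which mirrors the orphaning described in the note.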
docker run -v `pwd`:/data blachlylab/fade annotate -b /data/sam1.bam /data/ref.fa > sam1.anno.bam
docker run -v `pwd`:/data blachlylab/fade out -b -c /data/sam1.anno.qsort.bam > sam1.filtered.bam
Windows
docker run -v C:\path\to\folder:/data blachlylab/fade annotate -b /data/sam1.bam /data/ref.fa > sam1.anno.bam
docker run -v C:\path\to\folder:/data blachlylab/fade out -b -c /data/sam1.anno.qsort.bam > sam1.filtered.bam
Note: fade annotate works in parallel. Because of this, fade doesn't necessarily write the output in the same order as the input, so your sorting will be affected. You will likely need to re-sort using samtools sort if you would like to use IGV or samtools index. fade out, when the -c flag is used, will also affect sorting and mate information, as it can modify the starting position of an alignment. If using fade out -c, you should consider also running a tool like Picard's FixMateInformation after re-sorting.
fade
Fragmentase Artifact Detection and Elimination
usage: ./fade [subcommand]
annotate: marks artifact reads in bam tags (must be done first)
out: eliminates artifact from reads(may require queryname sorted bam)
stats: reports extended information about artifact reads
stats-clip: reports extended information about all soft-clipped reads
extract: extracts artifacts into a mapped bam
-h --help This help information.
fade annotate
Fragmentase Artifact Detection and Elimination
annotate: performs re-alignment of soft-clips and annotates bam records with bitflag (rs) and realignment tags (am)
usage: ./fade annotate [BAM/SAM input] [Indexed fasta reference]
-t --threads extra threads for parsing the bam file
--min-length Minimum number of bases for a soft-clip to be considered for artifact detection
-w --window-size Number of bases considered outside of read or mate region for re-alignment
-b --bam output bam
-u --ubam output uncompressed bam
-h --help This help information.
fade out
Fragmentase Artifact Detection and Elimination
out: removes all read and mates for reads contain the artifact (used after annotate and requires queryname sorted bam)
or, with the -c flag, hard clips out artifact sequence from reads
usage: ./fade out [BAM/SAM input]
-c --clip clip reads instead of filtering them
-t --threads extra threads for parsing the bam file
-b --bam output bam
-u --ubam output uncompressed bam
-h --help This help information.
fade stats
Fragmentase Artifact Detection and Elimination
stats: reports extended information about artifact reads (used after annotate)
-t --threads threads for parsing the bam file
-h --help This help information.
fade stats-clip
Fragmentase Artifact Detection and Elimination
stats-clip: reports extended information about all soft-clipped reads (used after annotate)
-t --threads threads for parsing the bam file
-h --help This help information.
fade extract
Fragmentase Artifact Detection and Elimination
extract: extracts artifacts into a mapped bam
usage: ./fade extract [BAM/SAM input]
-t --threads extra threads for parsing the bam file
-b --bam output bam
-u --ubam output uncompressed bam
-h --help This help information.
FADE is written in D and uses the htslib library via dhtslib, and the parasail library via dparasail. FADE accepts SAM/BAM/CRAM files containing reads that have been mapped to a reference genome and filters or cleans up artifact-containing reads according to the following procedure.
FADE is designed to determine a sequencing read’s enzymatic artifact status by employing aligner soft-clipping. Soft-clipping is an action performed by the aligner to improve the alignment score of a read to the reference by ignoring a portion at one end of the read. Soft-clipping can help an aligner correctly align a read that has sequencing errors or adapter contamination at one end. FADE uses soft-clips to identify reads that potentially contain enzymatic artifacts.
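As an illustration of what a soft-clip looks like in an alignment record, the following Python sketch reads the soft-clip lengths from a SAM CIGAR string. It is only a sketch and not FADE's code (FADE is written in D): the `soft_clips` and `is_candidate` helpers are hypothetical, and `min_length` stands in for the value passed to --min-length, whose default is not stated here.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar):
    """Return (left, right) soft-clip lengths for a SAM CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may flank soft clips, so ignore them when looking at the ends.
    core = [item for item in ops if item[1] != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right


def is_candidate(cigar, min_length):
    """True if either soft-clip is at least `min_length` bases long."""
    return max(soft_clips(cigar)) >= min_length


print(soft_clips("20S80M"))       # soft-clip on the left end only
print(soft_clips("5H10S70M15S"))  # soft-clips on both ends, inside a hard clip
print(is_candidate("97M3S", min_length=6))
```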
FADE makes available several subcommands that all rely on the algorithm described above.
* The 300 nt padding on each end of the mapped region provides ample search space for
artifact alignment search without being too computationally expensive; most artifacts originate
very close to the mapped region and 300 nt was chosen as an optimal tradeoff, but could be adjusted.
** Harsher gap penalties make the algorithm strict about allowing gaps,
since we expect the artifact sequences to directly match the reference, except for soft-clipped
regions derived from sequencing error. A soft-clipped region is considered to be an artifact if there
is a 90% or greater match to the opposite strand sequence.
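The 90% opposite-strand criterion can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not FADE's algorithm: FADE performs a gapped alignment with parasail, whereas this sketch slides the soft-clipped sequence along the reverse complement of the reference window without gaps. The helper names and the toy sequences are hypothetical; in practice the window would be the mapped region plus the padding described above.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the opposite-strand sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def looks_like_artifact(clip, reference_window, threshold=0.9):
    """True if the clip matches the opposite strand of the window at >= threshold.

    Ungapped simplification: every offset of the clip along the
    reverse-complemented window is tried and the best identity is kept.
    """
    opposite = reverse_complement(reference_window)
    n = len(clip)
    best = max(
        (identity(clip, opposite[i:i + n]) for i in range(len(opposite) - n + 1)),
        default=0.0,
    )
    return best >= threshold


window = "TTTTACGGTCAATGCCCC"  # toy reference window
print(looks_like_artifact("CATTGACCGT", window))  # exact opposite-strand match
print(looks_like_artifact("TTTTTTTTTT", window))  # unrelated sequence
```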