Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI).
UMIs are a popular way to identify duplicate DNA/RNA reads caused by PCR amplification. This requires software for collapsing duplicate reads with the same UMI, while accounting for sequencing/PCR errors. This tool implements many efficient algorithms for orders-of-magnitude faster UMI deduplication than previous tools (UMI-tools, etc.), while maintaining similar functionality. This is achieved by using faster data structures with n-grams and BK-trees, along other techniques that are carefully implemented to scale well to larger datasets and longer UMIs. Users of UMICollapse have reported speedups from taking hours or days to run with a previous tool to taking only a few minutes with this tool with real datasets!
The preprint paper is available here and it has been published in PeerJ. If you use this code, please cite
@article{liu2019algorithms,
title={Algorithms for efficiently collapsing reads with Unique Molecular Identifiers},
author={Liu, Daniel},
journal={bioRxiv},
year={2019},
publisher={Cold Spring Harbor Laboratory}
}
UMICollapse can be installed using conda
:
conda install -c bioconda umicollapse
It is also available as a nf-core
module.
(Thanks to @CharlotteAnne!)
Alternatively, you can clone this repository:
git clone https://github.com/Daniel-Liu-c0deb0t/UMICollapse.git
cd UMICollapse
Then, install the dependencies, which are used for FASTQ/SAM/BAM input/output operations. Make sure you have Java 11.
mkdir lib
cd lib
curl -O -L https://repo1.maven.org/maven2/com/github/samtools/htsjdk/2.19.0/htsjdk-2.19.0.jar
curl -O -L https://repo1.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.3/snappy-java-1.1.7.3.jar
cd ..
Now you have UMICollapse installed!
First, get some sample data from the UMI-tools repository. These aligned reads have their UMIs extracted and concatenated to the end of their read headers (you can do this with the extract
tool in UMI-tools, using "_
" as the UMI separator). Make sure you have samtools
installed to index the BAM file.
mkdir test
cd test
curl -O -L https://github.com/CGATOxford/UMI-tools/releases/download/1.0.0/example.bam
samtools index example.bam
cd ..
Finally, test/example.bam
can be deduplicated.
./umicollapse bam -i test/example.bam -o test/dedup_example.bam
The UMI length will be autodetected, and the output test/dedup_example.bam
should only contain reads that have a unique UMI. Unmapped reads are removed. One goal of UMICollapse is to offer similar deduplication results as UMI-tools, so it can be easily integrated into existing workflows.
Here is a hypothetical example with paired-end reads:
./umicollapse bam -i paired_example.bam -o dedup_paired_example.bam --umi-sep : --paired --two-pass
This should be equivalent to the following with UMI-tools:
umi_tools dedup -I paired_example.bam -S dedup_paired_example.bam --umi-separator=: --paired
By default, clusters/groups of reads with the same UMI are collapsed into one consensus read. It is possible to only mark duplicate reads with the --tag
option. A sample output SAM/BAM record would look like
SRR2057595.13407254_ACCGGTTTA 16 chr1 3812795 255 50M * 0 0 * * XA:i:2 MD:Z:41T2T5 MI:Z:3389 NM:i:2 RX:Z:ACCGGTTTA cs:i:74 su:i:74
The above record is the consensus read of a group with ID 3389
. The cluster/group size (cs
in BAM/SAM mode or cluster_size
in FASTQ mode) is 74
, and all of the UMIs in the group are the same because the attribute su = 74
(or same_umi
in FASTQ mode) indicates the number of reads with the exact same UMI. Note that only the consensus read of each cluster would have the cluster size tag, so typically reads that are not consensus reads would only have the cluster ID as their only tag. Reads that are not the consensus read will also be marked with the duplicate flag in the SAM/BAM record. Note that only the forwards reads are tagged in paired-end mode. This also currently does not work with --two-pass
. In fastq
mode, tags are appended to the header of each read.
The examples above are based on the workflow where reads are aligned to produce SAM/BAM files before collapsing them based on their UMIs at each unique alignment coordinate. It is also possible to collapse reads based on their sequences directly, without aligning. This may be preferable or faster in some workflows. This can be done by specifying the fastq
option instead of bam
and providing an input FASTQ file:
./umicollapse fastq -i input.fastq -o output.fastq
It is important to note that UMIs are first collapsed by identity (exact same UMIs), and then grouped/clustered using the directional/adjacency/connected components algorithms that allow for some errors/mismatches.
Run
./build.sh
to build the executable .jar
file.
Running basic tests after the .jar
file is built:
./test.sh
There are also some small scripts for testing and debugging. For example, comparing two files to check if the UMIs are the same can be done with:
./run.sh test.CompareDedupUMI test/dedup_example_1.bam test/dedup_example_2.bam
or running benchmarks:
./run.sh test.BenchmarkTime 10000 10 1 ngrambktree
sam
or bam
: the input is an aligned SAM/BAM file with the UMIs in the read headers. This separately deduplicates each alignment coordinate. Unmapped reads are removed.fastq
: the input is a FASTQ file. This deduplicates the entire FASTQ file based on each entire read sequence. In other words, the entire read sequence is treated as the "UMI".-i
: input file. Required.-o
: output file. Required.-k
: number of substitution edits to allow. Default: 1.-u
: the UMI length. If set to a length in fastq
mode, then trims the prefix of each read (note: does not affect the sequence used for deduplicating). Default: autodetect.-p
: threshold percentage for identifying adjacent UMIs in the directional algorithm. Default: 0.5.-t
: parallelize the deduplication of each separate alignment position. Using this is discouraged as it is lacking many features. Default: false.-T
: parallelize the deduplication of one single alignment position. The data structure can only be naive
, bktree
, and fenwickbktree
. Using this is discouraged as it is lacking many features. Default: false.--umi-sep
: separator string between the UMI and the rest of the read header. Default: _
.--algo
: deduplication algorithm. Either cc
for connected components, adj
for adjacency, or dir
for directional. Default: dir
.--merge
: method for identifying which UMI to keep out of every two UMIs. Either any
, avgqual
, or mapqual
. Default: mapqual
for SAM/BAM mode, avgqual
for FASTQ mode.--data
: data structure used in deduplication. Either naive
, combo
, ngram
, delete
, trie
, bktree
, sortbktree
, ngrambktree
, sortngrambktree
, or fenwickbktree
. Default: ngrambktree
.--two-pass
: use a separate two-pass algorithm for SAM/BAM deduplication. This may be slightly slower, but it should use much less memory if the reads are approximately sorted by alignment coordinate. Default: false.--paired
: use paired-end mode, which deduplicates pairs of reads from a SAM/BAM file. The template length of each read pair, along with the alignment coordinate and UMI of the forwards read, are used to deduplicate read pairs. This is very memory intensive, and the input SAM/BAM files should be sorted. Default: false (single-end).--remove-unpaired
: remove unpaired reads during paired-end mode. Default: false.--remove-chimeric
: remove chimeric reads (pairs map to different references) during paired-end mode. Default: false.--keep-unmapped
: keep unmapped reads (no paired-end mode). Default: false.--tag
: tag reads that belong to the same group without removing them. In fastq
mode, this will append cluster_id=[unique ID for all reads of the same cluster]
to the header of every read. cluster_size=[number of reads in the cluster]
will only be appended to the header of a consensus read for an entire group/cluster. same_umi=[number of reads with the same UMI]
will be appended to the header of the "best" read of a group of reads with the exact same UMI (not allowing mismatches). In sam
/bam
mode, then all reads but the consensus reads will be marked with the duplicate flag. The MI
attribute will be set with the cluster_id
and the RX
attribute will be set with the UMI of the consensus read. If applicable, the cs
attribute is set with the cluster_size
, and the su
attribute is set with the same_umi
count. For paired-end reads, only the forwards reads are tagged. This does not work with the --two-pass
feature.If you need more memory to process larger datasets, then modify the umicollapse
file. -Xms
represents the initial heap size, -Xmx
represents the max heap size, and -Xss
represents the stack size. If you do not know how much memory is needed, it may be a good idea to set a small initial heap size, and a very large max heap size, so the heap can grow when necessary. If memory usage is still is an issue, use the --two-pass
option to save memory when the reads are approximately sorted (this is not a strict requirement, its just that when reads with the same alignment coordinate are close together in the file, they do not have to be kept in memory for very long).
Please open an issue if you have any questions/bugs/suggestions!