CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Deduplication of UMI on Read2 only #599

Closed LeoVincenzi closed 1 year ago

LeoVincenzi commented 1 year ago

Dear developers, I'm facing issues applying umi_tools dedup to a BAM alignment. I'm using data from a Takara SMARTer Stranded Total RNA-Seq Kit v3 library, so the UMI sequences are present only on Read 2, and the manual reports that 'The first 8 nt of the second sequencing read (Read 2) are UMIs followed by 3 nucleotides of UMI-linker and 3 nucleotides derived from the Pico v3 SMART UMI Adapter'.

After trimming the Illumina adapters with cutadapt, I extracted the UMIs with umi_tools extract and then removed the 6 nt of UMI-linker/adapter with reformat.sh, as follows:

umi_tools extract -I trimmed2.fastq.gz --bc-pattern=NNNNNNNN --log=processed.log -S temp_R2.fastq.gz
reformat.sh -Xmx2g in=temp_R2.fastq.gz out=trimmed2.UMI.fastq.gz forcetrimleft=6
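(I understand the extraction and linker trimming could probably also be done in a single extract call using the regex extraction mode with a discard group; this is only a sketch based on the 8 nt UMI + 6 nt linker/adapter layout quoted from the manual, which I have not tested:

umi_tools extract --extract-method=regex --bc-pattern='(?P<umi_1>.{8})(?P<discard_1>.{6})' -I trimmed2.fastq.gz --log=processed.log -S trimmed2.UMI.fastq.gz
)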

I have then aligned trimmed1.fastq.gz and trimmed2.UMI.fastq.gz against a reference genome with bowtie2, and then I wanted to perform deduplication, so I ran:

umi_tools dedup -I Sample.sort.bam --log=log.out --output-stats=deduplicated -S deduplicated.bam

Anyway, I get the following error:

Traceback (most recent call last):
  File "/home/leonardo/miniconda3/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/leonardo/miniconda3/lib/python3.10/site-packages/umi_tools/umi_tools.py", line 66, in main
    module.main(sys.argv)
  File "/home/leonardo/miniconda3/lib/python3.10/site-packages/umi_tools/dedup.py", line 334, in main
    reads, umis, umi_counts = processor(
  File "/home/leonardo/miniconda3/lib/python3.10/site-packages/umi_tools/network.py", line 419, in __call__
    clusters = self.UMIClusterer(counts, threshold)
  File "/home/leonardo/miniconda3/lib/python3.10/site-packages/umi_tools/network.py", line 367, in __call__
    assert max(len_umis) == min(len_umis), (
AssertionError: not all umis are the same length(!): 37 - 39

If it helps, here is how the BAM file is formatted:

A01083:231:HTCHVDSX5:4:2573:21423:21778 101 LR862356.1 10182 0 * = 10182 0 CTCATACCGGGCTTCCGCTTCCGCTCGCTGTCCTCGTCCAGGTCTTCGTTCAGGGACCGTCCTGATCGCGATTTCCCAGTCAGCTGTCATCAAGTGGAAAATGGTGAGAAACCACGTGCGTGGCTTCCACACGCTCGGCGCGCACTAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF YT:Z:UP
A01083:231:HTCHVDSX5:4:1121:25843:19069_ACACGAGT 153 LR862356.1 10182 0 92M2I14M1I25M = 10182 0 TCGTTCAGGGACCGTCCTGATCGCGATTTCCCAGTCAGCTGTCATCAAGTGGAAAATGGTGAGAAACCACGTGCGTGGCTTCCACACGCTCGGCGCGCACTAGGCAGCTGAGGAGGAGCAGCGAGGCAGTGAGA FFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-71 XN:i:0 XM:i:11 XO:i:2 XG:i:3 NM:i:14 MD:Z:32C10G0C4A6C0A11A16T2A9G7G23 YT:Z:UP

Any help to better understand what I am facing would be appreciated. Thanks

IanSudbery commented 1 year ago

Hi Leo,

The problem is that dedup requires that the reads in a pair have the same name, and that both the read1 and read2 names contain the UMI for that pair (irrespective of whether the UMI was found in the read sequence of only read1, only read2, or both).

To do this, you need to provide both the read1 and read2 files to extract. You also need to run dedup in paired mode with --paired.

So you would extract with:

$ umi_tools extract -I trimmed2.fastq.gz --bc-pattern=NNNNNNNN --log=processed.log -S temp_R2.fastq.gz --read2-in=trimmed1.fastq.gz --read2-out=temp_R1.fastq.gz
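After that, both mates should share the same read name with the UMI appended, e.g. (taking the read name from your BAM excerpt, so the UMI shown is just illustrative):

@A01083:231:HTCHVDSX5:4:1121:25843:19069_ACACGAGT (in temp_R2.fastq.gz)
@A01083:231:HTCHVDSX5:4:1121:25843:19069_ACACGAGT (in temp_R1.fastq.gz)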

and then the dedup would be:

$ umi_tools dedup -I Sample.sort.bam --log=log.out --paired -S deduplicated.bam

Note that we don't recommend running --output-stats on full BAMs because it is very time-consuming and memory hungry.

LeoVincenzi commented 1 year ago

Great! That works very fast, indeed. Once I have performed the deduplication, how can I compute the duplication rate of my input bam file?

IanSudbery commented 1 year ago

The log file should still report how many reads were input and how many were output, even without the detailed stats you get from --output-stats.
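If you prefer to compute it directly from the BAM files, something along these lines with samtools should work (a sketch only; it counts primary mapped alignments before and after dedup, with the file names used in this thread):

$ before=$(samtools view -c -F 0x904 Sample.sort.bam)   # primary mapped alignments, input
$ after=$(samtools view -c -F 0x904 deduplicated.bam)   # primary mapped alignments, deduplicated
$ echo "1 - $after/$before" | bc -l                      # duplication rate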

If you do want the more detailed stats, I suggest running on only a single chromosome. You can do this with --chrom=chr22 for example.
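For example, something like this (output names are just illustrative, and the chromosome name depends on your reference):

$ umi_tools dedup -I Sample.sort.bam --log=log_chr22.out --paired --chrom=chr22 --output-stats=deduplicated_chr22 -S deduplicated_chr22.bam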