Closed LeoVincenzi closed 1 year ago
Hi Leo,
The problem is that dedup
requires that reads in pairs have the same name, and that the read1 and read2 read names both contain the UMI for that pair (irrepsective of whether it was found in the read sequence of only read1, only read2 or in both).
To do this, you need to provide the file name for both the read1 and read2 files to extract
. You also need to run dedup
in paired mode with --paired
.
So you would extract with:
$ umi_tools extract -I trimmed2.fastq.gz --bc-pattern=NNNNNNNN --log=processed.log -S temp_R2.fastq.gz --read2-in=trimmed1.fastq.gz --read2-out=temp_R2.fastq.gz
and then the dedup would be:
$ umi_tools dedup -I Sample.sort.bam --log=log.out --paired -S deduplicated.bam
Note that we don't recommend running --output-stats
on full bam
s because it is very time consuming and memory hungry.
Great! That works very fast, indeed. Once I have performed the deduplication, how can I compute the duplication rate of my input bam file?
The log file should still output how many reads were input, and how many output, even without the detailed stats you get from --output-stats
.
If you do want the more detailed stats, I suggest running on only a single chromosome. You can do this with --chrom=chr22
for example.
Dear developers, I'm facing issues in applying umi_tools dedup on a bam alignment. I'm using data from a Takara SMARTer Stranded Total RNA-Seq Kit v3 library, so the UMI sequences are present only on the Read2 and the manual reports that 'The first 8 nt of the second sequencing read (Read 2) are UMIs followed by 3 nucleotides of UMI-linker and 3 nucleotides derived from the Pico v3 SMART UMI Adapter'.
After having performed the trimming with cutadapt of the Illumina adpters, I performed the extraction of the umi with umi_tools extract as follows:
umi_tools extract -I trimmed2.fastq.gz --bc-pattern=NNNNNNNN --log=processed.log -S temp_R2.fastq.gz
reformat.sh -Xmx2g in=temp_R2.fastq.gz out=trimmed2.UMI.fastq.gz forcetrimleft=6
I have then aligned trimmed1.fastq.gz and trimmed2.UMI.fastq.gz over a reference geneome with bowtie2 and then I wanted to perform deduplication, so I run:
umi_tools dedup -I Sample.sort.bam --log=log.out --output-stats=deduplicated -S deduplicated.bam
Anyway, I get the following error:If it helps, Ihere is how the bam file is formatted:
A01083:231:HTCHVDSX5:4:2573:21423:21778 101 LR862356.1 10182 0 * = 10182 0 CTCATACCGGGCTTCCGCTTCCGCTCGCTGTCCTCGTCCAGGTCTTCGTTCAGGGACCGTCCTGATCGCGATTTCCCAGTCAGCTGTCATCAAGTGGAAAATGGTGAGAAACCACGTGCGTGGCTTCCACACGCTCGGCGCGCACTAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF YT:Z:UP A01083:231:HTCHVDSX5:4:1121:25843:19069_ACACGAGT 153 LR862356.1 10182 0 92M2I14M1I25M = 10182 0 TCGTTCAGGGACCGTCCTGATCGCGATTTCCCAGTCAGCTGTCATCAAGTGGAAAATGGTGAGAAACCACGTGCGTGGCTTCCACACGCTCGGCGCGCACTAGGCAGCTGAGGAGGAGCAGCGAGGCAGTGAGA FFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-71 XN:i:0 XM:i:11 XO:i:2 XG:i:3 NM:i:14 MD:Z:32C10G0C4A6C0A11A16T2A9G7G23 YT:Z:UP
Any help would be apppreciated to better understand what I am facing. Thanks