CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

UMI dedup #544

Closed lkerr34 closed 2 years ago

lkerr34 commented 2 years ago

Hi, and thank you for such a useful tool!

I have used umi_tools extract to remove UMIs from some data, and this seems to have been successful. I then used Trim Galore to trim the reads, Bismark to align them, and samtools to sort and index the alignment.

However, when I use umi_tools dedup I am left with only 89 of the original 36986 read pairs. I saw in the documentation that when UMIs are extracted using umi_tools extract, the UMI is the last word in the read name. I have attached snippets of the output from umi_tools extract and of my aligned file. In the output from umi_tools extract the UMI is indeed at the end of the read name, but in the alignment file the UMI is the 8 characters before the "_1:N:0..." in the read name, so perhaps this is the problem. I have also attached the output I get from umi_tools dedup. Any help you could provide with this issue would be really appreciated!

Thanks! Lyndsay

[Three attached images: umi_tools extract output, aligned file, umi_tools dedup output]

IanSudbery commented 2 years ago

Yes. It appears that bismark is altering the read names: it attaches the 1:N:0 part of the read header to the rest of the read name, so the UMI is no longer at the end and UMI-tools can't find it. My feeling is that it would be best to remove this part of the header, and probably the easiest place to do that is before alignment. I think:

$ zcat input_reads.fastq.gz | sed -E 's/ [12]:N:0:.+//' | gzip > reprocessed.fastq.gz

should do the trick, but it would be worth checking the output.
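
One way to check the output is to run the same sed expression on a tiny test file first. The sketch below is only an illustration: the read names, UMIs and index sequence are made up, the files live in a temporary directory, and `gzip -dc` is used in place of `zcat` purely for portability. It assumes the headers look like `@name_UMI 1:N:0:INDEX`, as in the snippets described above.

```shell
# Build a tiny two-record FASTQ to test on (read names and UMIs are made up).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@read1_ACGTACGT 1:N:0:ATCACG' 'ACGT' '+' 'IIII' \
  '@read2_TTGCAAGC 1:N:0:ATCACG' 'GGCA' '+' 'IIII' \
  | gzip > "$tmpdir/input_reads.fastq.gz"

# Same sed expression as in the command above.
gzip -dc "$tmpdir/input_reads.fastq.gz" \
  | sed -E 's/ [12]:N:0:.+//' \
  | gzip > "$tmpdir/reprocessed.fastq.gz"

# Header lines are every fourth line starting at line 1.
headers=$(gzip -dc "$tmpdir/reprocessed.fastq.gz" | awk 'NR % 4 == 1')
echo "$headers"

# The last "_"-separated field of each header should now be the UMI.
umis=$(gzip -dc "$tmpdir/reprocessed.fastq.gz" | awk -F_ 'NR % 4 == 1 {print $NF}')
echo "$umis"

# The number of lines (and so the number of reads) should be unchanged.
lines_in=$(gzip -dc "$tmpdir/input_reads.fastq.gz" | wc -l | tr -d ' ')
lines_out=$(gzip -dc "$tmpdir/reprocessed.fastq.gz" | wc -l | tr -d ' ')
echo "$lines_in $lines_out"
```

On real data the same two checks apply: the header lines should end in the UMI, and the line count before and after should match.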

lkerr34 commented 2 years ago

Great--thank you so much for your help!