CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
481 stars 190 forks source link

Umitools Deduplication with STAR Alignment #637

Closed hserio closed 5 months ago

hserio commented 6 months ago

I am running into some issues trying to deduplicate Illumina paired-end reads after aligning with STAR. I have ran the same fastq files through hisat2 aligner and successfully deduplicated, so I am unsure what the issue is. I am using samtools sort and samtools index on my alignment files prior to deduplicating. Any help with this would be greatly appreciated!

This is what my code looks like:

umi_tools dedup -I "$BAM" --output-stats="$OUTPUT.stats" --paired -S "$OUTPUT"

This is the error I am getting:

Traceback (most recent call last): File "/umi_tools", line 11, in sys.exit(main()) File "/umi_tools/umi_tools.py", line 61, in main module.main(sys.argv) File "/umi_tools/dedup.py", line 262, in main for bundle, key, status in bundle_iterator(inreads): File "/umi_tools/sam_methods.py", line 375, in call read.reference_name != read.next_reference_name): File "pysam/calignedsegment.pyx", line 941, in pysam.calignedsegment.AlignedSegment.next_reference_name.get File "pysam/calignmentfile.pyx", line 1653, in pysam.calignmentfile.AlignmentFile.getrname File "pysam/calignmentfile.pyx", line 678, in pysam.calignmentfile.AlignmentFile.get_reference_name ValueError: reference_id -1 out of range 0<=tid<194

TomSmithCGAT commented 6 months ago

Hmmm... that's an odd error. I think that means the read is unmapped, in which case, it should stop working on that read here:

https://github.com/CGATOxford/UMI-tools/blame/7e799bc120f185128e3983cbc328e180a0b6b263/umi_tools/sam_methods.py#L351-L364

I note that your line numbers don't match the master branch (read.reference_name != read.next_reference_name): should be line 378 in sam_methods.py) and the file hasn't been changed in nearly 2 years, so I suspect you're on an old version of umi_tools. Could you please post the version of umi_tools and pysam you are using.

If you could generate and share a small example BAM file which throws this error, that will also speed up the debugging.

hserio commented 6 months ago

I am using umi_tools version 1.0.1 and pysam version 0.9.1 so maybe that could be the issue? When I use 'samtools view -c -f 4 <.bam>' there are 0 unmapped reads present.

Here is a link to an example .bam file that pulls the same error: https://drive.google.com/file/d/17oaR3Iqz4Nxj2lvlLkPxOmKNUIRsXxq2/view?usp=share_link

IanSudbery commented 6 months ago

This happens when a read is marked as mapped, but not assigned to a contig. Some aligners do this I think when the alignment go off the end of one contig and on to another (the genome is one continuous string internally.

Or i might be misremembering that I have seen examples of this before, although I'm not sure I remember where....

hserio commented 5 months ago

Updating umi-tools to 1.1.5 actually solved this issue! Thank you so much for the help.