CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

sam_methods.get_bundle code #428

Closed miktrinh closed 4 years ago

miktrinh commented 4 years ago

Hi all, thanks so much for developing this great tool!

I have some questions regarding the code of sam_methods.get_bundles class:

Also, am I correct in saying that the umi_tools._dedup_umi.edit_distance method calculates hamming distance?

Thank you in advance for your kind help! It is much appreciated!

TomSmithCGAT commented 4 years ago

Thanks for noting the odd behaviour with self.start and start @mitrinh1. Having looked into this, it turns out this was an oversight on my behalf when the code was refactored to deal with single cell RNA-Seq data. Since self.start is not updated, this also means that bundles don't get yielded until the contig changes which is not the intended behaviour. Whilst this behaviour ensures that no reads are missed, this is not the intended default behaviour, which is to yield bundles when the current read position is 1000 bp downstream of the bundle. In essence, I've done the equivalent hard-coding the --buffer-whole-contig option to be on, which will have a negative impact on memory usage.

I'll run a few tests to see how big the impact might be and patch this promtly!

TomSmithCGAT commented 4 years ago

On your other points, you're correct that pos takes account of soft clipping and that edit_distance calculates the hamming distance

TomSmithCGAT commented 4 years ago

For completeness, here is the amount of memory required to process a test input (~2.5M paired end) using the 'current' code and the 'updated' code where self.start updates properly and bundles are yielded as intended. With the intended behaviour, there's a small but clear reduction in memory use, as expected.

code buffer_whole_contig mem (kb)
Current false 1782296
Current true 1754400
Updated false 1381940
Updated true 1754756
miktrinh commented 4 years ago

Thank you so much for this! It is much appreciated!

miktrinh commented 4 years ago

Sorry, could you please quickly explain the rationale behind the differences in calculating self.start and pos for forward vs reverse reads in sam_methods.get_read_position()? It seems like for reverse reads, pos is considered as the end of alignment, whereas for forward reads, pos is where the alignment starts (ie. self.start + soft.clipped). Why is this?

IanSudbery commented 4 years ago

Sure. There is no mystory to it - it would be better to calculate an adjusted position for the reverse read as well, but we can't, because we don't have the CIGAR string for the reverse read at the point where we need to calculate it.

IanSudbery commented 4 years ago

Oh, sorry, you mean forward vs reverse, not read1 vs read2.

miktrinh commented 4 years ago

yes, as in forward and reverse alignments for single-end dataset, not paired-end

IanSudbery commented 4 years ago

Why do we calculate start and pos seperately? We consider that if a read is 5' trimmed, this is more likely due to an error in the read than anything else. Thus the template that generated the read started at base 1 of the read, even if pos, the position recorded in the BAM file, and the order in which reads are sorted. Now think about "how do we know when we've seen all the reads that map to a position. Consider the following three reads:

Genome: 5'-------------------------------------------------------3'
read A:          |>>>>>>>>>>>>>>|
read B:            5'|>>>>>>>>>>>>>>>>|3'
read C:          ----|>>>>>>>>>>>|

We walk though the reads in the BAM in order. When we visit B, we see that we have advanced - the start position (both clipped and unclipped) is further on that A. We might naively assume that because B is after A, and reads are sorted, we've seen all the reads that have the same start as A, so we wait until we have seen some distance down stream before being confident we've seen everything.

Now consider a reverse read:

 Genome: 5'-------------------------------------------------------3'
 read A:          |>>>>>>>>>>>>>>|
 read D:            |<<<<<|-----------------------------------|<<<|
 read C:          -----|>>>>>>>>>>|
                    ^ start                                       ^ pos

When grouping reads together for deduplication purposes, its the 5' end of the read that matters (which for a reverse read is at the 3' end of the alignment).

But we don't want to use the this to decide if we've seen enough reads to be confident we've seen everything associated with A to decide its not duplicated yet, because BAM files are ordered by the 5' most coordinate, not by the read start position. So we use the genomic start (the number recorded in read.pos) for this purpose.

Now, why don't we adjust that for softclipping etc? Our assumption is that while softclipping at the 5' end fo the read is most likely due to sequencing errors, softclipping at the 3' end could be due to a number of things, including untrimmed adaptors etc, so the uncorrected position is more reliable.

Hope that makes some sense.

miktrinh commented 4 years ago

Thank you so much for the detailed explanation! This is very clear and helpful, and I have understood the algorithm for read grouping now :) 👍