Closed: miktrinh closed this issue 4 years ago
Thanks for noting the odd behaviour with self.start and start @mitrinh1. Having looked into this, it turns out this was an oversight on my behalf when the code was refactored to deal with single-cell RNA-Seq data. Since self.start is not updated, bundles don't get yielded until the contig changes. Whilst this ensures that no reads are missed, it is not the intended default behaviour, which is to yield bundles when the current read position is 1000 bp downstream of the bundle. In essence, I've done the equivalent of hard-coding the --buffer-whole-contig option to be on, which will have a negative impact on memory usage.
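For anyone following along, here is a minimal sketch of the intended flush condition described above. The function name, parameters and the hard-coded 1000 bp threshold simply mirror the description in this comment; this is not the actual get_bundles code:

```python
# Sketch of the intended bundle-flushing condition (illustrative only;
# not the actual umi_tools get_bundles implementation).

def should_flush_bundles(current_contig, current_start,
                         last_contig, bundle_start,
                         buffer_whole_contig=False):
    """Return True when it is safe to yield the buffered bundles.

    Bundles are flushed when the contig changes, or (unless
    --buffer-whole-contig is set) when the current read starts more
    than 1000 bp downstream of the position the bundle was opened at.
    """
    if current_contig != last_contig:
        return True
    if buffer_whole_contig:
        return False
    return current_start > bundle_start + 1000
```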
I'll run a few tests to see how big the impact might be and patch this promptly!
On your other points, you're correct that pos takes account of soft clipping and that edit_distance calculates the Hamming distance.
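For reference, the Hamming distance between two equal-length UMIs is just the number of positions at which they differ. A pure-Python sketch of that calculation (not the compiled umi_tools._dedup_umi code, which is confirmed above to compute the same quantity):

```python
def hamming_distance(umi_a: bytes, umi_b: bytes) -> int:
    """Number of positions at which two equal-length UMIs differ."""
    if len(umi_a) != len(umi_b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(umi_a, umi_b))

# Example: two UMIs differing at a single base have distance 1.
assert hamming_distance(b"ATTCGG", b"ATTCGC") == 1
```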
For completeness, here is the amount of memory required to process a test input (~2.5M paired-end reads) using the 'current' code and the 'updated' code, where self.start updates properly and bundles are yielded as intended. With the intended behaviour, there's a small but clear reduction in memory use, as expected.
| code | buffer_whole_contig | mem (KB) |
| --- | --- | --- |
| Current | false | 1782296 |
| Current | true | 1754400 |
| Updated | false | 1381940 |
| Updated | true | 1754756 |
Thank you so much for this! It is much appreciated!
Sorry, could you please quickly explain the rationale behind the differences in calculating self.start and pos for forward vs reverse reads in sam_methods.get_read_position()? It seems like for reverse reads, pos is taken as the end of the alignment, whereas for forward reads, pos is where the alignment starts (i.e. self.start + the soft-clipped bases). Why is this?
Sure. There is no mystery to it - it would be better to calculate an adjusted position for the reverse read as well, but we can't, because we don't have the CIGAR string for the reverse read at the point where we need to calculate it.
Oh, sorry, you mean forward vs reverse, not read1 vs read2.
Yes, as in forward and reverse alignments for a single-end dataset, not paired-end.
Why do we calculate start and pos separately? We consider that if a read is 5' trimmed, this is more likely due to an error in the read than anything else. Thus the template that generated the read started at base 1 of the read, even if pos (the position recorded in the BAM file, and the order in which reads are sorted) suggests otherwise. Now think about "how do we know when we've seen all the reads that map to a position?" Consider the following three reads:
Genome: 5'-------------------------------------------------------3'
read A: |>>>>>>>>>>>>>>|
read B: 5'|>>>>>>>>>>>>>>>>|3'
read C: ----|>>>>>>>>>>>|
We walk through the reads in the BAM in order. When we visit B, we see that we have advanced - the start position (both clipped and unclipped) is further on than A's. We might naively assume that, because B comes after A and reads are sorted, we've seen all the reads that have the same start as A. But read C shows why that would be wrong: its recorded position is downstream of A's, yet once the soft-clipped bases (the dashes) are added back, its corrected start may be no further on than A's. So instead we wait until we have seen reads some distance downstream before being confident we've seen everything.
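To make the forward-read case concrete, here is a small sketch using pysam-style attributes of how the clipped bases can be added back to give a corrected start (purely illustrative; not the actual umi_tools code):

```python
# Sketch: clipping-corrected 5' start of a *forward* read, using pysam-style
# attributes (illustrative; not the umi_tools sam_methods code itself).

def corrected_forward_start(read):
    """Shift the recorded position upstream by any 5' soft-clipped bases.

    CIGAR operation 4 is a soft clip ('S'); if the read begins with one,
    the template really started that many bases before read.reference_start.
    """
    start = read.reference_start
    cigar = read.cigartuples
    if cigar and cigar[0][0] == 4:   # leading soft clip
        start -= cigar[0][1]
    return start
```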
Now consider a reverse read:
Genome: 5'-------------------------------------------------------3'
read A: |>>>>>>>>>>>>>>|
read D: |<<<<<|-----------------------------------|<<<|
read C: -----|>>>>>>>>>>|
^ start ^ pos
When grouping reads together for deduplication purposes, it's the 5' end of the read that matters (which for a reverse read is at the 3' end of the alignment).
But we don't want to use this to decide whether we've seen enough reads to be confident we've seen everything associated with A, because BAM files are ordered by the 5'-most genomic coordinate, not by the read start position. So we use the genomic start (the number recorded in read.pos) for this purpose.
Now, why don't we adjust that for soft clipping etc.? Our assumption is that while soft clipping at the 5' end of the read is most likely due to sequencing errors, soft clipping at the 3' end could be due to a number of things, including untrimmed adaptors, so the uncorrected position is more reliable.
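To summarise the distinction in code, here is a minimal sketch using pysam-style attributes. The function name and structure are mine for illustration and differ from the actual get_read_position implementation:

```python
# Sketch of the two positions discussed above (illustrative only).
#
#   * grouping position: the (soft-clip corrected) 5' end of the read,
#     used to group reads/UMIs for deduplication.
#   * sort position: the uncorrected leftmost coordinate, i.e. the order
#     reads appear in a coordinate-sorted BAM, used to decide when it is
#     safe to yield a bundle.

SOFT_CLIP = 4  # CIGAR operation code for 'S'

def grouping_and_sort_positions(read):
    cigar = read.cigartuples
    sort_pos = read.reference_start

    if read.is_reverse:
        # The 5' end of a reverse read is at the 3' end of the alignment
        # (pysam's reference_end is one past the last aligned base);
        # add back any trailing soft clip.
        group_pos = read.reference_end
        if cigar and cigar[-1][0] == SOFT_CLIP:
            group_pos += cigar[-1][1]
    else:
        # The 5' end of a forward read; subtract any leading soft clip.
        group_pos = read.reference_start
        if cigar and cigar[0][0] == SOFT_CLIP:
            group_pos -= cigar[0][1]

    return group_pos, sort_pos
```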
Hope that makes some sense.
Thank you so much for the detailed explanation! This is very clear and helpful, and I have understood the algorithm for read grouping now :) 👍
Hi all, thanks so much for developing this great tool!
I have some questions regarding the code of the sam_methods.get_bundles class:
Also, am I correct in saying that the umi_tools._dedup_umi.edit_distance method calculates the Hamming distance?
Thank you in advance for your kind help! It is much appreciated!