GWW / scsnv

scSNV Mapping tool for 10X Single Cell Data
MIT License

Add Per-UMI Read Coverage Information #26

Closed Alf-Kenz closed 1 year ago

Alf-Kenz commented 1 year ago

Hi,

This isn't an issue, more of a question, but I thought it might as well be public in case it helps anyone else :-). I'm sorry it's so long; I was trying not to take up your time with a long back-and-forth.

I've been reading the code through pretty carefully, and it's really very nice and elegant! One thing I was thinking about was getting some sense of the raw read coverage per SNV, especially separated by UMI. For a given SNV, I had some hypothetical scenarios with decreasing amounts of 'gut-level' confidence:

  1. 3 distinct UMIs, where each has 2+ reads covering the base, all of which agree on the mutation
  2. 2 distinct UMIs, where each has only 1 read covering the base, but those two 'independent' reads agree
  3. 1 UMI with, say, 15 reads, 14/15 of which agree on a mutation
  4. 1 UMI with 2 reads, both of which agree

I feel like, as a layperson, in the first scenario, even if I only have a few total reads, the distinct UMIs and multiple (albeit few) unanimous reads per UMI give me more confidence it's not a sequencing artifact. But in the 3rd scenario, where I have a lot more reads but a single UMI, it could all stem from an early PCR artifact which then amplified.
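The gut-level ranking above can be captured by a toy heuristic (purely illustrative, not scsnv code): rank first by the number of independent UMIs whose reads mostly agree on the call, then break ties by the total number of agreeing reads.

```python
def support_score(umis):
    """umis: list of (reads_covering_base, reads_agreeing) per distinct UMI.

    A UMI 'supports' the SNV if a clear majority of its reads agree.
    Independent UMIs rank first, total agreeing reads break ties.
    """
    supporting = [(cov, agree) for cov, agree in umis if agree / cov > 0.5]
    return (len(supporting), sum(agree for _, agree in supporting))

# The four scenarios from the list above, in the same order:
scenarios = [
    [(2, 2), (2, 2), (2, 2)],  # 1. three UMIs, 2+ unanimous reads each
    [(1, 1), (1, 1)],          # 2. two UMIs, one read each, agreeing
    [(15, 14)],                # 3. one UMI, 14/15 reads agree
    [(2, 2)],                  # 4. one UMI, two unanimous reads
]
scores = [support_score(s) for s in scenarios]
# scores sorts the scenarios in the same order as the gut-level ranking
```

With this scoring, an early PCR artifact that amplified into many reads of a single UMI (scenario 3) still ranks below two truly independent molecules (scenario 2).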

In the collapsing stage it seems like that information is lost, but spelunking through the code, I noticed the commented-out coverage metric, which seemed to be exactly in this vein. Would it be possible for you to give more context for why it was ultimately removed? I added it back in (while also keeping track of the total number of reads for that base+UMI, just in case) here, and then added some commits (here) in that same branch to propagate that information forward through the pileup. Does that seem like a reasonable approach?

I was also interested in keeping the UMIs separate for as long as possible, so my successive pileup analysis could consider the information each independent UMI gives. Do you have any advice for how to do that? What I came up with here was to pass in a list of unique barcode+"_"+UMI strings instead of passed_barcodes.txt.gz, and then do that same concatenation in the filter function. But that list is huge, many gigabytes. I tried to modify the BarcodeHash while reading in the bam whenever a white-listed barcode has a new UMI, but I think it broke because of the threading. I could maybe use the parallel hashmap? Or do you think I'd be better off still collapsing by barcode, but modifying some of the code to aggregate the individual UMI stats instead?

Any thoughts you have are really appreciated, I'm excited to use this program on a lot of data!

GWW commented 1 year ago

Hi @Alf-Kenz

I removed the coverage information because it was redundant with the XR tag, which gives the 0-based start and end position of the read and its CIGAR string. I can probably add the original bam flag and generate an MD tag for each read as well. This would let you recreate the sequence of each read from a given collapsed read. It wouldn't include the quality information, though. You could probably parse each of those tags to recreate all of the reads for additional analyses.

I believe I have some C++ code somewhere to generate an MD tag from a given read alignment. I'll see if I can integrate it and let you know.
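For context, the MD tag encodes match-run lengths interleaved with mismatched reference bases (deletions are prefixed with `^`). A minimal Python sketch of generating one for the simplest case, a fully aligned read with an M-only CIGAR and no indels; a real implementation (like the C++ code mentioned above) must also walk the CIGAR to skip insertions and emit deleted reference bases:

```python
def make_md_tag(read_seq, ref_seq):
    """Build an MD tag for an indel-free, fully aligned read.

    The MD string is alternating match counts and mismatched
    *reference* bases, e.g. read ACGA vs ref ACGT -> '3T0'.
    """
    assert len(read_seq) == len(ref_seq)
    parts, run = [], 0
    for ref_base, read_base in zip(ref_seq, read_seq):
        if ref_base == read_base:
            run += 1
        else:
            parts.append(str(run))   # match count before the mismatch (may be 0)
            parts.append(ref_base)   # MD records the reference base, not the read's
            run = 0
    parts.append(str(run))
    return "".join(parts)
```

Combined with the XR tag's start position and CIGAR, a tag like this is enough to reconstruct each constituent read's sequence from the collapsed read, as described above.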

GWW commented 1 year ago

Another idea I could implement would be to add an option to emit a special bam file that is grouped by UMI, i.e. the collapsed read, followed by all of the reads merged into it. The sorting of this file would be "wonky" because it would be ordered by gene and not compatible with other pileup methods that expect a sorted file.

Alf-Kenz commented 1 year ago

Hi @GWW,

That all makes sense to me, and the quality point is well taken. The way the current collapse takes the max qual made sense to me for the small number of reads per UMI per base I'm currently seeing, but it definitely could be very helpful in modeling the SNV confidence to also know the sequencing quality.

I'm happy to do some of the legwork on integrating the MD tag if it helps. As for the special bam, I agree that in some of the final QC stages for a putative SNV it would be very helpful to go back to the raw reads, especially because, with the small UMI/cell-barcode mutations, it can be hard to get to the exact reads being collapsed.

I am also curious about your thoughts on keeping the UMIs separate all the way through the pileup, thinking more about the *_barcode_matrices.h5 sparse mtx file than the collapsed pileup file. If I pass each 'barcode' as cell_barcode+UMI barcode, then, since the reads are collapsed beforehand by UMI, there is basically no pileup.h5 grouping (e.g. h5 file -> base_T -> (.minus + .plus) = 1 for all rows). So I can then get each individual barcode+UMI's read coverage from above, but everything is way slower and unwieldy, and I have to be careful about filtering.

Do you think there's a path forward to still grouping within each of the cell barcodes, but propagating each barcode's array of per-UMI coverage (and potentially quality score at a given base) through? Or is it best to just keep it split and deal with the slowdown and huge hash tables? I can cram it through by modifying some of the objects to keep propagating, but maybe that's a worse hack.
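One shape the "group by cell barcode but keep per-UMI stats" idea could take, as a rough Python sketch (all names here are hypothetical, not scsnv's actual objects): a single hash table keyed by cell barcode whose values map UMI to per-base support, so the outer table stays cell-barcode sized rather than barcode+UMI sized.

```python
from collections import defaultdict

class BaseSupport:
    """Per-UMI support for one candidate base: coverage, agreement, best qual."""
    __slots__ = ("coverage", "agreeing", "max_qual")

    def __init__(self):
        self.coverage = 0
        self.agreeing = 0
        self.max_qual = 0

def add_read(table, barcode, umi, base_matches_snv, qual):
    # Outer key: cell barcode. Inner key: UMI. The inner dict is small,
    # so collapsing per barcode is just a fold over a few entries.
    support = table[barcode].setdefault(umi, BaseSupport())
    support.coverage += 1
    if base_matches_snv:
        support.agreeing += 1
        support.max_qual = max(support.max_qual, qual)

table = defaultdict(dict)
add_read(table, "AAACCC", "UMI1", True, 30)
add_read(table, "AAACCC", "UMI1", True, 37)
add_read(table, "AAACCC", "UMI2", True, 25)
umi_stats = table["AAACCC"]  # per-UMI coverage + max qual, one barcode entry
```

This keeps the per-UMI arrays available downstream without the gigabyte-scale concatenated barcode+"_"+UMI key list.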

Alf-Kenz commented 1 year ago

Oh, and of course I can't say it enough: thanks for responding so fast and so nicely!

GWW commented 1 year ago

I am happy to help. I am glad people are using my tool.

The read collapsing may not be entirely necessary for your use case. The primary use for it was to look at multiple SNVs / RNA edits in close proximity, etc.

I could probably add a command to pileup a specific region (or the whole file) from the merged.bam file and merge the UMI data to give you per barcode/molecule information. Perhaps a modified sparse matrix format, such as barcode_id:<base 1><qual 1><base 2><qual 2>;<base 1><qual 1><base 2><qual 2>, where each semicolon-separated list is from a single molecule.
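Since that format is only a proposal in this thread, the details below are assumptions, but a parser for it could look something like this: one semicolon-separated group per molecule, each group a flat run of (base, qual) character pairs, with quality encoded as Phred+33 like FASTQ.

```python
def parse_entry(entry):
    """Parse 'barcode_id:<base><qual>...;<base><qual>...' (hypothetical format).

    Returns (barcode_id, molecules) where molecules is a list of
    [(base, phred_qual), ...], one inner list per molecule.
    """
    barcode_id, payload = entry.split(":", 1)
    molecules = []
    for group in payload.split(";"):
        # Consume the group two characters at a time: base, then quality.
        pairs = [(group[i], ord(group[i + 1]) - 33)   # Phred+33, as in FASTQ
                 for i in range(0, len(group), 2)]
        molecules.append(pairs)
    return int(barcode_id), molecules

# Example entry: barcode 17, molecule 1 = A(q30), C(q17); molecule 2 = G(q20)
bc, mols = parse_entry("17:A?C2;G5")
```

Keeping molecules separated by semicolons preserves exactly the per-UMI independence discussed earlier, while staying a compact single-line-per-barcode record.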

The other option would be to make an alternative collapse command that takes a specific region as an argument and provides you with a collapsed read (using the reads overlapping that region) and their individual constituents.