BUStools / bustools

Tools for working with BUS files
https://bustools.github.io/
BSD 2-Clause "Simplified" License
92 stars 23 forks source link

Discrepancy between bustools inspect and bustools count, particularly with short UMI sample #44

Open skannan4 opened 4 years ago

skannan4 commented 4 years ago

For several datasets I have pulled from the literature, the median UMIs per barcode predicted by bustools inspect is much higher than the actual median UMIs per barcode in the output of bustools count. For example, in one dataset, the predicted median was 25K UMIs/cell, but the final sparse matrix has a median of ~8K UMIs/cell.

I used bustools text to explore the sorted bus data a bit more carefully. I filtered out all ECs that could map to more than one gene to make sure it wasn't a multi-mapping issue. I then observed that for many of the barcodes, there were repeating UMIs mapping to completely different genes. Because the UMIs for these datasets are 8bp, I suspect that the same UMI bound to completely different molecules. Indeed, the discrepancy between inspect and counts is much higher for these datasets w/8bp than those w/10 or 12bp UMI. Completely removing UMIs that repeat for a given barcode seems to bring the number of UMIs/cell closer to the bustools count output.

While inspect appears to calculate the number of unique UMIs per barcode, I'm not sure I understand how count handles them - presumably, these are "bad" reads that are tossed? (Unfortunately, the verbal output of counts, e.g. bad, rescued, compacted wasn't available in my version). If that's the case, I wonder if count (and inspect, for that matter) are underestimating the number of unique molecules in the dataset. I might also be off-track here - happy to provide the bus files as a reproducible example if it would be helpful.

johan-gson commented 4 years ago

Since I was looking for something else and found this issue, I could just as well comment :) I have been digging a bit in that part of the code. count creates an intersection of all possible genes from all BUS records belonging to the same unique combination of UMI and barcode (i.e. the gene). When records map to completely different genes, the interesection likely becomes empty. This will lead to that they are not counted. In the code, this is a loop over the genes, that will never be entered since the set is empty.

skannan4 commented 4 years ago

Thanks for noting this! But then, am I wrong to think that this might erroneously throw out unique molecules in cases where the UMI is short? (Also, I need to do some more digging, because even in my experiments with 10bp UMI, which should be long enough to avoid this issue, I see significant discrepency between inspect and count, and I am not really sure why - though what you pointed out gives at least one possible explanation for the difference).

johan-gson commented 4 years ago

I think you are right, and I think it can be corrected for if you have enough reads per UMI. I'm thinking there are two cases when you have two (or more) groups of mappings (i.e. one group is mapping to one gene, the other to another, but they all have the same UMI and cell barcode):

  1. Almost all reads map to one gene, one to a different genes - in this case it is likely a mutation in the pcr or a faulty read. Discarding all (which is done now) is probably a bad choice, only the small group should be discarded.
  2. You have two groups of somewhat equal size (and short UMIs) - this suggests that the two groups should be treated as separate original molecules, meaning that it should result in two UMI counts instead of 0.

I am looking into a dataset with a lot of reads per cell, and there it looks like quite a few molecules are lost; the effect worsens with more reads though. But this is 10x with UMI length of 10, so the second effect is likely small (one in a million...)

DrHogart commented 4 years ago

I've faced with a very similar problem when the ratio UMI/barcode is significantly, more than in two orders, differed between bustools inspect and bustools count. The same dataset treated with salmon alevin has a much more reliable result. This discrepancy is not observed with small subsample (2.5k reads) but obviously can be seen with slightly larger subsamble (1M reads).

I've also tested the effect of UMI length on bus count (by changing kallisto bus -x 0,9,12:0,2,9:1,1,50 -> -x 0,9,12:0,0,9:1,1,50, so the length of UMI is changing 7 -> 9 nt), and indeed there was a slight, 3x-5x, increase of UMI/barcode ratio after increasing the UMI size. But high discperancy b/w 'bustools inspect' and bustools count is still observed.

droplet-lab commented 4 years ago

I note the same thing, which is even worsened with 6 bp UMIs. Let me know if you find a work-around.

skannan4 commented 4 years ago

I note the same thing, which is even worsened with 6 bp UMIs. Let me know if you find a work-around.

So for cases where it was a major problem, and where I didn't need multimappers, I just ended up manually counting. This was using the previous version of kallisto|bustools (haven't tested the new one yet), but I used bustools project followed by bustools text to get all of the mappings. I then loaded into R and counted there (basically using dcast). I do hope the authors address this issue tho - I would trust a true modification from them rather than this workaround from me.