Open skannan4 opened 4 years ago
Since I was looking for something else and found this issue, I could just as well comment :) I have been digging a bit in that part of the code. count creates an intersection of all possible genes from all BUS records belonging to the same unique combination of UMI and barcode (i.e. the gene). When records map to completely different genes, the interesection likely becomes empty. This will lead to that they are not counted. In the code, this is a loop over the genes, that will never be entered since the set is empty.
Thanks for noting this! But then, am I wrong to think that this might erroneously throw out unique molecules in cases where the UMI is short? (Also, I need to do some more digging, because even in my experiments with 10bp UMI, which should be long enough to avoid this issue, I see significant discrepency between inspect
and count
, and I am not really sure why - though what you pointed out gives at least one possible explanation for the difference).
I think you are right, and I think it can be corrected for if you have enough reads per UMI. I'm thinking there are two cases when you have two (or more) groups of mappings (i.e. one group is mapping to one gene, the other to another, but they all have the same UMI and cell barcode):
I am looking into a dataset with a lot of reads per cell, and there it looks like quite a few molecules are lost; the effect worsens with more reads though. But this is 10x with UMI length of 10, so the second effect is likely small (one in a million...)
I've faced with a very similar problem when the ratio UMI/barcode is significantly, more than in two orders, differed between bustools inspect
and bustools count
. The same dataset treated with salmon alevin
has a much more reliable result. This discrepancy is not observed with small subsample (2.5k reads) but obviously can be seen with slightly larger subsamble (1M reads).
I've also tested the effect of UMI length on bus count
(by changing kallisto bus -x 0,9,12:0,2,9:1,1,50
-> -x 0,9,12:0,0,9:1,1,50
, so the length of UMI is changing 7 -> 9 nt), and indeed there was a slight, 3x-5x, increase of UMI/barcode ratio after increasing the UMI size. But high discperancy b/w 'bustools inspect' and bustools count
is still observed.
I note the same thing, which is even worsened with 6 bp UMIs. Let me know if you find a work-around.
I note the same thing, which is even worsened with 6 bp UMIs. Let me know if you find a work-around.
So for cases where it was a major problem, and where I didn't need multimappers, I just ended up manually counting. This was using the previous version of kallisto|bustools (haven't tested the new one yet), but I used bustools project followed by bustools text to get all of the mappings. I then loaded into R and counted there (basically using dcast). I do hope the authors address this issue tho - I would trust a true modification from them rather than this workaround from me.
For several datasets I have pulled from the literature, the median UMIs per barcode predicted by
bustools inspect
is much higher than the actual median UMIs per barcode in the output ofbustools count
. For example, in one dataset, the predicted median was 25K UMIs/cell, but the final sparse matrix has a median of ~8K UMIs/cell.I used
bustools text
to explore the sorted bus data a bit more carefully. I filtered out all ECs that could map to more than one gene to make sure it wasn't a multi-mapping issue. I then observed that for many of the barcodes, there were repeating UMIs mapping to completely different genes. Because the UMIs for these datasets are 8bp, I suspect that the same UMI bound to completely different molecules. Indeed, the discrepancy betweeninspect
andcounts
is much higher for these datasets w/8bp than those w/10 or 12bp UMI. Completely removing UMIs that repeat for a given barcode seems to bring the number of UMIs/cell closer to thebustools count
output.While
inspect
appears to calculate the number of unique UMIs per barcode, I'm not sure I understand howcount
handles them - presumably, these are "bad" reads that are tossed? (Unfortunately, the verbal output ofcounts
, e.g. bad, rescued, compacted wasn't available in my version). If that's the case, I wonder ifcount
(andinspect
, for that matter) are underestimating the number of unique molecules in the dataset. I might also be off-track here - happy to provide the bus files as a reproducible example if it would be helpful.