CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
481 stars 190 forks source link

Unexpected difference between dedup with and without per-cell #540

Closed chrarnold closed 2 years ago

chrarnold commented 2 years ago

Hi, we are facing an unexpected result for the following case: We have a sample subject to dedup that we prepared with extract. Read headers look like this: VH00821:1:AAAVV2HM5:1:1211:52732:46435_1:N:0:TGTGCA_TCCATTCGG_TGGTCT

Here, TGGTCT is the UMI and TCCATTCGG the cell barcode (here not cell per se, but doesnt matter).

We run dedup twice, once with per-cell (indicating the deduplication should be done per cell barcode and not globally) and once without to compare, default parameters otherwise. We expected to see additional reads surviving for the per-cell version, and all reads that survived the default deduplication without per-cell to survive also. The latter, however is not true: 1.8M reads appear only in the version without per-cell, how can this be? Under which scenario would a read survive the default dedup filtering but NOT the per-cell dedup?

We would like to understand, is this a bug? I did not find any information in the help or online.

Thanks!

IanSudbery commented 2 years ago

This surprised me when I read you message, but actaully I can think of situations where this would happen.

Firstly, are we talking about reads or about umis? If we are talking about reads: where UMItools finds two equivalent reads, it will pick one at random. Thus the results, at the read level, are not reproductible unless a random seed is set, and python is configured to make its dictionary lookup deterministic (it is not by default)

Secondly, if we are talking about UMIs, for each cluster of related UMIs, UMItools selects the most highly counted UMI to be the representative of that cluster. Its possible that an individual umi is not the mostly highly represented UMI in any individual cell, but when all the cells are added together, it is the most representative in the combined list.

Consider 4 cells:

cell1:        
AAAA(x2)-AAAT(x1)

cell2:
AATT(x2)-AAAT(x1)    

cell3:
ATAT(x2)-AAAT(x1)

cell4:
TAAT(x2)-AAAT(x1)

In a per-cell dedup, AAAA, AATT, ATAT and TAAT would be chosen. But without per-cell then the combined counts would be:

AAAT(x4)
AAAA(x2)
AATT(x2)
ATAT(x2)
TAAT(x2)

and AAAT would be chosen.

chrarnold commented 2 years ago

Thank you for this super prompt and very nice reply! I did not think about the randomness when picking a read, this may explain it well. This also means running dedup a 2nd time for the same parameters may produce a different result if not setting any seed beforehand, correct? That is good to know. In this case, we talk about reads, the default --extract-umi-method read_id as I didnt specify this in the command line call at all.