Closed chrarnold closed 2 years ago
This surprised me when I read you message, but actaully I can think of situations where this would happen.
Firstly, are we talking about reads or about umis? If we are talking about reads: where UMItools finds two equivalent reads, it will pick one at random. Thus the results, at the read level, are not reproductible unless a random seed is set, and python is configured to make its dictionary lookup deterministic (it is not by default)
Secondly, if we are talking about UMIs, for each cluster of related UMIs, UMItools selects the most highly counted UMI to be the representative of that cluster. Its possible that an individual umi is not the mostly highly represented UMI in any individual cell, but when all the cells are added together, it is the most representative in the combined list.
Consider 4 cells:
cell1:
AAAA(x2)-AAAT(x1)
cell2:
AATT(x2)-AAAT(x1)
cell3:
ATAT(x2)-AAAT(x1)
cell4:
TAAT(x2)-AAAT(x1)
In a per-cell dedup, AAAA, AATT, ATAT and TAAT would be chosen. But without per-cell then the combined counts would be:
AAAT(x4)
AAAA(x2)
AATT(x2)
ATAT(x2)
TAAT(x2)
and AAAT would be chosen.
Thank you for this super prompt and very nice reply! I did not think about the randomness when picking a read, this may explain it well. This also means running dedup
a 2nd time for the same parameters may produce a different result if not setting any seed beforehand, correct? That is good to know.
In this case, we talk about reads, the default --extract-umi-method read_id
as I didnt specify this in the command line call at all.
Hi, we are facing an unexpected result for the following case: We have a sample subject to
dedup
that we prepared withextract
. Read headers look like this:VH00821:1:AAAVV2HM5:1:1211:52732:46435_1:N:0:TGTGCA_TCCATTCGG_TGGTCT
Here,
TGGTCT
is the UMI andTCCATTCGG
the cell barcode (here not cell per se, but doesnt matter).We run
dedup
twice, once withper-cell
(indicating the deduplication should be done per cell barcode and not globally) and once without to compare, default parameters otherwise. We expected to see additional reads surviving for theper-cell
version, and all reads that survived the default deduplication withoutper-cell
to survive also. The latter, however is not true: 1.8M reads appear only in the version withoutper-cell
, how can this be? Under which scenario would a read survive the default dedup filtering but NOT the per-cell dedup?We would like to understand, is this a bug? I did not find any information in the help or online.
Thanks!