Closed by grst 5 months ago
I'm glad you asked that question... 8-) Not...
Others might provide a more precise answer, but my understanding is that duplicate_count doesn't apply to single-cell 10X-style experiments - see https://github.com/airr-community/airr-standards/issues/543#issuecomment-1034020133
It is for use in non-UMI bulk studies, if I understand correctly.
The different fields stem from trying to capture the different types of counts one might have with techniques as varied as bulk and single-cell sequencing.
See the associated very long threads as to how this was arrived at. Perhaps we need to update the docs to better reflect this?
I am hoping that @javh @scharch might comment on your use of consensus_count and umi_count; it seems right to me, but I am not an expert.
Yes, I think this is correct. Per https://github.com/airr-community/airr-standards/issues/161#issuecomment-967277462, duplicate_count is intended for use when there are no UMIs in the experimental protocol.
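To make the mapping discussed in this thread concrete, here is a minimal sketch of how the three fields could be filled for a single contig in a UMI-based protocol. The function name `count_fields` and the input format (one UMI string per raw read) are assumptions made for illustration; this is not code from Scirpy or the AIRR reference library.

```python
from collections import Counter


def count_fields(read_umis):
    """Summarise the reads supporting one contig.

    read_umis: one UMI string per raw read (hypothetical input format).
    """
    reads_per_umi = Counter(read_umis)
    return {
        # Number of distinct UMIs, i.e. the deduplicated count.
        "umi_count": len(reads_per_umi),
        # Total raw reads across all UMIs, before deduplication.
        "consensus_count": sum(reads_per_umi.values()),
        # Left empty here, since the protocol uses UMIs.
        "duplicate_count": None,
    }


example = count_fields(["AAC", "AAC", "GGT", "AAC", "TTA"])
print(example)
```

With five reads spread over three distinct UMIs, umi_count is 3 and consensus_count is 5.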
duplicate_count now remains empty (what is it actually for?)
Exact sequence duplicates, that is, same length and identical nucleotide sequence. It is used almost exclusively by pre-processing tools for bulk AIRR-seq. As those duplicate sequences will have identical annotations from (say) IgBlast, it is used as an optimization to speed up the analysis workflow.
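As a rough illustration of that optimization, the sketch below collapses exact duplicates so that each distinct sequence would only need to be annotated once. The helper `collapse_exact_duplicates` is hypothetical and is not taken from any particular pre-processing tool.

```python
from collections import Counter


def collapse_exact_duplicates(sequences):
    """Collapse reads with identical nucleotide sequence (and hence length).

    Returns one record per distinct sequence, in first-seen order,
    with the number of identical reads stored as duplicate_count.
    """
    counts = Counter(sequences)  # Counter keeps insertion order
    return [
        {"sequence": seq, "duplicate_count": n}
        for seq, n in counts.items()
    ]


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACG"]
collapsed = collapse_exact_duplicates(reads)
for record in collapsed:
    print(record)
```

Note that "ACG" is not merged with "ACGT": only sequences that match exactly over their full length count as duplicates.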
A further point. It is important for downstream analysis tools to be aware of and use duplicate_count, especially if they are performing counts or statistics that take the number of sequences into account.
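A small sketch of what that means in practice: counting rows undercounts the sequences once duplicates have been collapsed, so a tool would sum duplicate_count instead. The function `total_sequences` is hypothetical, and treating a missing duplicate_count as 1 is an assumption of this example, not something the standard prescribes.

```python
def total_sequences(rows):
    """Count sequences, weighting each row by its duplicate_count.

    Assumption: a missing or empty duplicate_count counts as 1.
    """
    total = 0
    for row in rows:
        dup = row.get("duplicate_count")
        total += 1 if dup in (None, "") else int(dup)
    return total


rows = [
    {"sequence_id": "s1", "duplicate_count": 3},
    {"sequence_id": "s2", "duplicate_count": 1},
    {"sequence_id": "s3", "duplicate_count": None},
]

naive = len(rows)                 # ignores collapsed duplicates
weighted = total_sequences(rows)  # accounts for duplicate_count
print(naive, weighted)
```

Here the naive row count is 3, while the weighted count is 5.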
Got it, thanks everyone!
Scirpy will use the umi_count field by default from the next release on.
In the context of single-cell TCR data, I've always been a bit confused about what to put in which field. I've seen that in the latest revision of the AIRR standard, a umi_count field has been added to resolve some of this ambiguity.
Just to be sure I got this right:
- umi_count should contain the deduplicated read count (i.e. the number of unique UMIs)
- duplicate_count now remains empty (what is it actually for?)
- consensus_count should contain the raw read count (before UMI deduplication)
Is this correct?
This came up in https://github.com/scverse/scirpy/issues/478