Clarification on umi_count/duplicate_count/consensus_count

airr-community / airr-standards

AIRR Community Data Standards

https://docs.airr-community.org

Creative Commons Attribution 4.0 International

Clarification on umi_count/duplicate_count/consensus_count #743

Closed: grst closed this issue 5 months ago

grst commented 5 months ago

In the context of single-cell TCR data, I've always been a bit confused about what to put in which field. I've seen that a umi_count field has been added in the latest revision of the AIRR standard to resolve some of this ambiguity.

Just to be sure I got this right:

umi_count should contain the deduplicated read count (i.e. the number of unique UMIs)
duplicate_count now remains empty (what is it actually for?)
consensus_count should contain the raw read count (before UMI deduplication)

Is this correct?

This came up in https://github.com/scverse/scirpy/issues/478

bcorrie commented 5 months ago

I'm glad you asked that question... 8-) Not...

Others might provide a more precise answer, but my understanding is that duplicate_count doesn't apply to single-cell 10X-style experiments - see https://github.com/airr-community/airr-standards/issues/543#issuecomment-1034020133

It is for use in non-UMI bulk studies if I understand correctly.

The different fields stem from trying to capture the different types of counts one might have with techniques as varied as bulk and single-cell.

See the associated very long threads as to how this was arrived at. Perhaps we need to update the docs to better reflect this?

bcorrie commented 5 months ago

I am hoping that @javh @scharch might comment on your use of consensus_count and umi_count. It seems right to me, but I am not an expert.

scharch commented 5 months ago

Yes, I think this is correct. See https://github.com/airr-community/airr-standards/issues/161#issuecomment-967277462 for the point that duplicate_count is intended for use when there are no UMIs in the experimental protocol.
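To make the mapping confirmed above concrete, here is a minimal sketch in Python. It assumes the reads for one contig have already been grouped and tagged with their UMI; the toy data and the helper name `summarize_counts` are hypothetical and not part of the AIRR standard or of scirpy.

```python
from collections import defaultdict

# Hypothetical toy reads for a single contig: (umi, read_sequence).
reads = [
    ("AAAC", "TGTGCC"),
    ("AAAC", "TGTGCC"),
    ("AAAC", "TGTGCC"),
    ("GGTA", "TGTGCC"),
    ("GGTA", "TGTGCC"),
    ("CTTG", "TGTGCC"),
]


def summarize_counts(reads):
    """Derive count fields as described in this thread (illustrative only)."""
    reads_per_umi = defaultdict(int)
    for umi, _seq in reads:
        reads_per_umi[umi] += 1
    return {
        # Deduplicated count: number of unique UMIs.
        "umi_count": len(reads_per_umi),
        # Raw read count, before UMI deduplication.
        "consensus_count": sum(reads_per_umi.values()),
        # Left empty for UMI-based protocols, per the discussion above.
        "duplicate_count": None,
    }


record = summarize_counts(reads)
print(record)
```

With this toy input, three distinct UMIs cover six reads, so umi_count is 3 and consensus_count is 6.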

schristley commented 5 months ago

> duplicate_count now remains empty (what is it actually for?)

Exact sequence duplicates, that is, same length and identical nucleotide sequence. It is used almost exclusively by pre-processing tools for bulk AIRR-seq. Because those duplicate sequences will have identical annotations from (say) IgBlast, collapsing them into one record with a count is an optimization that speeds up the analysis workflow.
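As a rough illustration of that collapsing step, the sketch below groups identical nucleotide sequences and records how many times each occurred. The input list and the function name `collapse_exact_duplicates` are hypothetical; real pre-processing tools do this on sequencing files, not on in-memory lists.

```python
from collections import Counter

# Hypothetical bulk reads (no UMIs); three of them are exact duplicates.
sequences = ["TGTGCCAGC", "TGTGCCAGC", "TGTGCCAGT", "TGTGCCAGC", "TGTGCC"]


def collapse_exact_duplicates(sequences):
    """Collapse identical sequences into one record each, keeping a count."""
    counts = Counter(sequences)  # exact string match: same length, same bases
    return [
        {"sequence": seq, "duplicate_count": n}
        for seq, n in counts.items()
    ]


collapsed = collapse_exact_duplicates(sequences)
for row in collapsed:
    print(row)
```

Only the three collapsed records would then need to be annotated, instead of all five reads.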

schristley commented 5 months ago

A further point. It is important for downstream analysis tools to be aware of and use duplicate_count, especially if they are performing counts or statistics that take the number of sequences into account.
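A small sketch of why this matters: the example below computes V gene usage from collapsed records, once weighting each row by duplicate_count and once counting rows only. The rows and the function `v_gene_usage` are hypothetical, and treating a missing duplicate_count as 1 is an assumption made for this example, not something the standard prescribes.

```python
from collections import defaultdict

# Hypothetical collapsed rearrangement records.
rows = [
    {"v_call": "TRBV7-9", "duplicate_count": 8},
    {"v_call": "TRBV20-1", "duplicate_count": 1},
    {"v_call": "TRBV7-9", "duplicate_count": None},
]


def v_gene_usage(rows, weighted=True):
    """Return the fraction of sequences per V gene."""
    totals = defaultdict(int)
    for row in rows:
        # Assumption: a missing duplicate_count means one sequence.
        weight = (row.get("duplicate_count") or 1) if weighted else 1
        totals[row["v_call"]] += weight
    grand_total = sum(totals.values())
    return {gene: n / grand_total for gene, n in totals.items()}


weighted = v_gene_usage(rows, weighted=True)
unweighted = v_gene_usage(rows, weighted=False)
print(weighted)
print(unweighted)
```

Ignoring duplicate_count here would report TRBV7-9 at two thirds of sequences instead of nine tenths, which is the kind of distortion the comment above warns about.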

grst commented 5 months ago

Got it, thanks everyone! Scirpy will use the umi_count field by default from the next release on.