Closed bcorrie closed 2 years ago
@javh @scharch would one of you be able to update the "description" field in the spec for these two counts. I don't know what the right wording should be as per the discussion here: https://github.com/airr-community/airr-standards/issues/543#issuecomment-1035332041
The old descriptions (which are still in the spec) are:
umi_count:
type: integer
description: Number of Rearrangement records (sequences) included in this clone.
clone_count:
type: integer
description: Non-normalized absolute count of the number of members (immune cells) in this clone.
I think maybe the clone_count is OK, but the umi_count I am pretty sure needs an update.
If you want you could add duplicate_count and a description as well... 8-)
Thanks, @bcorrie. I pushed some updates for us to hammer on. I also added umi_count
to Rearrangement and changes some of the wording in there.
One of the things we need to figure out is whether clone_count
should be the count of:
I'm certain we talked about this before - I'll try to find where later.
I put the description for clone_count
back to its original, which is to imply that the clonal assignment tool, or companion tool, decides what is the clonal abundance (count).
@schristley, I don't think the original description for clone_count
is correct (which is why I changed it). Two examples where this is plainly wrong are:
In both cases, it's a best-effort to approximate cell count, but it's also not cell count nor raw counts.
@schristley, I don't think the original description for
clone_count
is correct (which is why I changed it). Two examples where this is plainly wrong are:
@javh Okay, but your description makes it sound like it's just a count of rearrangement records. The original intent #471 of clone_abundance
which got renamed to clone_count
was to record what the analysis tool (and/or experimental protocol) inferred to be the clone size. "Non-normalized absolute" is meant to indicate that a frequency shouldn't be provided, as other analysis might want to perform their own normalization. Is this better?
Non-normalized absolute count of the inferred number of members (immune cells) in this clone.
"just a count of rearrangement records" will often be correct. Whether it's total sequences or unique sequences will depend upon whether duplicates were removed before calculating clone_count
.
"Non-normalized", "inferred" and "immune cells" are all problematic language.
Some calculations to cover for clone_count
:
cell_id
. In practice this is the same as (1), but with cell_id
grouping.These are the most common ones that occur to me; excluding statistical inference methods (model fitting, unseen species correction, etc).
My wording didn't cover all these cases either...
"just a count of rearrangement records" will often be correct. Whether it's total sequences or unique sequences will depend upon whether duplicates were removed before calculating
clone_count
.
That may be so, but if it's not always true then it's trouble because somebody like Brian is going to read it literally and implement it that way everywhere without considering cases when it isn't true.
"Non-normalized", "inferred" and "immune cells" are all problematic language.
How about something like this:
Absolute count of the size (number of members) of this clone in the repertoire. This could simply be the number of sequences (Rearrangement records) observed in this clone, or it may be a more sophisticated analysis calculation intertwined with an experimental protocol. Absolute count is provided versus a frequency so that downstream analysis tools can perform their own normalization. The standard frequency can be calculated by dividing by the clone_count sum for all clones in the repertoire.
@schristley, I like it. That seems to cover everything better than the original wording and my wording. We can probably drop the last sentence, as we don't need to provide suggestions for how to perform relative abundance calculations that aren't covered by a standard field.
How about this edit?
Absolute count of the size (number of members) of this clone in the repertoire. This could simply be the number of sequences (Rearrangement records) observed in this clone, the number of distinct cell barcodes (unique cell_id values), or a more sophisticated calculation appropriate to the experimental protocol. Absolute count is provided instead of a frequency so that downstream analysis tools can perform their own normalization.
We already starting making changes for #161 so we should make the other changes as well.
NOTE: This PR does not address #317
@javh there is a the Travis CI check that is not passing. Expected but waiting - is that expected?
Finally - synced all the specs so they pass consistency check 8-)
Hope I didn't screw up any definitions in the progress - @javh you are on for review, can you double check...
Looks good to me. Thanks, @bcorrie. I'll merge this.
That weird Travis "expected but waiting" thing should be gone in every PR now.
Also, we need to get in touch with 10x about the umi_count
field being in v1.4.
Closes #543 Closes #161