airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Clone updates #586

Closed bcorrie closed 2 years ago

bcorrie commented 2 years ago

Closes #543 Closes #161

bcorrie commented 2 years ago

@javh @scharch would one of you be able to update the "description" field in the spec for these two counts. I don't know what the right wording should be as per the discussion here: https://github.com/airr-community/airr-standards/issues/543#issuecomment-1035332041

The old descriptions (which are still in the spec) are:

        umi_count:
            type: integer
            description: Number of Rearrangement records (sequences) included in this clone.
        clone_count:
            type: integer
            description: Non-normalized absolute count of the number of members (immune cells) in this clone.

https://github.com/airr-community/airr-standards/blob/74e2733d40f00a11280236ba18a20ced39d2c40a/specs/airr-schema.yaml#L3763

I think maybe the clone_count is OK, but the umi_count I am pretty sure needs an update.

If you want you could add duplicate_count and a description as well... 8-)

javh commented 2 years ago

Thanks, @bcorrie. I pushed some updates for us to hammer on. I also added umi_count to Rearrangement and changes some of the wording in there.

One of the things we need to figure out is whether clone_count should be the count of:

  1. Total sequences.
  2. Unique sequence.
  3. Either of the above, depending upon tool and analysis context.

I'm certain we talked about this before - I'll try to find where later.

schristley commented 2 years ago

I put the description for clone_count back to its original, which is to imply that the clonal assignment tool, or companion tool, decides what is the clonal abundance (count).

javh commented 2 years ago

@schristley, I don't think the original description for clone_count is correct (which is why I changed it). Two examples where this is plainly wrong are:

  1. Classical bulk BCR sequencing clone sizes, which are often calculated as the number of unique variants.
  2. Adaptive's data which does not provide non-normalized counts - only counts normalized to their standard curves.

In both cases, it's a best-effort to approximate cell count, but it's also not cell count nor raw counts.

schristley commented 2 years ago

@schristley, I don't think the original description for clone_count is correct (which is why I changed it). Two examples where this is plainly wrong are:

@javh Okay, but your description makes it sound like it's just a count of rearrangement records. The original intent #471 of clone_abundance which got renamed to clone_count was to record what the analysis tool (and/or experimental protocol) inferred to be the clone size. "Non-normalized absolute" is meant to indicate that a frequency shouldn't be provided, as other analysis might want to perform their own normalization. Is this better?

Non-normalized absolute count of the inferred number of members (immune cells) in this clone.

javh commented 2 years ago

"just a count of rearrangement records" will often be correct. Whether it's total sequences or unique sequences will depend upon whether duplicates were removed before calculating clone_count.

"Non-normalized", "inferred" and "immune cells" are all problematic language.

javh commented 2 years ago

Some calculations to cover for clone_count:

  1. Total number of Rearrangements.
  2. Number of unique variants within the clone. Same as above if you've removed duplicates.
  3. Total number of unique cell_id. In practice this is the same as (1), but with cell_id grouping.
  4. Total number of raw reads and Adaptive's read count corrected by their standard curve. These two are the same thing w/ and w/out PCR amplification bias correction.

These are the most common ones that occur to me; excluding statistical inference methods (model fitting, unseen species correction, etc).

My wording didn't cover all these cases either...

schristley commented 2 years ago

"just a count of rearrangement records" will often be correct. Whether it's total sequences or unique sequences will depend upon whether duplicates were removed before calculating clone_count.

That may be so, but if it's not always true then it's trouble because somebody like Brian is going to read it literally and implement it that way everywhere without considering cases when it isn't true.

"Non-normalized", "inferred" and "immune cells" are all problematic language.

How about something like this:

Absolute count of the size (number of members) of this clone in the repertoire. This could simply be the number of sequences (Rearrangement records) observed in this clone, or it may be a more sophisticated analysis calculation intertwined with an experimental protocol. Absolute count is provided versus a frequency so that downstream analysis tools can perform their own normalization. The standard frequency can be calculated by dividing by the clone_count sum for all clones in the repertoire.

javh commented 2 years ago

@schristley, I like it. That seems to cover everything better than the original wording and my wording. We can probably drop the last sentence, as we don't need to provide suggestions for how to perform relative abundance calculations that aren't covered by a standard field.

How about this edit?

Absolute count of the size (number of members) of this clone in the repertoire. This could simply be the number of sequences (Rearrangement records) observed in this clone, the number of distinct cell barcodes (unique cell_id values), or a more sophisticated calculation appropriate to the experimental protocol. Absolute count is provided instead of a frequency so that downstream analysis tools can perform their own normalization.

schristley commented 2 years ago

We already starting making changes for #161 so we should make the other changes as well.

scharch commented 2 years ago

NOTE: This PR does not address #317

bcorrie commented 2 years ago

@javh there is a the Travis CI check that is not passing. Expected but waiting - is that expected?

bcorrie commented 2 years ago

Finally - synced all the specs so they pass consistency check 8-)

Hope I didn't screw up any definitions in the progress - @javh you are on for review, can you double check...

javh commented 2 years ago

Looks good to me. Thanks, @bcorrie. I'll merge this.

That weird Travis "expected but waiting" thing should be gone in every PR now.

Also, we need to get in touch with 10x about the umi_count field being in v1.4.