airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

AIRR could use a specification of mutations counts/frequencies and maybe aggregates #772

Open schristley opened 4 months ago

schristley commented 4 months ago

Ok, I admit it, @javh was so optimistic about being feature complete that I couldn't help tossing this out. ;-)

The specification is not particularly hard, as the Immcantation suite already has fields for the mutations it calls, though it stops at the nucleotides (organized by codons according to IMGT numbering). With Repcalc, I've straightforwardly extended this to amino acids, though I'm not particularly enamored with the shortened names: mu_count_5_r, mu_count_10_s_aa, etc. Not hard, I think, just lots and lots of fields. There are the count fields, then the corresponding frequency fields (for a single sequence), plus the same count and frequency fields for regions and for the sequence as a whole.
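To make the field layout concrete, here is a minimal sketch of how the per-codon names could be generated and rolled up for one sequence. It assumes the pattern mu_count_<IMGT codon position>_<r|s>, with r for replacement and s for silent and an _aa suffix for the amino acid variants, as suggested by the two names quoted above. The whole-sequence names (mu_count_r, mu_freq_r and so on) and the frequency denominator (number of codon positions present) are hypothetical placeholders, not anything Immcantation or Repcalc is known to emit.

```python
def field_name(kind, position, mutation_type, amino_acid=False):
    """Build a name such as mu_count_5_r or mu_count_10_s_aa."""
    name = f"mu_{kind}_{position}_{mutation_type}"
    return f"{name}_aa" if amino_acid else name


def per_sequence_fields(codon_counts):
    """Flatten per-codon counts into one row of fields.

    codon_counts maps an IMGT codon position to {"r": int, "s": int}.
    """
    row = {}
    n_positions = len(codon_counts)
    for mutation_type in ("r", "s"):
        total = 0
        for position, counts in sorted(codon_counts.items()):
            row[field_name("count", position, mutation_type)] = counts[mutation_type]
            total += counts[mutation_type]
        # Hypothetical whole-sequence fields. The denominator (codon positions
        # present in this sequence) is an assumption, not a settled definition.
        row[f"mu_count_{mutation_type}"] = total
        row[f"mu_freq_{mutation_type}"] = total / n_positions if n_positions else 0.0
    return row
```

Even this toy version shows why the field list grows quickly: every codon position contributes a field per mutation type, and that doubles again once the amino acid and frequency variants are added.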

What gets a little trickier is if we want to specify aggregates, for example the mutation counts/frequencies across all the sequences for a clone. The fields aren't that hard, but more information needs to be maintained in order to do the calculations correctly. In particular, not all sequences will have all codon positions (e.g. think CDR1 and CDR2 of various sizes among the sequences).
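The coverage problem can be sketched as follows. This is only an illustration under stated assumptions: each sequence is represented as a mapping from IMGT codon position to a mutation count, positions that a sequence lacks are simply absent from its mapping, and the per-position frequency is taken as the summed count divided by the number of sequences that actually contain that position. The function name and the output keys are hypothetical.

```python
from collections import defaultdict


def aggregate_clone(sequences):
    """Aggregate per-codon mutation counts across the sequences of a clone.

    sequences is a list of dicts mapping codon position -> mutation count.
    A position missing from a dict means that sequence lacks the codon.
    """
    totals = defaultdict(int)    # summed mutation count per position
    coverage = defaultdict(int)  # number of sequences containing the position
    for seq in sequences:
        for position, count in seq.items():
            totals[position] += count
            coverage[position] += 1
    return {
        position: {
            "count": totals[position],
            "n_seqs": coverage[position],
            # Assumed definition: normalize by sequences covering the position,
            # not by the total number of sequences in the clone.
            "freq": totals[position] / coverage[position],
        }
        for position in sorted(coverage)
    }


clone = [
    {27: 1, 28: 0, 29: 1},  # longer CDR1: has position 29
    {27: 0, 28: 1},         # shorter CDR1: position 29 absent
    {27: 1},
]
summary = aggregate_clone(clone)
```

The extra n_seqs value is the "more information" in question: without the per-position coverage, dividing by the clone size would understate the frequency at positions that only some sequences have.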

javh commented 4 months ago

Hehe. "Optimistic" might not be quite right... more like resignation to fate. :)