airr-community / ogrdb

Website and associated database for managing submissions of inferred alleles

Genotype statistics - counts of unique CDR3s, J and D calls #40

Closed williamdlees closed 5 years ago

williamdlees commented 5 years ago

The definition of these fields does not restrict the count to sequences that exactly match the inferred germline sequence. The definition of 'unique CDR3s', for example, is 'Number of unique CDR3s found associated with an inferred V sequence'. However, a restriction to exact matches would be preferable, and a quick examination of submissions has shown that the majority do appear to restrict the count in this way.

Should we change the definition (and the code in genotype_statistics.R) to make it clear that the count should be applied to exact matches only?

williamdlees commented 5 years ago

Agreed on the IARC call today that we should change the definition (and the code in genotype_statistics.R) to make it clear that the count should be applied to exact matches only.

williamdlees commented 5 years ago

From Chaim Schramm:

I think the question is how far IARC wants to allow inferences from tools like partis and the new version of TIgGER that don't rely on the assumption that near-germline sequences are present in the data. Can I submit inferences made from sequencing sorted IgG memory cells of a chronically infected HIV donor? If so, there might well be few or no unique CDR3s with exact matches to the inferred germlines. Isn't that why we have the unmutated_frequency field?

williamdlees commented 5 years ago

Will include both 'mutated' and 'unmutated' counts.
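
To make the split concrete, here is a minimal sketch of counting unique CDR3s separately for exact ('unmutated') and non-exact ('mutated') matches to each inferred V sequence. It is written in Python for illustration only and is not the code in genotype_statistics.R. The record fields `v_call`, `cdr3` and `v_mutations`, the function name, and the allele labels are all assumptions made for the example.

```python
from collections import defaultdict


def unique_cdr3_counts(records):
    """Count distinct CDR3s per V allele, split by exact vs. non-exact V match.

    Each record is a dict with hypothetical keys:
      v_call      - the inferred V allele the read was assigned to
      cdr3        - the CDR3 sequence of the read
      v_mutations - number of differences from the inferred V sequence
    """
    unmutated = defaultdict(set)  # allele -> CDR3s seen on exact matches
    mutated = defaultdict(set)    # allele -> CDR3s seen on non-exact matches

    for rec in records:
        target = unmutated if rec["v_mutations"] == 0 else mutated
        target[rec["v_call"]].add(rec["cdr3"])

    alleles = sorted(set(unmutated) | set(mutated))
    return {
        allele: {
            "unmutated": len(unmutated.get(allele, ())),
            "mutated": len(mutated.get(allele, ())),
        }
        for allele in alleles
    }


# Made-up records; the allele labels are placeholders, not real inferences.
example_records = [
    {"v_call": "V_ALLELE_A", "cdr3": "CARDY", "v_mutations": 0},
    {"v_call": "V_ALLELE_A", "cdr3": "CARDY", "v_mutations": 0},  # duplicate CDR3
    {"v_call": "V_ALLELE_A", "cdr3": "CARGW", "v_mutations": 2},
    {"v_call": "V_ALLELE_A", "cdr3": "CARDY", "v_mutations": 3},
    {"v_call": "V_ALLELE_B", "cdr3": "CAKDL", "v_mutations": 1},
]

counts = unique_cdr3_counts(example_records)
print(counts)
```

In this sketch a CDR3 seen on both exact and non-exact matches is counted once in each category, so the two counts need not sum to the overall number of unique CDR3s. An allele inferred from a heavily mutated repertoire, as in Chaim's example, would simply report an unmutated count of zero.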