williamdlees closed 5 years ago
Agreed on the IARC call today that we should change the definition (and the code in genotype_statistics.R) to make it clear that the count applies to exact matches only.
From Chaim Schramm:
I think the question is how much does IARC want to allow inferences from tools like partis and the new version of TIgGER that don't rely on the assumption that near-germline sequences are present in the data? Can I submit inferences made from sequencing sorted IgG memory cells of a chronically infected HIV donor? If so, there might well be few or no unique CDR3s with exact matches to the inferred germlines. Isn't that why we have the unmutated_frequency field?
Will include both 'mutated' and 'unmutated' counts.
The definition of these fields does not restrict the count to sequences that exactly match the inferred germline sequence. The definition of 'unique CDR3s', for example, is 'Number of unique CDR3s found associated with an inferred V sequence'. However, a restriction to exact matches would be preferable, and a quick examination of submissions has shown that the majority do appear to restrict the count in this way.
Should we change the definition (and the code in genotype_statistics.R) to make it clear that the count should be applied to exact matches only?
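
To make the two readings of 'unique CDR3s' concrete, here is a minimal Python sketch. It is not the logic in genotype_statistics.R: the record keys (`v_call`, `v_sequence`, `cdr3`), the output keys and the function name are hypothetical and chosen only for illustration. It counts unique CDR3s per inferred allele twice, once over all assigned sequences and once over sequences that match the inferred germline exactly.

```python
from collections import defaultdict


def unique_cdr3_counts(records, inferred_alleles):
    """Count unique CDR3s per inferred allele under both definitions.

    records: iterable of dicts with hypothetical keys 'v_call',
             'v_sequence' and 'cdr3' (assumed names, not a real schema).
    inferred_alleles: dict mapping allele name -> inferred germline sequence.
    """
    all_cdr3s = defaultdict(set)    # every sequence assigned to the allele
    exact_cdr3s = defaultdict(set)  # only exact matches to the inferred germline

    for rec in records:
        allele = rec["v_call"]
        if allele not in inferred_alleles:
            continue  # not one of the inferred alleles under review
        all_cdr3s[allele].add(rec["cdr3"])
        # Case-insensitive comparison of the V region against the inference.
        if rec["v_sequence"].upper() == inferred_alleles[allele].upper():
            exact_cdr3s[allele].add(rec["cdr3"])

    return {
        allele: {
            "unique_cdr3s_all": len(all_cdr3s[allele]),
            "unique_cdr3s_unmutated": len(exact_cdr3s[allele]),
        }
        for allele in inferred_alleles
    }


# Toy data, invented for illustration only.
inferred = {"novel_allele_1": "ACGTACGT"}
records = [
    {"v_call": "novel_allele_1", "v_sequence": "ACGTACGT", "cdr3": "CARDY"},
    {"v_call": "novel_allele_1", "v_sequence": "acgtacgt", "cdr3": "CARDY"},
    {"v_call": "novel_allele_1", "v_sequence": "ACGTACGA", "cdr3": "CARGG"},
    {"v_call": "novel_allele_1", "v_sequence": "ACGTACGT", "cdr3": "CARWW"},
    {"v_call": "other_allele", "v_sequence": "ACGTACGT", "cdr3": "CARZZ"},
]
counts = unique_cdr3_counts(records, inferred)
```

On a heavily mutated repertoire such as the sorted IgG memory case raised above, the exact-match count could fall to zero while the all-sequences count stays high, which is why reporting both may be useful.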