airr-community / ogrdb

Website and associated database for managing submissions of inferred alleles

Flag suspect artefacts in inferences #57

Closed williamdlees closed 4 years ago

williamdlees commented 5 years ago

From Andrew Collins:

The A85C SNP delivers the sequence CCCC, where C is the underlined nucleotide. The A152G SNP delivers the sequence GGGG, where G is the underlined nucleotide.

Interestingly, there are no examples of glycine at codon 51 (A152G) or proline at codon 29 (A85C). I have thought for a long time that an inferred amino acid change should be checked against all previously reported codons within the gene family under consideration. (It is surprising how little variation there is at each position, across gene families.) Could this easily be done as part of OGRDB? I would not rule out a previously unseen change, but it would raise a red flag. (Are there other checks that could routinely be run? What about flagging SNPs that involve RGYW/WRCY hotspot motifs? That is, where the difference is seen at the underlined G or C.)
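
As a rough illustration of the two checks suggested above, here is a minimal Python sketch. It is not OGRDB code. The function names (`residue_at`, `is_novel_residue`, `hotspot_positions`) are hypothetical, and it assumes ungapped, in-frame nucleotide sequences of equal numbering, 1-based SNP positions as in "A85C", and that the hotspot motif is looked for in the reference sequence rather than in the inferred one.

```python
import re
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC codes: R = A/G, Y = C/T, W = A/T. Lookaheads allow overlapping motifs.
RGYW = re.compile(r"(?=[AG]G[CT][AT])")
WRCY = re.compile(r"(?=[AT][AG]C[CT])")


def residue_at(seq, nt_pos):
    """Return (codon number, amino acid) for the codon containing 1-based nt_pos.

    Hypothetical helper: assumes seq starts at the first base of codon 1.
    """
    codon_index = (nt_pos - 1) // 3
    codon = seq[3 * codon_index: 3 * codon_index + 3].upper()
    return codon_index + 1, CODON_TABLE.get(codon, "X")


def is_novel_residue(inferred, nt_pos, family_seqs):
    """True if no sequence in family_seqs has the inferred residue at that codon."""
    _, inferred_aa = residue_at(inferred, nt_pos)
    seen = {residue_at(s, nt_pos)[1] for s in family_seqs}
    return inferred_aa not in seen


def hotspot_positions(reference):
    """1-based positions of the G in each RGYW and the C in each WRCY motif."""
    reference = reference.upper()
    positions = set()
    for m in RGYW.finditer(reference):
        positions.add(m.start() + 2)  # G is the 2nd base of RGYW
    for m in WRCY.finditer(reference):
        positions.add(m.start() + 3)  # C is the 3rd base of WRCY
    return positions
```

With this numbering, nucleotide 85 falls in codon 29 and nucleotide 152 in codon 51, matching the positions quoted above. A novel residue or a hotspot hit would only raise a flag for review, not reject the inference.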

Could OGRDB collect changes that we agree are errors, and run automatic checks? It seems likely that whatever gave rise to the A85C and A152G errors, they will happen again.
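
One way such an automatic check might look is sketched below in Python. This is an assumption about the approach, not a description of how OGRDB works: the `KNOWN_ARTEFACTS` set and both function names are hypothetical, the set is seeded only with the two SNPs named in this issue, and the sequences are assumed to be aligned, ungapped and of equal length. A real list would presumably also record which gene or gene family each error applies to.

```python
# Hypothetical registry of changes agreed to be errors, in the issue's
# notation: reference base, 1-based position, inferred base.
KNOWN_ARTEFACTS = {"A85C", "A152G"}


def snps_between(reference, inferred):
    """List single-nucleotide differences as strings such as 'A85C'."""
    if len(reference) != len(inferred):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(reference.upper(), inferred.upper())
    return [f"{ref}{pos}{alt}"
            for pos, (ref, alt) in enumerate(pairs, start=1)
            if ref != alt]


def flag_known_artefacts(reference, inferred, known=KNOWN_ARTEFACTS):
    """Return the SNPs in the inferred sequence that match a known artefact."""
    return [snp for snp in snps_between(reference, inferred) if snp in known]
```

For example, an inferred sequence that differs from its reference only by A to C at position 85 would come back as `["A85C"]`, which the submission page could show as a warning.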

williamdlees commented 5 years ago

Altogether I think there are three potential checks mentioned here: