Closed nickzoic closed 7 months ago
Actually looking at http://varnomen.hgvs.org/recommendations/protein/variant/substitution/ the syntax is a bit different, eg:
p.Trp24Cys
instead of p.24Trp>Cys
p.Arg76_Cys77delinsSerTrp
instead of p.76_77delinsSerTrp
or p.[Arg76Ser;Cys77Trp]
This might be a lot harder than I initially thought.
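For reference, the three forms above can be produced with a few lines of string formatting. This is only a sketch of the notation: the function names `substitution`, `delins` and `allele` are made up for illustration and are not part of CountESS or any HGVS library.

```python
def substitution(ref: str, pos: int, alt: str) -> str:
    """Single amino-acid substitution: reference residue, position, new residue."""
    return f"p.{ref}{pos}{alt}"


def delins(first: str, start: int, last: str, end: int, inserted: list[str]) -> str:
    """Deletion-insertion: the first and last deleted residues are both named."""
    return f"p.{first}{start}_{last}{end}delins{''.join(inserted)}"


def allele(changes: list[str]) -> str:
    """Several changes on one allele: drop each 'p.' prefix and join with ';'."""
    return "p.[" + ";".join(change[2:] for change in changes) + "]"


print(substitution("Trp", 24, "Cys"))
print(delins("Arg", 76, "Cys", 77, ["Ser", "Trp"]))
print(allele([substitution("Arg", 76, "Ser"), substitution("Cys", 77, "Trp")]))
```

The hard part is not the formatting but deciding which of these forms an aligned difference should map to.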
The [sequence_align.pairwise.hirschberg](https://github.com/kensho-technologies/sequence_align) library seems to do a good job, and can accept more than just sequences of characters, but its alignments need some post-processing to make them compatible with HGVS.
I've made a start on this work at https://github.com/nickzoic/countess-variants
This got more-or-less-fixed-for-now in 51435699358536edb1d55f6b73f9ca1cc85581e5 (merged in v0.0.44) although a more sophisticated approach would be welcomed as a plugin.
At the moment, the Variant Translator plugin takes a reference sequence and does a reasonable job of turning sequences into "g."-type HGVS strings, but there's no support for protein variants or for grouping variations into triplets (eg: for coding sequences).
We could also do protein variant calling by translating the before and after sequences to IUPAC single-letter protein sequences, then running Levenshtein on the two protein sequences, then translating the single-letter codes to the IUPAC three-letter codes used in HGVS.
It's likely that misalignment due to single-base insertions or deletions (frameshifts) will cause big changes at the protein level and get thrown out by the max_mutations limit.
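A rough sketch of the translate-then-compare idea, and of the frameshift problem. Assumptions: uppercase DNA, the standard genetic code, and a plain position-by-position comparison standing in for the Levenshtein step, so it only handles substitutions. All names here (`translate`, `call_substitutions`, `count_mismatches`) are hypothetical and not part of the Variant Translator plugin.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Single-letter to three-letter amino acid codes; "Ter" for a stop codon.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}


def translate(dna: str) -> str:
    """Translate DNA to single-letter protein codes, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def call_substitutions(ref_prot: str, var_prot: str) -> str:
    """Substitution-only caller: compares equal-length proteins residue by residue."""
    if len(ref_prot) != len(var_prot):
        raise ValueError("this sketch only handles equal-length proteins")
    changes = [
        f"{THREE_LETTER[r]}{pos}{THREE_LETTER[v]}"
        for pos, (r, v) in enumerate(zip(ref_prot, var_prot), start=1)
        if r != v
    ]
    if not changes:
        return "p.="
    if len(changes) == 1:
        return "p." + changes[0]
    return "p.[" + ";".join(changes) + "]"


def count_mismatches(a: str, b: str) -> int:
    """Number of differing residues over the overlapping length."""
    return sum(x != y for x, y in zip(a, b))


ref_dna = "ATGTGGCGTTGCGATGAA"        # translates to MWRCDE
sub_dna = "ATGTGCCGTTGCGATGAA"        # one base changed: TGG -> TGC
del_dna = ref_dna[:3] + ref_dna[4:]   # one base deleted: shifts the reading frame

print(call_substitutions(translate(ref_dna), translate(sub_dna)))
print(translate(del_dna), count_mismatches(translate(ref_dna), translate(del_dna)))
```

In this toy case the single-base substitution comes out as one protein change, while the single-base deletion changes every residue after the first, which is the kind of variant a mutation-count limit would discard.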