Support for Variant Calling

VariantEffect / mavehgvs

A specification and Python implementation for representing variants from Multiplexed Assays of Variant Effect.

BSD 3-Clause "New" or "Revised" License


Support for Variant Calling #30

Open nickzoic opened 1 year ago

nickzoic commented 1 year ago

mavehgvs supports construction of variant strings, but it'd also be helpful to add routines to do "variant calling", e.g. to turn a pair of sequences into a valid MAVE-HGVS variant string.

The full HGVS standard is very complicated, and identifying insertions from well-known sequences would require large storage and computing resources, but calling just simple insertions, deletions, duplications and substitutions would potentially be useful for a number of applications.

Examples elsewhere:

... but it'd make sense to move this functionality into mavehgvs.
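As a rough illustration of the kind of routine being proposed, here is a minimal sketch that calls a single simple variant from a pair of sequences by trimming the shared prefix and suffix. The function name `call_simple_variant` and its `prefix` parameter are hypothetical and not part of mavehgvs. The output strings follow one reading of the MAVE-HGVS syntax (including `n.=` for identical sequences) and have not been validated against the mavehgvs parser.

```python
def call_simple_variant(ref: str, alt: str, prefix: str = "n") -> str:
    """Describe the single difference between ref and alt (hypothetical helper)."""
    if ref == alt:
        return f"{prefix}.="  # assumed notation for "identical to target"

    # Trim the common prefix; this pushes the event as far 3' as possible.
    start = 0
    while start < len(ref) and start < len(alt) and ref[start] == alt[start]:
        start += 1

    # Trim the common suffix without running back into the trimmed prefix.
    end_ref, end_alt = len(ref), len(alt)
    while end_ref > start and end_alt > start and ref[end_ref - 1] == alt[end_alt - 1]:
        end_ref -= 1
        end_alt -= 1

    deleted = ref[start:end_ref]
    inserted = alt[start:end_alt]

    def span(first: int, last: int) -> str:
        # 1-based inclusive positions, written as "3" or "3_5".
        return str(first) if first == last else f"{first}_{last}"

    if len(deleted) == 1 and len(inserted) == 1:
        return f"{prefix}.{start + 1}{deleted}>{inserted}"
    if deleted and not inserted:
        return f"{prefix}.{span(start + 1, end_ref)}del"
    if inserted and not deleted:
        n = len(inserted)
        # A duplication repeats the reference bases immediately before it.
        if start >= n and ref[start - n:start] == inserted:
            return f"{prefix}.{span(start - n + 1, start)}dup"
        if start == 0 or start == len(ref):
            # No flanking reference base on one side; left out of this sketch.
            raise ValueError("insertion at a sequence end is not handled")
        return f"{prefix}.{start}_{start + 1}ins{inserted}"
    return f"{prefix}.{span(start + 1, end_ref)}delins{inserted}"


print(call_simple_variant("ACGT", "AGGT"))   # substitution
print(call_simple_variant("ACGT", "ACGGT"))  # duplication
```

This only handles one event per sequence pair; multiple separated edits would collapse into a single `delins`, so a real caller would need an alignment step first.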

afrubin commented 1 year ago

This sounds like a good addition to mavehgvs as long as it doesn't add any dependencies with difficult build requirements. The primary use case for mavehgvs is to power the MaveDB server-side variant validation, so we don't want to complicate that installation.

Since mavehgvs currently only supports substitutions and small, simple indel events anyways, it's very sensible to include a variant caller that can do that efficiently in the package and I think it would be broadly useful.

When https://github.com/VariantEffect/mavehgvs/issues/20 gets implemented, this will also mean that other tools can easily consume the called variants without even needing to deal with mavehgvs Variant objects.

genomematt commented 1 year ago

Also worth noting that what pebbles does could equally be done by expanding to two sequences and then calling. I skip the full expansion as it’s a bit more efficient.

https://github.com/genomematt/pebbles/blob/a810fb12b2858852468cfc6c7e79b87570920a07/src/pebbles/pebbles.py#L78

Using a new common function for calling from aligned sequences may be better in the long run.

nickzoic commented 1 year ago

Yep, @genomematt, I don't think it's at all strange to go straight from CIGAR to HGVS. The Levenshtein algorithm does a good job of finding the edits, but that information is already in CIGAR, so might as well use it.
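To make the point concrete, here is a minimal sketch of reading edits straight out of a CIGAR string without recomputing an alignment. It is not the pebbles implementation; the function name `cigar_to_edits` and the tuple layout it returns are assumptions made for illustration, and turning the tuples into MAVE-HGVS strings is left out.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_to_edits(cigar, ref, read, ref_start=0):
    """Return (kind, ref_position, ref_bases, read_bases) tuples (hypothetical helper).

    ref_start is the 0-based reference offset where the alignment begins.
    Reported positions are 1-based; for insertions the position is the
    reference base immediately before the inserted bases.
    """
    edits = []
    ref_pos, read_pos = ref_start, 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            # Aligned columns: compare base by base to find substitutions.
            for i in range(n):
                r, q = ref[ref_pos + i], read[read_pos + i]
                if r != q:
                    edits.append(("sub", ref_pos + i + 1, r, q))
            ref_pos += n
            read_pos += n
        elif op == "I":
            edits.append(("ins", ref_pos, "", read[read_pos:read_pos + n]))
            read_pos += n
        elif op == "D":
            edits.append(("del", ref_pos + 1, ref[ref_pos:ref_pos + n], ""))
            ref_pos += n
        elif op == "S":
            read_pos += n  # soft clip consumes read bases only
        elif op == "N":
            ref_pos += n  # skipped region consumes reference only
        # H and P consume neither sequence.
    return edits


print(cigar_to_edits("5M1D2M", "ACGTACGT", "ACTTAGT"))
```

Because aligners do not necessarily place indels at the 3'-most position, a real caller would probably still need a normalisation pass before emitting HGVS-style strings.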

In CountESS, I've switched to using rapidfuzz.distance.Levenshtein, which is what Levenshtein was using under the hood anyway. It's a C++ extension, but it looks like it has the expected support from conda, wheels, etc., and I was able to get it to install under nix too.