cdot responsible for fixing bad HGVS, allow warnings etc

SACGF / cdot

Transcript versions for HGVS libraries

MIT License

cdot responsible for fixing bad HGVS, allow warnings etc #27

Open davmlaw opened 1 year ago

davmlaw commented 1 year ago

There are plenty of bad HGVS strings out there, especially when people are typing into a search box, e.g. they put spaces in there, forget the colon, have unbalanced brackets etc.

VariantGrid has a lot of functionality to handle sloppy/bad HGVS strings, mostly in search and HGVS Matcher.

Ideally we should move all of this functionality into cdot, so that it can be generally useful.

Would be nice to have a framework where you return a list of well structured warnings / errors etc.
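
As a rough sketch of what such a framework could look like (every name here, including `Severity`, `HGVSMessage`, `CleanResult` and `check_hgvs`, is hypothetical and not existing cdot API), each check appends a structured message rather than raising straight away:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Severity(Enum):
    WARNING = "warning"  # the string was fixed and can still be used
    ERROR = "error"      # the string could not be repaired


@dataclass(frozen=True)
class HGVSMessage:
    severity: Severity
    code: str  # machine-readable identifier, e.g. "whitespace_removed"
    text: str  # human-readable description


@dataclass
class CleanResult:
    original: str
    cleaned: str
    messages: List[HGVSMessage] = field(default_factory=list)

    @property
    def has_errors(self) -> bool:
        return any(m.severity is Severity.ERROR for m in self.messages)


def check_hgvs(hgvs: str) -> CleanResult:
    """Hypothetical entry point: fix what can be fixed, report everything."""
    result = CleanResult(original=hgvs, cleaned=hgvs)

    # Fixable problem: whitespace anywhere in the string
    if any(ch.isspace() for ch in result.cleaned):
        result.cleaned = "".join(result.cleaned.split())
        result.messages.append(
            HGVSMessage(Severity.WARNING, "whitespace_removed", "Removed whitespace"))

    # Unfixable problem (in this sketch): bracket counts do not match
    if result.cleaned.count("(") != result.cleaned.count(")"):
        result.messages.append(
            HGVSMessage(Severity.ERROR, "unbalanced_brackets", "Unbalanced brackets"))

    return result
```

A caller could then decide for itself whether warnings are acceptable, e.g. by checking `result.has_errors` before passing `result.cleaned` on to a parser.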

davmlaw commented 1 year ago

It would be good to collect a large set of bad HGVS strings (from search bars around the place) as test cases, and then work out how to resolve them.

TheMadBug commented 1 year ago

The two big issues we see in Shariant search (the examples aren't valid, they just show the kinds of issues):

Incorrect case (both the NM and the c), e.g. nm_002342.3(BRCA2):C.5094-11G>A
Trailing quotes or leading tabs, e.g. NM_010934.4:c.3112A>G' though I believe fixing that should probably be done at the application layer before it gets to cdot, as those kinds of issues apply to all searches.
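
For illustration, both issues could be handled with something like the sketch below. The function names and the regex are assumptions made for this example, not anything that exists in cdot or VariantGrid, and the pattern only covers two-letter RefSeq-style prefixes such as NM_ or NC_:

```python
import re

# Assumed shape: two-letter prefix, underscore, anything up to the colon,
# then a one-letter coordinate type followed by a dot (e.g. "c.", "g.", "p.")
_PREFIX_AND_KIND = re.compile(
    r"^(?P<prefix>[A-Za-z]{2})_(?P<rest>[^:]*):(?P<kind>[cgmnopr])\.",
    re.IGNORECASE,
)


def strip_junk(text: str) -> str:
    """Remove surrounding whitespace (including tabs) and stray quotes."""
    return text.strip().strip("'\"").strip()


def fix_case(hgvs: str) -> str:
    """Upper-case the accession prefix and lower-case the coordinate type."""
    m = _PREFIX_AND_KIND.match(hgvs)
    if m is None:
        return hgvs  # not a shape this sketch understands, leave untouched
    head = f"{m['prefix'].upper()}_{m['rest']}:{m['kind'].lower()}."
    return head + hgvs[m.end():]
```

With this, `fix_case("nm_002342.3(BRCA2):C.5094-11G>A")` gives `NM_002342.3(BRCA2):c.5094-11G>A`, and `strip_junk` removes the trailing quote from the second example. Other accession styles (e.g. Ensembl) pass through `fix_case` unchanged.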

davmlaw commented 1 year ago

Will run on each environment:

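import socket

import pandas as pd
from django.db.models import Q
from eventlog.models import Event

hostname = socket.gethostname()
# Search events whose details look like they contain a coding HGVS ("c." or ":c")
search_qs = Event.objects.filter(name='search')
search_hgvs = search_qs.filter(Q(details__icontains='c.') | Q(details__icontains=':c'))
# Name the columns so the CSV gets a meaningful header instead of 0, 1
df = pd.DataFrame.from_records(search_hgvs.values_list("date", "details"), columns=["date", "details"])

# One CSV per host, so the results from each environment can be combined later
df.to_csv(f"/tmp/{hostname}_search_hgvs.csv", index=False)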

Then collect them all together. I have put the scripts in the "paper" directory in the cdot GitHub repo.

Emailed the CSV to James and myself to continue the analysis (I need to clean out data from private servers before I share it).

davmlaw commented 1 year ago

Web developers know to clean their user text, but I think the main use case of cdot would be bioinformaticians hacking together scripts.

We could run an evaluation of how many HGVSs resolve from the literature and ClinVar etc as well
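
Such an evaluation could be as simple as the sketch below. Here `resolve` is a placeholder for whatever parser or converter is being evaluated (an assumption of this example, not a specific library call), and it is expected to raise an exception when a string cannot be resolved:

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def resolution_rate(hgvs_strings: Iterable[str],
                    resolve: Callable[[str], object]) -> Dict[str, object]:
    """Count how many strings resolve and tally failures by exception type."""
    total = 0
    resolved = 0
    failures: Counter = Counter()
    for hgvs in hgvs_strings:
        total += 1
        try:
            resolve(hgvs)
            resolved += 1
        except Exception as exc:  # deliberately broad: we only want a tally
            failures[type(exc).__name__] += 1
    return {"total": total, "resolved": resolved, "failures": dict(failures)}
```

Running this before and after a cleaning step would show how many extra strings the cleaning rescues.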

davmlaw commented 1 year ago

Few thoughts:

At the moment 0 modification is done on HGVS import
This change proposes to add 1 cleaning op
On search, a number of cleaning ops are performed

Search currently works via:

Cleaning is automatically done, but ad-hoc
Messages are put into the search result object
Exceptions are returned and stop further search - can be rendered as warnings/errors

Few ideas:

If you add a general "clean hgvs" method, it would be good to be able to pass the subset of cleaning operations you want done, or maybe even expose the individual cleaning functions or classes, however we end up doing it.
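
One possible shape for this, purely as a sketch: the registry `CLEAN_OPS`, the function `clean_hgvs` and the individual op names below are all made up for illustration and are not current cdot API. Each op is a plain function that is exposed on its own, and the general method takes an optional list of op names:

```python
import re
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def remove_whitespace(hgvs: str) -> str:
    return "".join(hgvs.split())


def strip_quotes(hgvs: str) -> str:
    return hgvs.strip("'\"")


def add_missing_colon(hgvs: str) -> str:
    """Insert ':' before the first 'c.'/'g.'/etc. that follows a digit or ')'."""
    if ":" in hgvs:
        return hgvs
    return re.sub(r"(?<=[\d)])([cgmnpr]\.)", r":\1", hgvs, count=1)


# Registry of available ops; dict order is the default order of application
CLEAN_OPS: Dict[str, Callable[[str], str]] = {
    "remove_whitespace": remove_whitespace,
    "strip_quotes": strip_quotes,
    "add_missing_colon": add_missing_colon,
}


def clean_hgvs(hgvs: str,
               ops: Optional[Iterable[str]] = None) -> Tuple[str, List[str]]:
    """Apply the named ops (default: all) and report which ones changed the string."""
    names = list(CLEAN_OPS) if ops is None else list(ops)
    applied: List[str] = []
    for name in names:
        cleaned = CLEAN_OPS[name](hgvs)  # an unknown name raises KeyError
        if cleaned != hgvs:
            applied.append(name)
            hgvs = cleaned
    return hgvs, applied
```

Import could then call `clean_hgvs(text, ops=["remove_whitespace"])` to get just the one cleaning op, while search calls `clean_hgvs(text)` for everything, and the returned list of applied ops can be turned into warnings for the user.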