Open davmlaw opened 1 year ago
It would be good to collect a large test set of bad HGVSs (from search bars around the place) and then work out how to resolve them.
The two big issues we see in Shariant search (the examples aren't valid HGVS, they just show the kinds of issues):
nm_002342.3(BRCA2):C.5094-11G>A
NM_010934.4:c.3112A>G'
though I believe fixing that should probably be done at the application layer before it gets to cdot, as those kinds of issues apply to all searches.
Will run on each environment:
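import socket

import pandas as pd
from django.db.models import Q
from eventlog.models import Event

# Needs a configured Django environment (e.g. a Django shell); hostname tags the output file
hostname = socket.gethostname()
search_qs = Event.objects.filter(name='search')
# Keep only searches whose details look like they contain a c. HGVS
search_hgvs = search_qs.filter(Q(details__icontains='c.') | Q(details__icontains=':c'))
# Name the columns so the CSV header is meaningful rather than 0,1
df = pd.DataFrame.from_records(search_hgvs.values_list("date", "details"), columns=["date", "details"])
df.to_csv(f"/tmp/{hostname}_search_hgvs.csv", index=False)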
Then collect them all together. I have put the scripts in the "paper" directory in the cdot GitHub repo.
Emailed the CSV to James and myself to continue the analysis (I need to clean it and remove material from private servers before I share it).
Web developers know to clean their user text, but I think the main use case of cdot would be bioinformaticians hacking together scripts.
We could also run an evaluation of how many HGVSs from the literature, ClinVar etc. resolve.
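A minimal sketch of how such an evaluation could be tallied. It assumes some `resolver` callable that raises an exception when an HGVS string can't be resolved; both `resolver` and `resolve_rate` are hypothetical names, so swap in whichever HGVS library is being tested:

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def resolve_rate(
    hgvs_strings: Iterable[str],
    resolver: Callable[[str], object],
) -> Tuple[int, Counter]:
    """Count how many strings resolve, and group the failures by exception type.

    `resolver` is a stand-in (an assumption, not a cdot function): any callable
    that returns normally on success and raises on failure.
    """
    ok = 0
    failures: Counter = Counter()
    for hgvs_string in hgvs_strings:
        try:
            resolver(hgvs_string)
            ok += 1
        except Exception as exc:  # deliberately broad: we only want a tally
            failures[type(exc).__name__] += 1
    return ok, failures
```

Grouping by exception type gives a quick picture of which kinds of failure dominate, which helps decide which cleaning steps are worth adding first.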
Few thoughts:
At the moment, no modification is done on HGVS import.
This change proposes to add 1 cleaning op.
On search, a number of cleaning ops are performed.
Search currently works via:
Few ideas:
There are plenty of bad HGVS strings out there, especially when people are typing into a search box - eg they put spaces in there, forget the colon, have unbalanced brackets etc.
VariantGrid has a lot of functionality to handle sloppy/bad HGVSs - mostly in search and HGVS Matcher
Ideally should move all of this functionality into cdot, so that it can be generally useful.
Would be nice to have a framework where you return a list of well structured warnings / errors etc.
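A rough sketch of what a cleaner that returns structured warnings/errors could look like. None of these names exist in cdot or VariantGrid: `clean_hgvs`, `CleanResult` and the regex are made up for illustration, and the rules only cover the kinds of problems mentioned in this thread (stray quotes, spaces, wrong case, missing colon, unbalanced brackets):

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical names throughout: nothing here is part of cdot's actual API.
_HGVS_SHAPE = re.compile(
    r"(?P<ref>[A-Z]{2}_\d+(?:\.\d+)?(?:\([^()]*\))?)"  # accession, optional version and (GENE)
    r"(?P<sep>:?)"                                      # colon, possibly forgotten
    r"(?P<kind>[cgmnrpCGMNRP])\."                       # coordinate type, any case
    r"(?P<rest>.+)"
)


@dataclass
class CleanResult:
    original: str
    cleaned: str
    warnings: List[str] = field(default_factory=list)  # fixes that were applied
    errors: List[str] = field(default_factory=list)    # problems left unfixed


def clean_hgvs(raw: str) -> CleanResult:
    result = CleanResult(original=raw, cleaned=raw)

    # Strip whitespace and stray quote characters from both ends
    text = raw.strip(" \t\r\n'\"")
    if text != raw:
        result.warnings.append("stripped surrounding whitespace/quotes")

    # Remove whitespace typed inside the string
    no_ws = re.sub(r"\s+", "", text)
    if no_ws != text:
        result.warnings.append("removed internal whitespace")
        text = no_ws

    # nm_ -> NM_
    prefix = re.match(r"[A-Za-z]{2}_", text)
    if prefix and prefix.group() != prefix.group().upper():
        text = prefix.group().upper() + text[prefix.end():]
        result.warnings.append("upper-cased accession prefix")

    # Report unbalanced brackets rather than guessing where the fix belongs
    if text.count("(") != text.count(")"):
        result.errors.append("unbalanced brackets")

    shape = _HGVS_SHAPE.fullmatch(text)
    if shape is None:
        result.errors.append("unrecognised HGVS structure")
    else:
        if not shape["sep"]:
            result.warnings.append("inserted missing colon")
        if shape["kind"].isupper():
            result.warnings.append("lower-cased coordinate type")
        text = f"{shape['ref']}:{shape['kind'].lower()}.{shape['rest']}"

    result.cleaned = text
    return result
```

With the two examples from the top of this issue, this gives `NM_002342.3(BRCA2):c.5094-11G>A` and `NM_010934.4:c.3112A>G`, each with a list of the fixes that were applied. Keeping warnings (fixed) separate from errors (not fixed) lets a search box accept the cleaned string while a strict import can still reject anything that needed changes.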