SACGF / cdot

Transcript versions for HGVS libraries
MIT License

cdot responsible for fixing bad HGVS, allow warnings etc #27

Open davmlaw opened 1 year ago

davmlaw commented 1 year ago

There are plenty of bad HGVS strings out there, especially when people are typing into a search box, e.g. they put spaces in, forget the colon, have unbalanced brackets, etc.

VariantGrid has a lot of functionality to handle sloppy/bad HGVSs - mostly in search and HGVS Matcher

Ideally we should move all of this functionality into cdot, so that it can be generally useful.

Would be nice to have a framework where you return a list of well structured warnings / errors etc.
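As a rough sketch of what such a framework could look like (this is not existing cdot or VariantGrid code; `clean_hgvs`, `CleanResult`, `HGVSMessage` and the message codes are all made-up names, and the missing-colon regex is deliberately simplistic), each cleaning step could append a structured message instead of silently changing the string:

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    WARNING = "warning"  # string was fixed, caller may want to tell the user
    ERROR = "error"      # could not be fixed safely


@dataclass
class HGVSMessage:
    severity: Severity
    code: str     # hypothetical machine-readable code
    message: str  # human-readable explanation


@dataclass
class CleanResult:
    original: str
    cleaned: str
    messages: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not any(m.severity is Severity.ERROR for m in self.messages)


# Assumption: accession immediately followed by the kind prefix, e.g. "NM_000059.3c."
_MISSING_COLON = re.compile(r"^([A-Z]{2}_\d+(?:\.\d+)?)([cgnmpr]\.)")


def clean_hgvs(hgvs_string: str) -> CleanResult:
    result = CleanResult(original=hgvs_string, cleaned=hgvs_string)

    # 1. Strip all whitespace
    no_space = re.sub(r"\s+", "", result.cleaned)
    if no_space != result.cleaned:
        result.messages.append(
            HGVSMessage(Severity.WARNING, "whitespace_removed", "Removed whitespace"))
        result.cleaned = no_space

    # 2. Insert a colon between accession and kind if it was forgotten
    with_colon, n = _MISSING_COLON.subn(r"\1:\2", result.cleaned)
    if n:
        result.messages.append(
            HGVSMessage(Severity.WARNING, "colon_inserted", "Inserted missing ':'"))
        result.cleaned = with_colon

    # 3. Unbalanced brackets: report as an error rather than guess the fix
    if result.cleaned.count("(") != result.cleaned.count(")"):
        result.messages.append(
            HGVSMessage(Severity.ERROR, "unbalanced_brackets", "Unbalanced brackets"))

    return result
```

A caller could then decide what to do with the messages, e.g. a search page shows warnings to the user, while a script just checks `result.ok` and logs the rest.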

davmlaw commented 1 year ago

It would be good to collect a large test set of bad HGVSs (from search bars around the place) and then work out how to resolve them.

TheMadBug commented 1 year ago

The two big issues we see in Shariant search (the examples aren't valid HGVS, they just show the kinds of issues):

davmlaw commented 1 year ago

Will run on each environment:

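import socket

import pandas as pd
from django.db.models import Q
from eventlog.models import Event

# Tag the output file with this server's hostname
hostname = socket.gethostname()
# Search events whose details look like they contain a coding HGVS ("c." or ":c")
search_qs = Event.objects.filter(name='search')
search_hgvs = search_qs.filter(Q(details__icontains='c.') | Q(details__icontains=':c'))
# Name the columns explicitly, otherwise the CSV header would be "0,1"
columns = ["date", "details"]
df = pd.DataFrame.from_records(list(search_hgvs.values_list(*columns)), columns=columns)

df.to_csv(f"/tmp/{hostname}_search_hgvs.csv", index=False)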

Then collect them all together. I have put the scripts in the "paper" directory of the cdot GitHub repo.
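For reference, the collection step could be as simple as the sketch below. This is not the script from the "paper" directory. `combine_search_csvs` is a made-up name, it uses only the standard library, and it assumes each file is named `<hostname>_search_hgvs.csv` with one header row followed by date and details columns:

```python
import csv
from pathlib import Path

SUFFIX = "_search_hgvs.csv"


def combine_search_csvs(csv_dir):
    """Return (hostname, date, details) rows, keeping the first row per unique details string."""
    seen = set()
    rows = []
    for path in sorted(Path(csv_dir).glob(f"*{SUFFIX}")):
        hostname = path.name[: -len(SUFFIX)]  # recover the hostname from the filename
        with path.open(newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row, whatever it is called
            for record in reader:
                if len(record) < 2:
                    continue  # ignore blank or malformed lines
                date, details = record[0], record[1]
                if details in seen:
                    continue  # same search text already collected
                seen.add(details)
                rows.append((hostname, date, details))
    return rows
```

De-duplicating on the raw details string keeps the set small, at the cost of losing how often each bad string was typed. If frequency matters, count instead of dropping.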

Emailed the CSV to James and myself to continue the analysis (I need to clean out anything from private servers before I share it)

davmlaw commented 1 year ago

Web developers know to clean their user text, but I think the main use case of cdot would be bioinformaticians hacking together scripts

We could run an evaluation of how many HGVSs resolve from the literature and ClinVar etc as well
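A minimal sketch of how that evaluation could be tallied, assuming we have some callable that raises on failure. `resolution_summary` and `toy_resolver` are made-up names, and the toy resolver is only a stand-in for a real HGVS library:

```python
def resolution_summary(hgvs_strings, resolver):
    """Return (fraction resolved, list of (string, exception name) for failures)."""
    resolved = 0
    failures = []
    for hgvs_string in hgvs_strings:
        try:
            resolver(hgvs_string)
        except Exception as e:  # deliberately broad: we want to tally every failure type
            failures.append((hgvs_string, type(e).__name__))
        else:
            resolved += 1
    total = resolved + len(failures)
    rate = resolved / total if total else 0.0
    return rate, failures


def toy_resolver(hgvs_string):
    """Stand-in resolver: only checks for a colon and no spaces, nothing more."""
    if " " in hgvs_string or ":" not in hgvs_string:
        raise ValueError(f"Cannot parse {hgvs_string!r}")
    return hgvs_string
```

Running the same input list through the resolver before and after cleaning would show how many extra strings the cleaning ops rescue, and grouping the failures by exception name gives a first idea of which problems are most common.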

davmlaw commented 1 year ago

Few thoughts:

At the moment, no modification is done on HGVS import.
This change proposes to add one cleaning op.
On search, a number of cleaning ops are performed.

Search currently works via:

Few ideas: