SACGF / variantgrid

VariantGrid public repo
Other
23 stars 2 forks source link

Classification import - cleaning HGVS #857

Open davmlaw opened 1 year ago

davmlaw commented 1 year ago

NOTE: Private discussion using data here: https://github.com/SACGF/variantgrid_private/issues/3492

We clean HGVS while searching to remove common errors.

In classification imports, we have a very simple clean (strip whitespace) in imported_c_hgvs

A clinvar variant in test has non-standard HGVS "dup23"

Current behavior was to just load this with PyHGVS, and then when you re-generate it, it doesn't have the dup23 on the end, eg:

Imported (GRCh38) NM_000127.2(EXT1):c.1315_1337dup23
Normalised (GRCh38) NM_000127.2(EXT1):c.1315_1337dup
Liftover (GRCh37) NM_000127.2(EXT1):c.1315_1337dup

I can't seem to see the "c.HGVS changed" flag - was it raised on this? Should it have been?

James says:

I don’t mind removing spaces from imported data, but making structural changes to imported values starts to get iffy. Would rather have it as a method on imported allele info - so you can get what was actually provided or the cleaned up version.

TheMadBug commented 1 year ago

Re the lack of a change flag - the CHGVSDiff, it specifically has a code for DIFF_RAW_CGVS_EXPANDED - which distinguishes between trailing numbers or nucleotides where one has them but the other doesn't, so these don't cause import warnings.

Likewise you can see the other import for that allele NM_000127.2(EXT1):c.1315_1337dupTCACGTAACAGTTTAATATGGAA was normalised to NM_000127.2(EXT1):c.1315_1337dup also without a flag.

The "23" suffix was from ClinVar (which isn't a lab we generally have to worry about in prod).

TheMadBug commented 1 year ago

The other question is if we'd want to store the fact that we resolved it from "NM_000127.2(EXT1):c.1315_1337dupTCACGTAACAGTTTAATATGGAA" or if we're happy enough to say we resolved it from "NM_000127.2(EXT1):c.1315_1337dup"

Do we currently check reference sequences for dup/dels? And if we do, do we do that if there's more than 10? If not, and we'd treat the following two the same

NM_000127.2(EXT1):c.1315_1337dupTTTTTTTTTTTTTTTTTTTTTTT
NM_000127.2(EXT1):c.1315_1337dupTCACGTAACAGTTTAATATGGAA

then I wouldn't see any benefit in us not cleaning it at the Allele Info stage and claiming it was imported as "NM_000127.2(EXT1):c.1315_1337dup". The distinction here between this and normalisation is we're purely formatting the data per the HGVS spec, almost the same as if we fixed a lower case nm.

Will discuss if there are any implications I haven't thought of.

davmlaw commented 1 year ago

See issue https://github.com/SACGF/variantgrid/issues/659

PyHGVS behavior is to use provided references, or load them if not present

We do a reference check afterwards

TheMadBug commented 1 year ago

Given that, (and since you only mentioned when the number of nucleotides is redundantly mentioned), are you only thinking of auto-cleaning NM_000127.2(EXT1):c.1315_1337dup23 to NM_000127.2(EXT1):c.1315_1337dup

but when it's nucleotides, leaving them as NM_000127.2(EXT1):c.1315_1337dupTCACGTAACAGTTTAATATGGAA ?

davmlaw commented 1 year ago

Yes, I just want to strip the dup23 at the moment.

If it's purely a suffix-strip, are you ok with just doing it in imported_c_hgvs, and add a note there about what is allowed to happen there (ie DIFF_RAW_CGVS_EXPANDED)

Then if we want to get more sophisticated about cleaning we can do it later properly?

EmmaTudini commented 1 year ago

@davmlaw When stripping the number at the end of dup, will it still do a check that the range specified by the chgvs is the same as the number? E.g. NM_000127.2(EXT1):c.1315_1337dup23 --> will it check that from 1315 to 1337 is 23 bases? This is what is currently done, and if it doesn't match, we generate an error and the whole variant fails.

See: Classification 104674 in test

davmlaw commented 1 year ago

I've fixed this for all biocommons HGVS resolution (search and import) - @EmmaTudini yes it will validate the length like in PyHGVS

The question of "how much HGVS cleaning do we do and where" is tricky. Maybe because it's something that worked historicaly with pyhgvs, and thus seems like a gap in Biocommons HGVS functionality I have done it in all the biocommons HGVS parsing pathways (silently, though it errors if the number doesn't match the span)

You will be able to see it as a c.HGVS import change (like it did with PyHGVS)

EmmaTudini commented 1 year ago

@davmlaw Do you know if this change made it in before James re-generated the file today? If the code should have been there, I think there might be something going on. The below still seem to generate errors according to the variant compare file

NM_000127.2(EXT1):c.1315_1337dup23 NM_023067.3(FOXL2):c.951_961dup11 NM_001354689.1(RAF1):c.1_2dup2 NM_023067.3(FOXL2):c.672_701dup30 NM_023067.3(FOXL2):c.907_926dup20

davmlaw commented 1 year ago

No, the CSV attached to the issue was done before this change was applied.

I think James has re-run it again since the change was applied but it hasn't been attached yet

EmmaTudini commented 11 months ago

Testing:

Deletion

  1. NM_000127.2(EXT1):c.1315_1337del23
  2. NM_000127.2(EXT1):c.1315_1337del24 Expected output: As above Actual output: Failed - no incorrect error span for option 4. It just resolved. See link to tool - https://test.shariant.org.au/classification/hgvs_resolution_tool?genome_build=GRCh37&hgvs=NM_000127.2%28EXT1%29%3Ac.1315_1337del24

Note: Imported into Shariant test using biocommons. Same issue as above

EmmaTudini commented 11 months ago

Testing - for whether specification of bases can detect whether the span is correct Dup

  1. NM_000127.2(EXT1):c.1315_1320dupTTTTTT - incorrect bases but correct span
  2. NM_000127.2(EXT1):c.1315_1320dupTTTTTTT - incorrect span
  3. NM_000127.2(EXT1):c.1315_1320delTCACGT- correct bases and span
  4. NM_000127.2(EXT1):c.1315_1320delTCACGTT- incorrect span

1 - should resolve to correct ref bases 2 - should say incorrect span 3 - should stay the same 4 - should say incorrect span

Actual output: Failed - incorrect span error not provided for 2 and 4

Note: Imported into Shariant test using biocommons - same issues as above

davmlaw commented 5 months ago

This has been fixed - however because we have cached ImportedAlleleInfo, historically wrong ones (eg the test cases) will still fail.

Using a new value during classification import test "NM_000127.2(EXT1):c.1315_1320delTCACGTTT" - gives Invalid HGVS name "Coordinate span (6) not equal to ref length 8"

I guess we will have to process historical data, perhaps raise some errors if need be


Using HGVS test tool on test cases above:

NM_000127.2(EXT1):c.1315_1320dupTTTTTTT

pyhgvs - Invalid HGVS name Coordinate span (6) not equal to ref length 7 biocommons - Calculated reference 'TCACGT' length (6) different from provided reference 'TTTTTTT' length (7)

NM_000127.2(EXT1):c.1315_1320delTCACGTT - Invalid HGVS name Coordinate span (6) not equal to ref length 7

pyhgvs - Invalid HGVS name "Coordinate span (6) not equal to ref length 7" biocommons - Calculated reference 'TCACGT' length (6) different from provided reference 'TCACGTT' length (7)

davmlaw commented 4 months ago

Can find these via:

from classification.models.classification_variant_info_models import ImportedAlleleInfo, ImportedAlleleInfoStatus
from functools import reduce
import operator
from django.db.models import Q

regex_string = r"\d+_\d+(del|dup)[GATC]+"
filters = [
    Q(imported_c_hgvs__regex=regex_string),
    Q(imported_g_hgvs__regex=regex_string)
]
q = ~Q(status=ImportedAlleleInfoStatus.FAILED) & reduce(operator.or_, filters)
qs.filter(q).count()

Then we need to actually re-run the HGVS and check whether it fails. Then send those through to

from classification.classification_import import reattempt_variant_matching
from library.guardian_utils import admin_bot

qs = [] # Get the ones that fail hgvs matching

reattempt_variant_matching(admin_bot(), qs, clear_existing=True)
davmlaw commented 4 months ago

Made management command + manual task to fix these

EmmaTudini commented 2 months ago

@davmlaw is this restricted to biocommons? If so, I'll keep testing for it to the next deploy

davmlaw commented 2 months ago

@EmmaTudini - this works for PyHGVS and Biocommons HGVS

EmmaTudini commented 2 months ago

Testing - pyhgvs (currently enabled in test)

EmmaTudini commented 2 months ago

Testing – Biocommons

  1. NM_004415.2(DSP):c.1_45del45 - Passed
  2. NM_004415.2(DSP):c.1_45dup45 - Passed
  3. NM_004415.2(DSP):c.1_45del44 - Failed - does not fail matching with incorrect deletion length
  4. NM_004415.2(DSP):c.1_45dup44 – Passed
  5. NM_004415.2(DSP):c.1_45del ATGAGCTGCAACGGAGGCTCCCACCCGCGGATCAACACTCTGGGC - Passed
  6. NM_004415.2(DSP):c.1_45dupATGAGCTGCAACGGAGGCTCCCACCCGCGGATCAACACTCTGGGC – Passed
  7. NM_004415.2(DSP):c.1_45del ATGAGCTGCAACGGAGGTCCCACCCGCGGATCAACACTCTGGGC – Passed
  8. NM_004415.2(DSP):c.1_45dupTGAGCTGCAACGGAGGCTCCCACCCGCGGATCAACACTCTGGGC – Passed

Note: Bases were not checked for del/dup in test 7 and 8 – See issue https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/gh/sacgf/variantgrid/1096

davmlaw commented 2 months ago

Main problem is that checks are not done in matching process

PyHGVS passed above, just biocommons Seems to miss the integer length check for dels

Can wait till next deploy

davmlaw commented 2 months ago

I think I just blindly fixed for the "dup" examples and didn't do the del. I have also done INV

The other ones with coordinate spans - ins and 'delisn' are already handled via "Insertions require inserted sequence, not an integer length"

NM_004415.2(DSP):c.1_1051dup25448 – should fail with incorrect length. Should be 25445

Pretty sure pyHGVS is wrong here - it's using genome coordinates

If you enter "NM_004415.2(DSP):c.1_1051dup1051" it fails in PyHGVS