Closed - davmlaw closed this issue 1 month ago
Also note that different guidelines have been suggested for CNVs - see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7313390/. May need to allow for a different classification schema for CNVs. Will need to ask the labs how they interpret CNVs first.
Need to integrate with systems for Shariant, collecting here: https://github.com/SACGF/shariant-admin/wiki/CNV-design
Riggs paper attached https://app.zenhub.com/files/299486514/069a8eca-8996-4ea5-a51d-b84279c61853/download
I think this part of the GA4GH VRS is related to CNVs - https://vrs.ga4gh.org/en/latest/terms_and_model.html#sequence-expression Might be helpful as a pseudo allele ID? May also help with normalisation - https://vrs.ga4gh.org/en/latest/impl-guide/normalization.html#normalization
Then I think VRSATILE is an extension of this... https://vrsatile.readthedocs.io/en/latest/index.html
Would need to organise a meeting with the leads to understand further
They have also done a bunch of work for somatic variants - see slides and recording under "fistful of categorical variants". https://docs.google.com/document/d/1M4izAS5e_iYUzEvEn2WaNqOX__-HcA4uEZBUWdnufOU/edit#heading=h.78t6rplrkaah
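As a rough illustration of how VRS could give us a pseudo allele ID: VRS identifiers are built on the GA4GH `sha512t24u` truncated digest (SHA-512, first 24 bytes, base64url, no padding). A minimal sketch - note the CNV serialisation string below is made up for illustration, it is NOT the VRS spec's canonical serialisation:

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: SHA-512, keep first 24 bytes, base64url-encode (no padding)."""
    digest = hashlib.sha512(blob).digest()[:24]
    return base64.urlsafe_b64encode(digest).decode("ascii")


def cnv_pseudo_id(chrom: str, start: int, end: int, alt: str) -> str:
    # NOTE: placeholder serialisation for illustration only - real VRS digests
    # are computed over a normalized, canonically-serialized VRS object.
    blob = f"{chrom}:{start}-{end}:{alt}".encode()
    return f"cnv:{sha512t24u(blob)}"


print(cnv_pseudo_id("12", 88126354, 88141011, "<DEL>"))
```

The digest is deterministic, so two labs submitting the same normalized CNV would derive the same ID - which is the property we'd want for matching in Shariant.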
Benchmarks after adding Variant.end = IntegerField()
Laptop
Full reload of Trio analysis on affected mother 2008 trio: http://localhost:8000/analysis/10743/
18.45, 19.62, 19.7, 17.99, 19.5 = 19.05 avg
CCB West
http://localhost:8000/analysis/53/ - full reload of 2008 trio w/ patient pheno node
Before:
14.73, 14.59, 15.1, 14.76, 14.49 = 14.73 avg
After:
15.88, 15.61, 16.05, 17.02, 16.4 = 16.19 avg
So it's around 10% slower after adding the extra DB fields.
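A quick worked check of the ~10% figure (pure arithmetic, nothing VariantGrid-specific):

```python
from statistics import mean

# Benchmark runs (seconds) from the CCB West numbers above
before = [14.73, 14.59, 15.1, 14.76, 14.49]
after = [15.88, 15.61, 16.05, 17.02, 16.4]

slowdown = mean(after) / mean(before) - 1
print(f"before={mean(before):.2f}s after={mean(after):.2f}s slowdown={slowdown:.1%}")
# -> before=14.73s after=16.19s slowdown=9.9%
```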
User testing (ie Shariant)
From a user point of view - not much has changed.
There was one very big internal change (75 files), which involves being able to load (from VCF) and store internally symbolic alleles, eg `<DEL>` and `<DUP>`, instead of explicitly storing huge ref/alt sequences.
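For reference, symbolic ALT alleles in the VCF spec are angle-bracketed IDs. A minimal check (hypothetical helper for illustration, not VariantGrid's actual code) might look like:

```python
import re

# Symbolic ALTs per the VCF spec look like <DEL>, <DUP>, <DUP:TANDEM>, <CNV> etc.
SYMBOLIC_ALT_RE = re.compile(r"^<[^<>]+>$")


def is_symbolic_alt(alt: str) -> bool:
    """True if alt is a symbolic allele rather than an explicit sequence."""
    return bool(SYMBOLIC_ALT_RE.match(alt))


print(is_symbolic_alt("<DEL>"), is_symbolic_alt("ACGT"))  # -> True False
```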
So - we need to check that HGVS can be resolved for del/dups, that such variants can be created, that only one can be created (no duplicates), that they resolve, etc etc.
I changed the "internal variant representation", which used to grow extremely large; now it looks like eg `chr1:1000-2000 <DEL>`
@TheMadBug Could we do a big upload from ClinVar of dels/dups to test this?
Testing:
- Uploaded the following variants
Expected output: I think they should resolve (because they did before this change), but the resolution should show the variant coordinates as … or …
Actual output: Failed - First 4 options resulted in errors below:
UploadPipeline 7863 failed. Filename: VCF - Insert variants only (no samples etc) Error: No split VC..
No split VCF records. This is caused by pipeline error or empty VCF after cleaning
5th option worked and maintained the old variant coordinate
Another example - NM_000441.2(SLC26A4):c.1246_2341ins23 and NM_000441.2(SLC26A4):c.1_705dup
I think this is because the VCF was written with an N as the REF base:
Eg do classification import with "NM_006758.2(U2AF1):c.101_303del"
21 43095482 . N <DEL> . . SVLEN=8864;SVTYPE=DEL
The upload had a message: "Warning: Skipped 1 'non-standard bases in REF sequence' records"
While if you create it from a variant search for "NM_006758.2(U2AF1):c.101_303del", it has the ref base in there:
21 44515592 . G <DEL> . . SVLEN=8864;SVTYPE=DEL
This is because ImportedAlleleInfo works by storing the strings and then populating the object from them; this doesn't always set the reference base:
In [15]: VariantCoordinate.from_string("12:88126354-88141011 <DEL>")
Out[15]: VariantCoordinate(chrom='12', start=88126354, end=88141011, ref='N', alt='<DEL>')
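A toy re-implementation showing the behaviour above - when the string carries no REF base, a placeholder 'N' is used (this is a sketch for illustration, not VariantGrid's actual VariantCoordinate):

```python
import re
from typing import NamedTuple


class VariantCoordinate(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


# Matches strings like "12:88126354-88141011 <DEL>"
SYMBOLIC_RE = re.compile(r"^(?P<chrom>\w+):(?P<start>\d+)-(?P<end>\d+)\s+(?P<alt><\w+>)$")


def from_string(s: str) -> VariantCoordinate:
    """Parse 'chrom:start-end <ALT>'; the REF base is unknown, so fall back to 'N'."""
    m = SYMBOLIC_RE.match(s)
    if not m:
        raise ValueError(f"Could not parse {s!r}")
    return VariantCoordinate(m["chrom"], int(m["start"]), int(m["end"]), ref="N", alt=m["alt"])


print(from_string("12:88126354-88141011 <DEL>"))
# -> VariantCoordinate(chrom='12', start=88126354, end=88141011, ref='N', alt='<DEL>')
```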
Fixed Emma's example above - just needed to pass in the genome build so that it would populate the reference base if not there
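The fix can be sketched like this - given the genome build's sequence, replace the placeholder REF with the real base. The lookup here is a plain dict standing in for a real FASTA/genome-build accessor (hypothetical names, not the actual implementation):

```python
def fill_reference_base(chrom: str, start: int, ref: str, genome_lookup: dict) -> str:
    """Replace a placeholder 'N' REF with the actual base from the genome build.

    genome_lookup: mapping of (chrom, 1-based position) -> base; a stand-in
    for a real genome build / FASTA accessor.
    """
    if ref == "N":
        return genome_lookup[(chrom, start)]
    return ref


# From the example above: chr21:44515592 has REF 'G' when resolved via variant search
lookup = {("21", 44515592): "G"}
print(fill_reference_base("21", 44515592, "N", lookup))  # -> G
```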
Testing now recorded in private issue - https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/gh/sacgf/variantgrid_private/3645
This is one of the early SV issues - it has now been used/tested very thoroughly, so we are down to very minor issues
Tested eg NM_000441.2(SLC26A4):c.1246_2340dup - works fine on VG test, and now loads with conservation too in VEP 112
We would like to store very large CNVs in the system. We currently can't read symbolic alts (ie `<DEL>` or `<DUP>`) from VCFs. We can convert HGVS indels into very large ref/alt sequences, but that has a number of problems, including display issues, and #15 - large sequences crashing VEP
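To illustrate why explicit ref/alt gets unwieldy: for a ~14.7 kb deletion, the explicit REF must carry every deleted base, while the symbolic form stays a few bytes regardless of size (fake sequence below, for illustration only):

```python
# Compare explicit vs symbolic VCF representations of the same deletion.
start, end = 88126354, 88141011
deleted_len = end - start  # ~14.7 kb

# Explicit style: REF carries the anchor base plus every deleted base
explicit_ref = "A" * (deleted_len + 1)  # fake sequence for illustration
explicit_alt = "A"

# Symbolic style: fixed-size ALT plus INFO fields, independent of event size
symbolic_alt = "<DEL>"
symbolic_info = f"SVTYPE=DEL;SVLEN=-{deleted_len}"

print(len(explicit_ref), len(symbolic_alt) + len(symbolic_info))
```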
So this task is to:
Further CNV issues have been split off:
#824 CNV Callers and VCF processing
#818 CNV Normalization and Representation
#823 CNV Annotation - separate pipeline for large indels