SACGF / variantgrid

VariantGrid public repo

Structural Variant - symbolic variants #54

Closed: davmlaw closed this issue 1 month ago

davmlaw commented 4 years ago

We would like to store very large CNVs in the system. We currently can't read symbolic alts from VCFs (i.e. <DEL> or <DUP>).

We can convert HGVS indels into very large ref/alt sequences, which has a number of problems, including display issues, and #15 - large sequences crashing VEP

So this task is to:


Further CNV issues have been split off:

EmmaTudini commented 3 years ago

Also note that different guidelines have been suggested for CNVs - see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7313390/. May need to allow for a different classification schema for CNVs. Will need to ask the labs how they interpret CNVs first.

EmmaTudini commented 3 years ago

See also https://github.com/SACGF/shariant-admin/wiki/Agilent-Alissa-CNVs

davmlaw commented 3 years ago

Need to integrate with systems for Shariant, collecting here: https://github.com/SACGF/shariant-admin/wiki/CNV-design

EmmaTudini commented 2 years ago

Riggs paper attached https://app.zenhub.com/files/299486514/069a8eca-8996-4ea5-a51d-b84279c61853/download

EmmaTudini commented 2 years ago

I think this part of the GA4GH VRS is related to CNVs - https://vrs.ga4gh.org/en/latest/terms_and_model.html#sequence-expression Might be helpful as a pseudo allele ID? May also help with normalisation - https://vrs.ga4gh.org/en/latest/impl-guide/normalization.html#normalization

Then I think VRSATILE is an extension of this... https://vrsatile.readthedocs.io/en/latest/index.html

Would need to organise a meeting with the leads to understand further

They have also done a bunch of work for somatic variants - see slides and recording under "fistful of categorical variants". https://docs.google.com/document/d/1M4izAS5e_iYUzEvEn2WaNqOX__-HcA4uEZBUWdnufOU/edit#heading=h.78t6rplrkaah

davmlaw commented 1 year ago

Benchmarks after adding Variant.end = IntegerField()

Laptop

Full reload of Trio analysis on affected mother 2008 trio: http://localhost:8000/analysis/10743/

18.45, 19.62, 19.7, 17.99, 19.5

CCB West

http://localhost:8000/analysis/53/ - full reload of 2008 trio w/patient pheno node

14.73, 14.59, 15.1, 14.76, 14.49 = 14.73 avg

After:

15.88, 15.61, 16.05, 17.02, 16.4 = 16.192

So it's around 10% slower, adding the extra DB fields
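
The arithmetic behind the "around 10%" figure can be checked with a short sketch. The timings are copied from the CCB West runs above; the helper name `percent_slower` is made up for illustration and is not part of VariantGrid.

```python
from statistics import mean

# Timings copied from the CCB West runs in this comment
# (full reload of the 2008 trio, before and after adding Variant.end).
before = [14.73, 14.59, 15.1, 14.76, 14.49]
after = [15.88, 15.61, 16.05, 17.02, 16.4]


def percent_slower(before_runs, after_runs):
    """Relative increase of the mean of after_runs over the mean of before_runs, in percent."""
    baseline = mean(before_runs)
    return (mean(after_runs) - baseline) / baseline * 100


slowdown = percent_slower(before, after)
print(f"before avg: {mean(before):.3f}")
print(f"after avg:  {mean(after):.3f}")
print(f"slowdown:   {slowdown:.1f}%")
```

With only five runs per condition this is a rough estimate, not a rigorous benchmark.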

davmlaw commented 1 year ago

https://cnvar.org/resources/CNV-annotation-standards/#cnv-term-use-comparison-in-computational-fileschema-formats

davmlaw commented 1 year ago

User testing (ie Shariant)

From a user point of view - not much has changed.

There was a very big internal change (75 files), which involved being able to load (from VCF) and internally store symbolic alleles, e.g. <DEL> and <DUP>, instead of explicitly storing huge ref/alt sequences.

So - need to check that you can resolve HGVS for del/dups, that they can be created, only 1 can be created, and they resolve etc etc.

I changed the "internal variant representation" which used to grow extremely large and now it looks like eg chr1:1000-2000 <DEL>
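
As a rough illustration of why the symbolic form stays small, here is a minimal sketch. It is not VariantGrid's code: both helper functions are hypothetical, the explicit `ref>alt` layout is only an assumption about what a fully spelled-out deletion might look like, and the sequence is toy data.

```python
def symbolic_repr(chrom: str, start: int, end: int, alt: str) -> str:
    """Build a string in the style quoted above, e.g. 'chr1:1000-2000 <DEL>'."""
    return f"{chrom}:{start}-{end} {alt}"


def explicit_deletion_repr(chrom: str, start: int, ref_sequence: str) -> str:
    """Hypothetical explicit form: the whole deleted sequence as ref, its first base as alt."""
    return f"{chrom}:{start} {ref_sequence}>{ref_sequence[0]}"


# Toy 1,000-base sequence standing in for the deleted region (not real genome data).
toy_ref = "ACGT" * 250

symbolic = symbolic_repr("chr1", 1000, 2000, "<DEL>")
explicit = explicit_deletion_repr("chr1", 1000, toy_ref)

print(symbolic)
print(len(symbolic), len(explicit))
```

The symbolic string has a fixed size regardless of how long the event is, while the explicit one grows with every deleted base.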

EmmaTudini commented 1 year ago

@TheMadBug Could we do a big upload from ClinVar of dels/dups to test this?

EmmaTudini commented 1 year ago

Testing: uploaded the following variants:

  1. NM_000441.2(SLC26A4):c.1246_2340dup
  2. NM_006758.2(U2AF1):c.101_303del
  3. NM_025114.4(CEP290):c.125_1025del
  4. NM_007294.3:c.671_4096del - within one exon
  5. NM_007294.3(BRCA1):c.671_1650dup - less than 1k bases

Expected output: I think they should resolve (because they did before this change), but in the resolution, it should show the variant coordinates as <DEL> or <DUP>

Actual output: Failed - First 4 options resulted in errors below:

4598 UploadPipeline 7850 failed. Filename: VCF - Insert variants only (no samples etc) Error: No split VCF records. This is caused by pipeline error or empty VCF after cleaning

UploadPipeline 7863 failed. Filename: VCF - Insert variants only (no samples etc) Error: No split VC..

5th option worked and maintained the old variant coordinate

EmmaTudini commented 1 year ago

Another example -
NM_000441.2(SLC26A4):c.1246_2341ins23
NM_000441.2(SLC26A4):c.1_705dup

davmlaw commented 12 months ago

I think this is because the VCF was written with an N:

Eg do classification import with "NM_006758.2(U2AF1):c.101_303del"

21  43095482    .   N   <DEL>   .   .   SVLEN=8864;SVTYPE=DEL

The upload had a message: "Warning: Skipped 1 'non-standard bases in REF sequence' records"

While if you create it from a variant search for "NM_006758.2(U2AF1):c.101_303del", it has the ref base in there:

21  44515592    .   G   <DEL>   .   .   SVLEN=8864;SVTYPE=DEL

This is because ImportedAlleleInfo works by storing the strings and then populating the object from them, which doesn't always set the reference base:

In [15]: VariantCoordinate.from_string("12:88126354-88141011 <DEL>")
Out[15]: VariantCoordinate(chrom='12', start=88126354, end=88141011, ref='N', alt='<DEL>')
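
A minimal sketch of the failure mode described here, not the real `VariantCoordinate.from_string`: the `Coordinate` tuple, the regex and both helpers are hypothetical stand-ins. It also assumes the 'non-standard bases in REF sequence' check accepts only A, C, G and T.

```python
import re
from typing import NamedTuple


class Coordinate(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


# Hypothetical pattern for strings like "12:88126354-88141011 <DEL>".
SYMBOLIC_PATTERN = re.compile(
    r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)\s+(?P<alt><[A-Z]+>)$"
)


def parse_symbolic(text: str) -> Coordinate:
    """Parse 'chrom:start-end <ALT>'; the string carries no ref base, so use 'N'."""
    match = SYMBOLIC_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Not a symbolic variant string: {text!r}")
    return Coordinate(
        chrom=match["chrom"],
        start=int(match["start"]),
        end=int(match["end"]),
        ref="N",  # placeholder: nothing in the string tells us the real base
        alt=match["alt"],
    )


def has_standard_ref(ref: str) -> bool:
    """Assumed cleaning rule: REF must be non-empty and contain only A, C, G or T."""
    return bool(ref) and set(ref.upper()) <= set("ACGT")


coord = parse_symbolic("12:88126354-88141011 <DEL>")
print(coord)
print(has_standard_ref(coord.ref))  # the placeholder ref fails the assumed check
```

Under these assumptions, a record written out from the parsed string has REF "N" and is dropped during cleaning, which would leave an empty VCF.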

davmlaw commented 12 months ago

Fixed Emma's example above - Just needed to pass in genome build so that it would populate reference base if not there
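
The idea of the fix can be sketched as follows. This is not the actual change: `populate_ref`, the `BaseLookup` signature and the toy genome are all hypothetical, and a real lookup would read from the reference FASTA for the given genome build.

```python
from typing import Callable, NamedTuple, Optional


class Coordinate(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


# Hypothetical lookup signature: (chrom, position) -> reference base.
BaseLookup = Callable[[str, int], str]


def populate_ref(coord: Coordinate, lookup: Optional[BaseLookup] = None) -> Coordinate:
    """Replace a placeholder 'N' ref with the real base when a lookup is available."""
    if coord.ref != "N" or lookup is None:
        return coord  # ref already set, or no genome build to consult
    return coord._replace(ref=lookup(coord.chrom, coord.start))


# Toy stand-in for a genome build (made-up data, one position only).
toy_genome = {("1", 1000): "G"}


def toy_lookup(chrom: str, position: int) -> str:
    return toy_genome[(chrom, position)]


parsed = Coordinate("1", 1000, 2000, "N", "<DEL>")
without_build = populate_ref(parsed)
with_build = populate_ref(parsed, toy_lookup)

print(without_build)
print(with_build)
```

Without a lookup the placeholder stays in place, which matches the behaviour seen before the genome build was passed in.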

EmmaTudini commented 3 months ago

Testing now recorded in private issue - https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/gh/sacgf/variantgrid_private/3645

davmlaw commented 1 month ago

This is one of the early SV issues. We have been using and testing this very thoroughly, and are now down to very minor issues.

Tested e.g. NM_000441.2(SLC26A4):c.1246_2340dup, which works fine on VG test, and now loads with conservation too in VEP 112.