davmlaw commented 4 years ago

We would like to store very large CNVs in the system. We currently can't read symbolic alts from VCFs (ie <DEL> or <DUP>)

We can convert HGVS indels into very large ref/alt sequences, which has a number of problems, including display issues, and #15 - large sequences crashing VEP

So this task is to:

Support symbolic alts as the database representation (for >1kb size variants)
Convert existing >1kb to symbolics
Large CNVs appear ok on pages rather than dumping kilobytes of sequence on the page (find existing issue for this?)
Parse symbolics from VCF
Make sure all existing functionality eg HGVS still works
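
The first two tasks above could be sketched roughly as below. This is only an illustration and not VariantGrid's code: the `Coordinate` class, the `to_symbolic` helper and the 1000 base threshold (read off the ">1kb" above) are all assumptions. It only handles VCF-style deletions, since recognising a duplication needs a reference lookup.

```python
from dataclasses import dataclass

SYMBOLIC_MIN_LENGTH = 1000  # assumption: taken from the ">1kb" in the task list


@dataclass(frozen=True)
class Coordinate:
    """Hypothetical stand-in, not VariantGrid's actual class."""
    chrom: str
    position: int  # 1-based, as in VCF POS
    end: int
    ref: str
    alt: str


def to_symbolic(chrom, position, ref, alt, min_length=SYMBOLIC_MIN_LENGTH):
    """Collapse a long explicit deletion into a <DEL> symbolic alt.

    Only handles VCF-style deletions (ALT is the single anchor base that
    REF starts with). Anything else keeps its explicit sequences.
    """
    end = position + len(ref) - 1  # last reference base covered
    deleted = len(ref) - len(alt)
    is_deletion = len(alt) == 1 and ref.startswith(alt)
    if is_deletion and deleted > min_length:
        # Keep only the anchor base as REF, drop the huge sequence
        return Coordinate(chrom, position, end, ref[0], "<DEL>")
    return Coordinate(chrom, position, end, ref, alt)


if __name__ == "__main__":
    big = to_symbolic("1", 1000, "A" + "C" * 1500, "A")
    print(f"{big.chrom}:{big.position}-{big.end} {big.alt}")
```

Short variants pass through unchanged, so existing small indels would keep their explicit ref/alt.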

Further CNV issues have been split off:

824 CNV Callers and VCF processing
818 CNV Normalization and Representation
823 CNV Annotation - separate pipeline for large indels
CNV Classification will do later

EmmaTudini commented 3 years ago

Also note that different guidelines have been suggested for CNVs - see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7313390/. May need to allow for a different classification schema for CNVs. Will need to ask the labs how they interpret CNVs first.

EmmaTudini commented 3 years ago

davmlaw commented 3 years ago

Need to integrate with systems for Shariant, collecting here: https://github.com/SACGF/shariant-admin/wiki/CNV-design

EmmaTudini commented 2 years ago

Riggs paper attached https://app.zenhub.com/files/299486514/069a8eca-8996-4ea5-a51d-b84279c61853/download

EmmaTudini commented 2 years ago

I think this part of the GA4GH VRS is related to CNVs - https://vrs.ga4gh.org/en/latest/terms_and_model.html#sequence-expression Might be helpful as a pseudo allele ID? May also help with normalisation - https://vrs.ga4gh.org/en/latest/impl-guide/normalization.html#normalization

Then I think VRSATILE is an extension of this... https://vrsatile.readthedocs.io/en/latest/index.html

Would need to organise a meeting with the leads to understand further

They have also done a bunch of work for somatic variants - see slides and recording under "fistful of categorical variants". https://docs.google.com/document/d/1M4izAS5e_iYUzEvEn2WaNqOX__-HcA4uEZBUWdnufOU/edit#heading=h.78t6rplrkaah

davmlaw commented 1 year ago

Benchmarks after adding Variant.end = IntegerField()

Laptop

Full reload of Trio analysis on affected mother 2008 trio: http://localhost:8000/analysis/10743/

18.45, 19.62, 19.7, 17.99, 19.5

CCB West

http://localhost:8000/analysis/53/ - full reload of 2008 trio w/patient pheno node

14.73, 14.59, 15.1, 14.76, 14.49 = 14.73 avg

After:

15.88, 15.61, 16.05, 17.02, 16.4 = 16.192

So it's around 10% slower, adding the extra DB fields
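
For reference, the CCB West averages and the slowdown can be recomputed from the timings above (variable names are just for illustration):

```python
from statistics import mean

# CCB West timings in seconds, copied from the runs above
before = [14.73, 14.59, 15.1, 14.76, 14.49]
after = [15.88, 15.61, 16.05, 17.02, 16.4]

before_avg = mean(before)
after_avg = mean(after)
slowdown_pct = (after_avg - before_avg) / before_avg * 100

if __name__ == "__main__":
    print(f"before={before_avg:.3f}s after={after_avg:.3f}s slowdown={slowdown_pct:.1f}%")
```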

davmlaw commented 1 year ago

https://cnvar.org/resources/CNV-annotation-standards/#cnv-term-use-comparison-in-computational-fileschema-formats

davmlaw commented 1 year ago

User testing (ie Shariant)

From a user point of view - not much has changed.

There were very big internal changes (75 files), which involve being able to load (from VCF) and store internally symbolic alleles eg <DEL> and <DUP> instead of explicitly storing huge ref/alt sequences.

So - need to check that you can resolve HGVS for del/dups, that they can be created, only 1 can be created, and they resolve etc etc.

I changed the "internal variant representation" which used to grow extremely large and now it looks like eg chr1:1000-2000 <DEL>

EmmaTudini commented 1 year ago

@TheMadBug Could we do a big upload from ClinVar of dels/dups to test this?

EmmaTudini commented 1 year ago

Testing: uploaded the following variants:

NM_000441.2(SLC26A4):c.1246_2340dup
NM_006758.2(U2AF1):c.101_303del
NM_025114.4(CEP290):c.125_1025del
NM_007294.3:c.671_4096del - within one exon
NM_007294.3(BRCA1):c.671_1650dup - less than 1k bases

Expected output: I think they should resolve (because they were before this change), but in the resolution, it should show the variant coordinates as <DEL> or <DUP>

Actual output: Failed - First 4 options resulted in errors below:

4598 UploadPipeline 7850 failed. Filename: VCF - Insert variants only (no samples etc) Error: No split VCF records. This is caused by pipeline error or empty VCF after cleaning

UploadPipeline 7863 failed. Filename: VCF - Insert variants only (no samples etc) Error: No split VC..

4599 ValueError: No split VCF records. This is caused by pipeline error or empty VCF after cleaning


5th option worked and maintained the old variant coordinate

EmmaTudini commented 1 year ago

More examples:

NM_000441.2(SLC26A4):c.1246_2341ins23
NM_000441.2(SLC26A4):c.1_705dup

davmlaw commented 12 months ago

I think this is because the VCF was written with an N:

Eg do classification import with "NM_006758.2(U2AF1):c.101_303del"

21  43095482    .   N   <DEL>   .   .   SVLEN=8864;SVTYPE=DEL

The upload had a message: "Warning: Skipped 1 'non-standard bases in REF sequence' records"

While if you create it from a variant search for "NM_006758.2(U2AF1):c.101_303del", it has the ref base in there:

21  44515592    .   G   <DEL>   .   .   SVLEN=8864;SVTYPE=DEL
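
A rough sketch of why the first record gets dropped, assuming the cleaning step rejects any REF containing something other than A/C/G/T. The function names are made up and this is not the actual pipeline code:

```python
STANDARD_BASES = set("ACGT")


def parse_vcf_line(line):
    """Parse the first eight columns of a VCF data line, with INFO as a dict.

    Splits on any whitespace so the lines pasted in this thread work too
    (a real VCF is tab-delimited).
    """
    chrom, pos, _id, ref, alt, _qual, _filter, info = line.split()[:8]
    info_dict = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_dict[key] = value
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt, "info": info_dict}


def has_standard_ref(record):
    """True if REF only contains A/C/G/T, so an 'N' placeholder fails."""
    return set(record["ref"].upper()) <= STANDARD_BASES


if __name__ == "__main__":
    lines = [
        "21  43095482    .   N   <DEL>   .   .   SVLEN=8864;SVTYPE=DEL",
        "21  44515592    .   G   <DEL>   .   .   SVLEN=8864;SVTYPE=DEL",
    ]
    for line in lines:
        record = parse_vcf_line(line)
        status = "kept" if has_standard_ref(record) else "skipped"
        print(record["pos"], record["ref"], status)
```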

This is because ImportedAlleleInfo works by storing the strings and then populating the object from them, which doesn't always set the reference base:

In [15]: VariantCoordinate.from_string("12:88126354-88141011 <DEL>")
Out[15]: VariantCoordinate(chrom='12', start=88126354, end=88141011, ref='N', alt='<DEL>')
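
To illustrate the behaviour above, here is a minimal stand-alone parser for that string format. It is not the real `VariantCoordinate.from_string`; the regex and class are assumptions. The string carries no reference base, so the only thing a string-only parser can put in `ref` is a placeholder:

```python
import re
from typing import NamedTuple


class SymbolicCoordinate(NamedTuple):
    """Hypothetical stand-in for VariantCoordinate."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


# e.g. "12:88126354-88141011 <DEL>"
SYMBOLIC_PATTERN = re.compile(
    r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)\s+(?P<alt><[A-Z]+>)$"
)


def parse_symbolic(text):
    match = SYMBOLIC_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Not a symbolic variant string: {text!r}")
    return SymbolicCoordinate(
        chrom=match["chrom"],
        start=int(match["start"]),
        end=int(match["end"]),
        ref="N",  # placeholder: the string has no reference base
        alt=match["alt"],
    )


if __name__ == "__main__":
    print(parse_symbolic("12:88126354-88141011 <DEL>"))
```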

davmlaw commented 12 months ago

Fixed Emma's example above - Just needed to pass in genome build so that it would populate reference base if not there
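
The idea of the fix, sketched with made-up names (a plain dict of contig to sequence stands in for the genome build's reference; the real code is different):

```python
from typing import NamedTuple


class SymbolicCoordinate(NamedTuple):
    """Hypothetical stand-in for VariantCoordinate."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


def fill_reference_base(coordinate, genome):
    """Replace a placeholder 'N' ref with the base at the start position.

    `genome` is assumed to map contig name -> sequence string for one
    genome build. Positions are 1-based, hence the -1.
    """
    if coordinate.ref != "N":
        return coordinate
    base = genome[coordinate.chrom][coordinate.start - 1]
    return coordinate._replace(ref=base)


if __name__ == "__main__":
    toy_genome = {"12": "ACGTACGT"}  # toy sequence, not real data
    placeholder = SymbolicCoordinate("12", 3, 6, "N", "<DEL>")
    print(fill_reference_base(placeholder, toy_genome))
```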

EmmaTudini commented 3 months ago

Testing now recorded in private issue - https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/gh/sacgf/variantgrid_private/3645

davmlaw commented 1 month ago

This is one of the early SV issues. We have been using/testing it very thoroughly and are now onto very minor issues.

Tested eg NM_000441.2(SLC26A4):c.1246_2340dup - works fine on VG test, and now loads with conservation too in VEP 112

SACGF / variantgrid

Structural Variant - symbolic variants #54
