Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
456 stars 152 forks source link

MSG: The coordinate interval for display is of different length than for the reference allele #1775

Open suzyhh opened 1 month ago

suzyhh commented 1 month ago

Describe the issue

Hello, I'm testing v113 using the v113 docker obtained from docker hub and the v113 cache downloaded using the following command: docker run --network host -t -i -v /mnt/data1/software/vep/:/opt/vep/.vep ensemblorg/ensembl-vep:release_113.0 INSTALL.pl -a c -s homo_sapiens_merged -y GRCh38

I have successfully tested v113 with a VCF containing structural variants, but I encounter an error when running with a VCF containing SNVs/indels. After the error the process just hangs doing nothing until I kill it. The output files contains the VCF header but no variants.

System

Full VEP command line

vep --cache --offline \
    -i ~{vcf} \
    --dir_cache /opt/vep/.vep \
    --dir_plugins /opt/vep/.vep/Plugins/ \
    --vcf --compress_output bgzip \
    --merged \
    --fasta $refFasta \
    --assembly GRCh38 \
    --no_stats \
    --fork ~{fork} \
    --buffer_size ~{buffer} \
    --no_escape \
    --check_existing \
    --hgvs --hgvsg \
    --af --af_gnomadg \
    --protein --uniprot \
    --symbol \
    --numbers \
    --allele_number \
    --sift b --polyphen b \
    --pubmed \
    --show_ref_allele \
    --variant_class \
    --mane_select \
    --transcript_version \
    --flag_pick_allele_gene \
    --plugin MaxEntScan,/opt/vep/.vep/fordownload,NCSS,SWA \
    --plugin SpliceAI,snv=~{spliceaiSnv},indel=~{spliceaiIndel} \
    --plugin SpliceRegion \
    --plugin NearestExonJB,max_range=100 \
    --plugin REVEL,/opt/vep/.vep/new_tabbed_revel_grch38.tsv.gz \
    --plugin SpliceDistance \
    --plugin UTRAnnotator,file=/opt/vep/.vep/fordownload/uORF_5UTR_GRCh38_PUBLIC.txt \
    --plugin GnomadPli,file=~{gnomadv4Pli} \
    --plugin CADD,snv=~{caddSnv},indels=~{caddIndel} \
    --plugin AlphaMissense,file=~{alphaMissense} \
    --custom file=${hgmdZip},short_name=HGMD,format=vcf,type=exact,coords=0,fields=CLASS%PHEN \
    --custom file=~{clinVarVcf},short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN \
    --custom file=~{gnomadv4Vcf},short_name=gnomad4,format=vcf,type=exact,coords=0,fields=homCount%hetCount%hemiCount%gnomadFilter \
    --force_overwrite \
    -o ~{outName}_snvs.vep.vcf.gz

Full error message

STDERR 
STDERR -------------------- EXCEPTION --------------------
STDERR MSG: The coordinate interval for display is of different length than for the reference allele
STDERR STACK Bio::EnsEMBL::Variation::Utils::Sequence::hgvs_variant_notation /opt/vep/src/ensembl-vep/Bio/EnsEMBL/Variation/Utils/Sequence.pm:511
STDERR STACK Bio::EnsEMBL::Variation::VariationFeature::hgvs_genomic /opt/vep/src/ensembl-vep/Bio/EnsEMBL/Variation/VariationFeature.pm:2013
STDERR STACK Bio::EnsEMBL::VEP::OutputFactory::VariationFeatureOverlapAllele_to_output_hash /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/OutputFactory.pm:1314
STDERR STACK Bio::EnsEMBL::VEP::OutputFactory::TranscriptVariationAllele_to_output_hash /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/OutputFactory.pm:1606
STDERR STACK Bio::EnsEMBL::VEP::OutputFactory::get_all_VariationFeatureOverlapAllele_output_hashes /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/OutputFactory.pm:393
STDERR STACK Bio::EnsEMBL::VEP::OutputFactory::get_all_output_hashes_by_VariationFeature /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/OutputFactory.pm:347
STDERR STACK Bio::EnsEMBL::VEP::OutputFactory::VCF::get_all_lines_by_InputBuffer /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/OutputFactory/VCF.pm:316
STDERR STACK Bio::EnsEMBL::VEP::Runner::_buffer_to_output /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:422
STDERR STACK (eval) /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:671
STDERR STACK Bio::EnsEMBL::VEP::Runner::_forked_process /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:665
STDERR STACK Bio::EnsEMBL::VEP::Runner::_forked_buffer_to_output /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:496
STDERR STACK Bio::EnsEMBL::VEP::Runner::next_output_line /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:367
STDERR STACK Bio::EnsEMBL::VEP::Runner::run /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:208
STDERR STACK toplevel /opt/vep/src/ensembl-vep/vep:46
STDERR Date (localtime)    = Mon Oct 21 13:54:46 2024
STDERR Ensembl API version = 113
STDERR ---------------------------------------------------
STDERR Died in forked process 53

I have tested my command using v112 and it completes successfully with no errors/warnings so it does seem to be a v113 problem. Running locally I have found that the problem occurs when using the --hgvsg flag. If I run using only --hgvs I do not get the above error, but I do get several warnings, here is a small snippet:

WARNING: 12 : WARNING: Transcript-assembly mismatch in rs6650119
WARNING: Transcript-assembly mismatch in rs6650119
WARNING: 13 : WARNING: Transcript-assembly mismatch in rs2275166
WARNING: Transcript-assembly mismatch in rs2275166
WARNING: 14 : WARNING: Transcript-assembly mismatch in rs309472
WARNING: Transcript-assembly mismatch in rs309472
WARNING: Transcript-assembly mismatch in rs586178
WARNING: Transcript-assembly mismatch in rs586178
WARNING: 39 : WARNING: Transcript-assembly mismatch in rs193922695
WARNING: Transcript-assembly mismatch in rs193922695
WARNING: 52 : WARNING: Transcript-assembly mismatch in rs7601549
WARNING: Transcript-assembly mismatch in rs7601549

I do not receive any errors or warning when I run v113 without both --hgvs and --hgvsg. This unfortunately renders v113 unusable for us as we rely on the hgvs annotations.

Many thanks!

likhitha-surapaneni commented 1 month ago

Hi @suzyhh , thank you for reporting the issue. Can you kindly provide us the input file/test input used?

suzyhh commented 1 month ago

@likhitha-surapaneni

I have attached the input file (bgzipped for compatibility with github) that generates the error - ta!

likhitha-surapaneni commented 1 month ago

Hi @suzyhh , thank you for providing us the file. We know why --hgvsg might be erroring out and are trying to push in a post-release fix for it (related to https://github.com/Ensembl/ensembl-variation/pull/1124)

Regarding the warnings associated with --hgvs, we are trying to investigate this and will update you

likhitha-surapaneni commented 1 month ago

Hi @suzyhh , the fixes for the error are now available on the latest docker images (latest or release_113.1).

The warning "Transcript-assembly mismatch in rs586178" means there is a mismatch between refseq and ensembl in this position. By default, VEP uses the ref alleles from refseq. For this example variant rs586178, the HGVSc is NM_001330430.4:c.48C>C, this is not a valid variant (basically edits the ref G to a C ). After editing the allele the variant becomes C>C. The consequence is synonymous which is also not correct. If you use --use_given_ref , the output would be NM_001330430.4:c.48G>C. Hope this is useful. We will try to make the warning and docs clearer for this case.

Thank you and let us know if this solves the issue.

suzyhh commented 1 month ago

@likhitha-surapaneni Thanks so much for this fix, I can now run the new version of 113 :)

I have to admit I don't 100% understand the --use_given_ref option, but I have tested in v113 and it doesn't seem to have the behaviour you describe. I've tested using rs778986.

When I run without --use_given_ref I see the behaviour you described, with the variant being incorrectly described as synonymous with the nomenclature G>G: image

However when I run with --use_given_ref, the HGVSc is no longer G>G, but I think is also incorrect (A>G and it should be T>C?), and the consequence and codons is also incorrect: image

I've tested with v112 using --use_given_ref and this gives me what I expect is the correct behaviour: image The cDNA change in the HGVSc is also T>C in v112 which I think is correct

Many thanks!

likhitha-surapaneni commented 3 weeks ago

Hi @suzyhh , Thank you for reporting this issue with clear examples. We are currently putting in a fix to return the correct HGVSc. Upon investigation, it turns out that the consequence calculation is ignoring the flag --use_given_ref particularly in 113 (it is reported correctly in 112 as illustrated in the examples provided). We are working on this and will let you know once we have an update.

Thank you for your patience.

suzyhh commented 2 weeks ago

Hiya @likhitha-surapaneni

I think I've found an additional bug(s) in the HGVS calculations in v113.1 possibly relating to the original error I reported. I've found several instances where variants are describe as "ins" by the HGVSc but "dup" by the HGVSg and vice versa:

1) HGVSc described as "ins"; HGVSg described as "dup": image While I haven't checked extensively, it looks like this error is occuring on variants with more than 1 alternate allele. This is the VCF entry for the first variant in the above screenshot: image Here is an example of the VCF VEP output for a few of these variants: hgvsC_ins_example.vep.vcf.txt

2) HGVSc described as "dup"; HGVSg described as "ins": image There's fewer of these, and I can't see anything tying these variants together apart from the fact they are all rather large duplications (all are >50bp in size). Here is an example of the VCF VEP output for a few of these: hgvsG_ins_example.vep.vcf.txt

Thank you!