Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

VEP v112 `ALLELE_NUM` empty in output VCF for input SV #1698

Closed dennishendriksen closed 6 days ago

dennishendriksen commented 3 weeks ago

Hello VEP team,

After updating VEP from v111.0 to v112.0, one of our downstream tools crashes due to an empty string value in the ALLELE_NUM field. See, for example, chr22:29767384 G>[1:109650635[GG in GRCh37_annotated.vcf.gz.

VEP v112: [image]

VEP v111: [image]

Q1: Is this intended? I would expect this field to always contain an ALT allele index.

Q2: In the images above you might also notice changes to Allele field values:

Could you explain what the dot in the new output means?

Q3: A last observation is that the number of consequences went down from 10 to 7. Could you explain this difference?

Possibly these changes are related to the 'Enhanced Structural Variant Support' feature in v112?

nuno-agostinho commented 3 weeks ago

Hi @dennishendriksen,

The results you are obtaining for that breakpoint variant seem incorrect.

In VEP 111, we represented the alternative allele of the breakpoint (in your case, [1:109650635[GG) to indicate all potential consequences. However, this is confusing if a breakpoint is composed of two or more chromosomal breakends.

As such, in VEP 112, we now separate the consequences of a breakpoint variant for each breakend:

To answer your questions:

Q1: Is this intended? I would expect this field to always contain an ALT allele index.

Unfortunately, it seems that VEP 112 is returning nothing for the allele number for breakpoint variants. I am going to check how to fix it.
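Until a fixed release is available, a downstream tool can guard against the empty value instead of crashing. The following is a minimal sketch, not part of VEP: it assumes the tool reads the field order from the `Format:` part of the CSQ header description and treats an empty `ALLELE_NUM` as missing. The function names `parse_csq_header` and `allele_num` are hypothetical.

```python
from typing import List, Optional


def parse_csq_header(description: str) -> List[str]:
    """Return the CSQ field names listed after 'Format: ' in the header description."""
    _, found, fmt = description.partition("Format: ")
    if not found:
        raise ValueError("no 'Format: ' section in CSQ description")
    # Drop trailing quote/bracket characters left over from the header line.
    return fmt.strip().rstrip('">').split("|")


def allele_num(entry: str, fields: List[str]) -> Optional[int]:
    """Return ALLELE_NUM for one CSQ entry, or None when it is empty or absent."""
    values = dict(zip(fields, entry.split("|")))
    raw = values.get("ALLELE_NUM", "")
    # An empty string (as seen for breakends in v112) is reported as missing.
    return int(raw) if raw.isdigit() else None
```

Returning `None` leaves it to the caller to decide whether to skip the record, fall back to the only ALT allele, or report an error.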

Q2: (...) Could you explain what the dot in the new output means?

The representation depicts a single breakend and its orientation:

More information is available in the VCF 4.4 specification, section 5.4.9: Single breakends.
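To make the notation concrete, here is a rough Python sketch that splits a breakend ALT string into its parts. It is written from the VCF specification's description of paired breakends (`t[p[`, `t]p]`, `]p]t`, `[p[t`) and single breakends (`t.`, `.t`); it is not VEP's own parser, and the `Breakend` type and `parse_breakend` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Breakend(NamedTuple):
    bases: str                 # sequence written next to the bracket or dot
    mate_chrom: Optional[str]  # None for a single breakend
    mate_pos: Optional[int]    # None for a single breakend
    bracket: Optional[str]     # "[" or "]"; None for a single breakend
    bases_first: bool          # True for t[p[, t]p] and t. forms


# Paired form: optional bases, bracket, chrom:pos, same bracket, optional bases.
_PAIRED = re.compile(
    r"^(?P<pre>[ACGTNacgtn]*)(?P<b>[\[\]])(?P<chrom>.+):(?P<pos>\d+)(?P=b)"
    r"(?P<post>[ACGTNacgtn]*)$"
)


def parse_breakend(alt: str) -> Breakend:
    m = _PAIRED.match(alt)
    if m:
        pre, post = m.group("pre"), m.group("post")
        if bool(pre) == bool(post):
            raise ValueError(f"bases must be on exactly one side: {alt!r}")
        return Breakend(pre or post, m.group("chrom"), int(m.group("pos")),
                        m.group("b"), bool(pre))
    if len(alt) > 1 and alt.startswith("."):
        return Breakend(alt[1:], None, None, None, False)
    if len(alt) > 1 and alt.endswith("."):
        return Breakend(alt[:-1], None, None, None, True)
    raise ValueError(f"not a breakend ALT: {alt!r}")
```

With this sketch, `parse_breakend("[1:109650635[GG")` yields a mate on chromosome 1 at position 109650635, while `parse_breakend(".G")` yields a single breakend with no mate.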

Q3: A last observation is that the number of consequences went down from 10 to 7. Could you explain this difference?

I'll also check if the changes are expected or not.

Thanks for reporting this issue! I'll report back as soon as possible.

Best regards, Nuno

nuno-agostinho commented 2 weeks ago

Hey @dennishendriksen,

Just to update you: I opened PR Ensembl/ensembl-variation#1095 to fix allele numbers for breakends. This will be available in the next version of VEP.

Thanks again for reporting this issue!

Cheers, Nuno

nuno-agostinho commented 6 days ago

Hey @dennishendriksen,

The bug fix for the allele number in breakpoint variants has now been merged into the code for the next version of VEP (VEP 113).

I will close this issue but feel free to open a new one if you find further issues or have any suggestions.

Cheers, Nuno

dennishendriksen commented 6 days ago

Hi @nuno-agostinho,

Thank you for this fix!

Q3: A last observation is that the number of consequences went down from 10 to 7. Could you explain this difference?

I'll also check if the changes are expected or not.

Did you get around to checking this?

Greetings, @dennishendriksen

nuno-agostinho commented 6 days ago

Hi @dennishendriksen,

Sorry for closing the issue prematurely.

I was not able to replicate your results. Could you please send me the VEP command that you ran to get those results?

Thanks, Nuno

dennishendriksen commented 6 days ago

Hi @nuno-agostinho,

From the previously attached vcf:

vep --allele_number --allow_non_variant --assembly GRCh38 --buffer_size 1000 --cache --compress_output bgzip --custom [PATH]/hg38.phyloP100way.bw,phyloP,bigwig,exact,0 --database 0 --dir_cache [PATH]/cache --dir_plugins [PATH]/plugins --dont_skip --exclude_predicted --fasta [PATH]/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz --flag_pick_allele --fork 4 --format vcf --hgvs --input_file GRCh37_normalized.vcf.gz --no_stats --numbers --offline --output_file GRCh37_annotated.vcf.gz --plugin Grantham --plugin SpliceAI,snv=[PATH]/spliceai_scores.masked.snv.hg38.vcf.gz,indel=[PATH]/spliceai_scores.masked.indel.hg38.vcf.gz --plugin Capice,GRCh37_capice_output.tsv.gz --plugin UTRannotator,[PATH]/uORF_5UTR_PUBLIC.txt --plugin Inheritance,[PATH]/inheritance_20240115.tsv --plugin VKGL,[PATH]/vkgl_consensus_20240401.tsv,1 --plugin gnomAD,[PATH]/gnomad.total.v4.1.sites.stripped.tsv.gz --plugin ClinVar,[PATH]/clinvar_20240603_stripped.tsv.gz --plugin AnnotSV,GRCh37_normalized.vcf.gz.tsv,AnnotSV_ranking_score;AnnotSV_ranking_criteria;ACMG_class --plugin AlphScore,[PATH]/AlphScore_final_20230825_stripped_GRCh38.tsv.gz --plugin ncER,[PATH]/GRCh38_ncER_perc.bed.gz --plugin FATHMM_MKL_NC,[PATH]/GRCh38_FATHMM-MKL_NC.tsv.gz --plugin ReMM,[PATH]/GRCh38_ReMM.tsv.gz --polyphen s --pubmed --refseq --safe --shift_3prime --sift s --symbol --total_length --use_given_ref --vcf

Greetings, @dennishendriksen

nuno-agostinho commented 6 days ago

Hey @dennishendriksen,

I am confused by your command, as you are mixing GRCh37 and GRCh38 data.

For GRCh38, the alternative breakend [1:109650635[G should only return an intergenic variant[^1], whereas there are transcript consequences if you use --assembly GRCh37.

Could you check if the results make sense for you when using GRCh37 throughout the VEP command?

Thanks, Nuno

[^1]: However, the output only shows results for the reference breakend (.G). This is a bug: it should also show intergenic variants if there are no other consequences. I will try to fix this.

dennishendriksen commented 6 days ago

Hi @nuno-agostinho,

Apologies for the confusing filename; it is an artifact of the liftover from GRCh37 to GRCh38. Both the file content and the command should be GRCh38. I'm not an expert on breakend notation. Could it be that you missed the final G in G>[1:109650635[GG?

Greetings, @dennishendriksen

nuno-agostinho commented 6 days ago

Hi @dennishendriksen,

could it be that you missed the final G in G>[1:109650635[GG?

Currently, the alternative sequence of a breakend is ignored by VEP. We intend to improve this in the future.

Upon further inspection, the difference may be related to updates to the Ensembl database. For instance, one of the consequences for the breakend [1:109650635[GG in GRCh38 is associated with regulatory feature ENSR00001170488, which is not available in the current version of Ensembl.

If you want the same results as in VEP 111, you can download the previous VEP cache from http://ftp.ensembl.org/pub/release-111/variation/vep and then run VEP with option --db_version 111. However, I would suggest simply using the most up-to-date version of the VEP cache when possible.
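To see which consequences disappeared between two runs, one option is to compare the (Feature, Consequence) pairs in the CSQ value of the same record from each output. The sketch below assumes the CSQ field names have already been read from each file's header and that both lists include `Feature` and `Consequence`; `feature_pairs` and `lost_and_gained` are hypothetical helper names. The Allele field is deliberately ignored, since its representation differs between v111 and v112 for breakends.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def feature_pairs(csq_value: str, fields: List[str]) -> Set[Pair]:
    """Collect (Feature, Consequence) pairs from one comma-separated CSQ value."""
    pairs = set()
    for entry in csq_value.split(","):
        values = dict(zip(fields, entry.split("|")))
        pairs.add((values.get("Feature", ""), values.get("Consequence", "")))
    return pairs


def lost_and_gained(old_csq: str, new_csq: str,
                    old_fields: List[str],
                    new_fields: List[str]) -> Tuple[Set[Pair], Set[Pair]]:
    """Return pairs present only in the old run and pairs present only in the new run."""
    old_pairs = feature_pairs(old_csq, old_fields)
    new_pairs = feature_pairs(new_csq, new_fields)
    return old_pairs - new_pairs, new_pairs - old_pairs
```

A feature that shows up only in the "lost" set, such as a regulatory feature retired from the database, points to a change in annotation content rather than in VEP's logic.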

Hope this makes it clearer, but tell me if you want to discuss this further. Thanks!

Cheers, Nuno

dennishendriksen commented 6 days ago

Hi @nuno-agostinho,

Good to know that it is a change in database content (I had not thought of running VEP v112 with the 111 database). Case closed. Thank you for your effort and time; it is greatly appreciated.

Cheers, @dennishendriksen

nuno-agostinho commented 6 days ago

Hi @dennishendriksen,

We are always here to help! Glad you reported the issue so that we could improve VEP.

Have a great day! 😄

Cheers, Nuno