Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
452 stars 151 forks source link

HGVS uses fasta reference, not reference from input file (may be documentation issue) #865

Open davmlaw opened 4 years ago

davmlaw commented 4 years ago

Describe the issue

When you run VEP against a VCF record with a reference that is different from the fasta sequence, the consequences are calculated against provided base, while HGVS is calcualted from the fasta reference sequence.

Example (GRCh37) - provided ref different from fasta reference:

17:7574012 G>A  -> Amino Acid: Q/*, Codons: Cag/Tag, HGVS: ENST00000269305.4:c.1015G>T 

Expected result: HGVS would match provided reference and other annotations Actual result: HGVS is from fasta reference (17:7574012 C>A) and is inconsistent with consequences and input sequence

It may be that HGVS must use the reference sequence (as eg changing a base may chance a splice site and thus an exon, or alter alignment/normalization) - in which case, it may be a good idea to add a note it to the documentation, I suggest here:

https://asia.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_hgvs

System

Full VEP command line

/home/variantgrid/perlbrew_runner.sh /data/annotation/VEP/ensembl-vep/vep
i in.vcf
o out.vcf
-cache
-dir /data/annotation/VEP/vep_cache
-fasta /data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz
-assembly GRCh37
-offline
-use_given_ref
-vcf
-compress_output gzip
-force_overwrite
-flag_pick
-exclude_predicted
-no_stats
-check_existing
-no_escape
-sift b
-uniprot
-hgvs
-symbol
-numbers
-domains
-canonical
-protein
-biotype
-uniprot
-af
-pubmed
-variant_class
-plugin Grantham
-plugin SpliceRegion
-plugin LoFtool
-plugin Mastermind,/data/annotation/VEP/annotation_data/GRCh37/mastermind_cited_variants_reference-2020.04.02-grch37.vcf.gz,1
-plugin MaxEntScan,/data/annotation/VEP/annotation_data/all_builds/maxentscan
-plugin dbNSFP,/data/annotation/VEP/annotation_data/GRCh37/dbNSFP4.0a.grch37.stripped.gz,CADD_phred,CADD_raw,FATHMM_pred,GERP++_RS,Interpro_domain,MutationAssessor_pred,MutationTaster_pred,Polyphen2_HVAR_pred,REVEL_score
-plugin dbscSNV,/data/annotation/VEP/annotation_data/GRCh37/dbscSNV1.1_GRCh37.txt.gz
-plugin SpliceAI,snv=/data/annotation/VEP/annotation_data/GRCh37/spliceai_scores.raw.snv.hg19.vcf.gz,indel=/data/annotation/VEP/annotation_data/GRCh37/spliceai_scores.raw.indel.hg19.vcf.gz
-custom /data/annotation/VEP/annotation_data/GRCh37/gnomad_GRCh37_combined_af.vcf.bgz,gnomAD,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_popmax,AF_sas,gnomad_filtered,nhomalt,popmax
-custom /data/annotation/VEP/annotation_data/GRCh37/hg19.100way.phastCons.bw,phastCons100way_vertebrate,bigwig,overlap
-custom /data/annotation/VEP/annotation_data/GRCh37/hg19.phastCons46way.placental.bw,phastCons46way_mammalian,bigwig,overlap
-custom /data/annotation/VEP/annotation_data/GRCh37/hg19.100way.phyloP100way.bw,phyloP100way_vertebrate,bigwig,overlap
-custom /data/annotation/VEP/annotation_data/GRCh37/hg19.phyloP46way.placental.bw,phyloP46way_mammalian,bigwig,overlap
-custom /data/annotation/VEP/annotation_data/GRCh37/repeatmasker_hg19.bed.gz,REPEAT_MASKER,bed,overlap
-custom /data/annotation/VEP/annotation_data/GRCh37/TOPMED_GRCh37.vcf.gz,TopMed,vcf,exact,0,TOPMED
-custom /data/annotation/VEP/annotation_data/GRCh37/UK10K_COHORT.20160215.sites.vcf.gz,UK10k,vcf,exact,0,AF
-custom /data/annotation/VEP/annotation_data/GRCh37/CosmicCodingMuts.normal.grch37.vcf.gz,COSMIC,vcf,exact,0,CNT,LEGACY_ID
aparton commented 4 years ago

Hi,

Thanks for this report. You're correct, there's an underlying inconsistency and documentation issue here that we can improve.

Currently, our HGVSg and HGVSc descriptions take their underlying reference sequence directly from our internal sequence lookups, rather than from what's given by the user. This is designed to limit any problems with downstream analyses by ensuring our HGVS is accurate. However, this is not overwritten by --use_given_ref, when it arguably should be. As suggested, our documentation should also provide more detail about this behaviour.

Additionally, HGVSp does not follow this behaviour - it will use the given reference when calculating the appropriate peptide changes. This different behaviour between HGVSp and HGVSc is something that we will look to address.

We'll discuss this issue within the team over the next few days to decide exactly now to tackle this issue, and we'll let you know how we intend to proceed. Thank you for bringing this to our attention.

Kind Regards, Andrew