Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
449 stars 151 forks source link

AF value very wrong (rs144217347) #1042

Closed ntm closed 1 year ago

ntm commented 3 years ago

Describe the issue

Running VEP 104 on rs144217347 results in AF=0.999 , but this variant has a very low AF (more like 0.001 in 1KG). The continental AFs returnd by VEP are correct (AFR_AF etc...), it's only the combined 1KG AF that's wrong. This issue can have severe consequences: if users filter their variants based on AF to exclude frequent polymorphisms, they may miss pathogenic variants. https://www.ncbi.nlm.nih.gov/snp/rs144217347

Additional information

I recently migrated from VEP 101 to 104, and I am tracking down possible regressions. I did not have this error with v101. I expect other variants have similar issues, this is just one example that I stumbled upon.

System

But this probably doesn't matter: the error is visible even when running VEP from the web interface.

ensembl.org/Homo_sapiens/Tools/VEP/

helensch commented 3 years ago

Hi @ntm

Thank you very much for letting us know about this data issue.

We will

A current work around is to use allele frequency from continental populations (https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_af_1kg) in place of the global allele frequency (AF) from 1000 Genomes Phase 3 (https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_af)

Thanks again for bringing this issue to our attention. I apologise for any inconvenience caused by this. I will update this ticket when fixes are in place.

Regards Helen

worker000000 commented 3 years ago

@helensch Thanks a lot, I tried to repeat this error reported by @ntm , but I even can not ge the allele, and when I use the argument --pick , it does not select the right canonical transcript, in hg38 it is NM_001321739.2:c.676dup(https://www.ncbi.nlm.nih.gov/clinvar/variation/805830/), but in varsome and cbioportal, the right trasncript is NM_138804.4(https://varsome.com/variant/hg19/rs144217347?annotation-mode=germline) (https://www.cbioportal.org/results/mutations?case_set_id=all&gene_list=M1AP&cancer_study_list=5c8a7d55e4b046111fee2296) <> <> VEP version: 104.3 VEP Cache version: homo_sapiens_vep_104_GRCh37 Perl version: perl-5.16.3-299.el7_9.x86_64 OS: centos 7 tabix installed ? yes My command: I use the --af_1kg , and delete the --af <> docker run --rm --user id -u:id -g -v /home/DATA/kobe:/data docker.io/ensemblorg/ensembl-vep vep --input_file /data/MDPWZ21060015ZZCASE.test.vcf --output_file /data/MDPWZ21060015ZZCASE_stats.vep.vcf --dir_cache /data/database/anno/vep104 --dir_plugins /data/database/anno/vep104/VEP_plugins-release-104 --fasta /data/database/anno/vep104/Homo_sapiens.GRCh37.dna.primary_assembly_chr.fa --offline --cache --transcript_version --force_overwrite --transcript_version --refseq --assembly GRCh37 --format vcf --cache_version 104 --keep_csq --variant_class --vcf --pick --sift b --polyphen b --ccds --hgvs --symbol --numbers --canonical --protein --biotype --uniprot --appris --gene_phenotype --af_1kg --af_esp --af_gnomad --pubmed --var_synonyms --variant_class --fork 4 2021-09-19 09:12:46 - INFO: BAM-edited cache detected, enabling --use_transcript_ref; use --use_given_ref to override this <> here is the origin input(ignore the LSEQ or RSEQ, just the coordinate and base) <> chr2 74808894 . A 31 v3;f0.02 STATUS=StrongLOH;SAMPLE=MDPWZ21060015ZZ;TYPE=SNV;DP=700;VD=0;AF=0;SHIFT3=0;MSI=1.000;MSILEN=1;SSF=0.32598;SO R=0;LSEQ=ACATCTGGGCCTCCAGTTAC;RSEQ=AGAAAGGGCACCTAAGAAGG GT:DP:VD:ALD:RD:AD:AF:BIAS:PMEAN:PSTD:QUAL:QSTD:SBF:ODDRATIO:MQ:SN:HIAF:ADJAF:NM 0/0:700:0:0,0:270,430:700,0:0:2, 0:21.2:1:45:1:1:0:60:1400:1:0:0.9 0/0:932:2:0,2:432,497:929,2:0.0021:2,0:35:1:31:1:0.50205:0:60:4:0.0022:0:1 <> <> output <> chr2 74808894 . A 31 v3;f0.02 STATUS=StrongLOH;SAMPLE=MDPWZ21060015ZZ;TYPE=SNV;DP=700;VD=0;AF=0;SHIFT3=0;MSI=1.000;MSILEN=1;SSF=0.32598;SOR=0;LSEQ=ACATCTGGGCCTCCAGTTAC;RSEQ=AGAAAGGGCACCTAAGAAGG;CSQ=A|frameshift_variant|HIGH|M1AP|130951|Transcript|NM_001321739.2|protein_coding|5/11||NM_001321739.2:c.676dup|NP_001308668.1:p.Trp226LeufsTer4|794-795|676-677|226|W/LX|tgg/tTgg|||-1||sequence_alteration|EntrezGene||YES|||NP_001308668.1|||||||||||||||||||||||||||||||||| GT:DP:VD:ALD:RD:AD:AF:BIAS:PMEAN:PSTD:QUAL:QSTD:SBF:ODDRATIO:MQ:SN:HIAF:ADJAF:NM 0/0:700:0:0,0:270,430:700,0:0:2,0:21.2:1:45:1:1:0:60:1400:1:0:0.9 0/0:932:2:0,2:432,497:929,2:0.0021:2,0:35:1:31:1:0.50205:0:60:4:0.0022:0:1

worker000000 commented 3 years ago

@helensch it seems that when the rs number can not be found correctly, there will be no frequency annotation 1632044582(1)

sarahhunt commented 3 years ago

Hi @worker000000 - we report the RefSeq Select/ MANE Select transcript when a single RefSeq transcript is requested. Our GRCh37 gene annotation is not frequently updated, so we can sometimes be out of date, but NM_001321739.2 appears to still be the RefSeq recommended transcript-

https://www.ncbi.nlm.nih.gov/nuccore/NM_001321739.2/

Can you check your input file - the ref allele appears to be missing in the example pasted above. This could cause problems when seeking matching alleles in the reference datasets.

ntm commented 1 year ago

The original issue I reported has been fixed at some point: with VEP 107 forrs144217347 we now get the correct AF=0.0010 . Closing as fixed.