Closed ntm closed 1 year ago
Hi @ntm
Thank you very much for letting us know about this data issue.
We will
A current work around is to use allele frequency from continental populations (https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_af_1kg) in place of the global allele frequency (AF) from 1000 Genomes Phase 3 (https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_af)
Thanks again for bringing this issue to our attention. I apologise for any inconvenience caused by this. I will update this ticket when fixes are in place.
Regards Helen
@helensch
Thanks a lot, I tried to repeat this error reported by @ntm , but I even can not ge the allele, and when I use the argument --pick , it does not select the right canonical transcript, in hg38 it is NM_001321739.2:c.676dup(https://www.ncbi.nlm.nih.gov/clinvar/variation/805830/), but in varsome and cbioportal, the right trasncript is NM_138804.4(https://varsome.com/variant/hg19/rs144217347?annotation-mode=germline) (https://www.cbioportal.org/results/mutations?case_set_id=all&gene_list=M1AP&cancer_study_list=5c8a7d55e4b046111fee2296)
<>
<>
VEP version: 104.3
VEP Cache version: homo_sapiens_vep_104_GRCh37
Perl version: perl-5.16.3-299.el7_9.x86_64
OS: centos 7
tabix installed ? yes
My command: I use the --af_1kg , and delete the --af
<>
docker run --rm --user id -u
:id -g
-v /home/DATA/kobe:/data docker.io/ensemblorg/ensembl-vep vep --input_file /data/MDPWZ21060015ZZCASE.test.vcf --output_file /data/MDPWZ21060015ZZCASE_stats.vep.vcf --dir_cache /data/database/anno/vep104 --dir_plugins /data/database/anno/vep104/VEP_plugins-release-104 --fasta /data/database/anno/vep104/Homo_sapiens.GRCh37.dna.primary_assembly_chr.fa --offline --cache --transcript_version --force_overwrite --transcript_version --refseq --assembly GRCh37 --format vcf --cache_version 104 --keep_csq --variant_class --vcf --pick --sift b --polyphen b --ccds --hgvs --symbol --numbers --canonical --protein --biotype --uniprot --appris --gene_phenotype --af_1kg --af_esp --af_gnomad --pubmed --var_synonyms --variant_class --fork 4
2021-09-19 09:12:46 - INFO: BAM-edited cache detected, enabling --use_transcript_ref; use --use_given_ref to override this
<>
here is the origin input(ignore the LSEQ or RSEQ, just the coordinate and base)
<>
chr2 74808894 . A 31 v3;f0.02 STATUS=StrongLOH;SAMPLE=MDPWZ21060015ZZ;TYPE=SNV;DP=700;VD=0;AF=0;SHIFT3=0;MSI=1.000;MSILEN=1;SSF=0.32598;SO
R=0;LSEQ=ACATCTGGGCCTCCAGTTAC;RSEQ=AGAAAGGGCACCTAAGAAGG GT:DP:VD:ALD:RD:AD:AF:BIAS:PMEAN:PSTD:QUAL:QSTD:SBF:ODDRATIO:MQ:SN:HIAF:ADJAF:NM 0/0:700:0:0,0:270,430:700,0:0:2,
0:21.2:1:45:1:1:0:60:1400:1:0:0.9 0/0:932:2:0,2:432,497:929,2:0.0021:2,0:35:1:31:1:0.50205:0:60:4:0.0022:0:1
<>
<>
output
<>
chr2 74808894 . A 31 v3;f0.02 STATUS=StrongLOH;SAMPLE=MDPWZ21060015ZZ;TYPE=SNV;DP=700;VD=0;AF=0;SHIFT3=0;MSI=1.000;MSILEN=1;SSF=0.32598;SOR=0;LSEQ=ACATCTGGGCCTCCAGTTAC;RSEQ=AGAAAGGGCACCTAAGAAGG;CSQ=A|frameshift_variant|HIGH|M1AP|130951|Transcript|NM_001321739.2|protein_coding|5/11||NM_001321739.2:c.676dup|NP_001308668.1:p.Trp226LeufsTer4|794-795|676-677|226|W/LX|tgg/tTgg|||-1||sequence_alteration|EntrezGene||YES|||NP_001308668.1|||||||||||||||||||||||||||||||||| GT:DP:VD:ALD:RD:AD:AF:BIAS:PMEAN:PSTD:QUAL:QSTD:SBF:ODDRATIO:MQ:SN:HIAF:ADJAF:NM 0/0:700:0:0,0:270,430:700,0:0:2,0:21.2:1:45:1:1:0:60:1400:1:0:0.9 0/0:932:2:0,2:432,497:929,2:0.0021:2,0:35:1:31:1:0.50205:0:60:4:0.0022:0:1
@helensch it seems that when the rs number can not be found correctly, there will be no frequency annotation
Hi @worker000000 - we report the RefSeq Select/ MANE Select transcript when a single RefSeq transcript is requested. Our GRCh37 gene annotation is not frequently updated, so we can sometimes be out of date, but NM_001321739.2 appears to still be the RefSeq recommended transcript-
https://www.ncbi.nlm.nih.gov/nuccore/NM_001321739.2/
Can you check your input file - the ref allele appears to be missing in the example pasted above. This could cause problems when seeking matching alleles in the reference datasets.
The original issue I reported has been fixed at some point: with VEP 107 forrs144217347 we now get the correct AF=0.0010 . Closing as fixed.
Describe the issue
Running VEP 104 on rs144217347 results in AF=0.999 , but this variant has a very low AF (more like 0.001 in 1KG). The continental AFs returnd by VEP are correct (AFR_AF etc...), it's only the combined 1KG AF that's wrong. This issue can have severe consequences: if users filter their variants based on AF to exclude frequent polymorphisms, they may miss pathogenic variants. https://www.ncbi.nlm.nih.gov/snp/rs144217347
Additional information
I recently migrated from VEP 101 to 104, and I am tracking down possible regressions. I did not have this error with v101. I expect other variants have similar issues, this is just one example that I stumbled upon.
System
But this probably doesn't matter: the error is visible even when running VEP from the web interface.
ensembl.org/Homo_sapiens/Tools/VEP/