Closed asmlgkj closed 2 years ago
Hi @asmlgkj, The HGVS being shifted should not interfere with the ClinVar results on VEP. Could you please provide an example?
@dglemos thanks a lot. it is just a guess, how vep use the data in clinvar, because the 3 rule will also need to change the position of ref, for example <> On the transcript, c.1_2delAT deletes AT from …AGGATGCG…, resulting in …AGGGCG…. There’s no ambiguity about what sequence was actually deleted.
c.1_3delATG deletes ATG, resulting in …AGGCG…. Note that you could also get this result by deleting GAT. <> so the ref postion in the ref base should also be changed, and clinvar keep the Chr Start End Ref Alt CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG <> when vep normalize a variant and then anno with clinvar, it should match from one record to another or from yes to no or from no to yes, <> it is just my guess, I am not clear about the inner process of vep
Thank you for the example. The output can have missing ClinVar results when using the ClinVar data (here) as custom annotation. VEP does not left-normalize insertion or deletion variants in repetitive sequence however, the ClinVar VCF is left normalized. As an example, for the VCF input:
#CHROM POS ID REF ALT QUAL FILTER INFO
21 37490194 rs1555984102 TAT T . . .
The VEP output using ClinVar custom annotation doesn't include any ClinVar data. However, this variant is found in the ClinVar VCF file:
21 37490193 520853 TTA T . . ALLELEID=512486;CLNDISDB=MONDO:MONDO:0013578,MedGen:C3279839,OMIM:614104,Orphanet:ORPHA464306|MONDO:MONDO:0100038,MedGen:CN283328|MeSH:D030342,MedGen:C0950123;CLNDN=Mental_retardation,_autosomal_dominant_7|Complex_neurodevelopmental_disorder|Inborn_genetic_diseases;CLNHGVS=NC_000021.9:g.37490195_37490196del;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Pathogenic;CLNVC=Deletion;CLNVCSO=SO:0000159;CLNVI=GenomeConnect_-_Simons_Searchlight:16886-x1_1;GENEINFO=DYRK1A:1859;MC=SO:0001589|frameshift_variant;ORIGIN=33;RS=1555984102
A workaround is to pre-process the VCF input before running VEP with tools such as is vt normalize. You can read more about this tool here: https://genome.sph.umich.edu/wiki/Vt
The same variant after being processed by vt tool:
21 37490193 rs1555984102 TTA T . . OLD_VARIANT=21:37490194:TAT/T
The VEP output includes ClinVar data:
rs1555984102 21:37490194-37490195 - ENSG00000157540 ENST00000338785 Transcript frameshift_variant 933-934 684-685 228-229 FM/FX ttTAtg/tttg rs1555984102 IMPACT=HIGH;STRAND=1;CLIN_SIG=pathogenic;PHENO=1;ClinVar=520853;ClinVar_CLNDN=Mental_retardation,_autosomal_dominant_7|Complex_neurodevelopmental_disorder|Inborn_genetic_diseases;ClinVar_CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;ClinVar_CLNSIG=Pathogenic;ClinVar_FILTER=.
Thank you for the example.
I havae 2 questions here.
1
tools like vt or bcftools norm do left norm, just like database clinvar, they do the same behaviour
hgvs do 3 rule (right norm) , your example seems to does not have something to do with hgvs
<>
2
I make a fake variant in vcf, the first record in my vcf.
both vt and bcftools will report the Reference allele mismatch at chr21:37490194 .. REF_SEQ:'ACA' vs VCF:'TAT', so it seems not being able to norm this variant.
<>
here is the vcf, I zipped it for uploading, you can unzip and will get the vcf.
chr21.zip
here is example of a special example, vep missed both clin_sig, but annovar get the right clin_sig of origin one
it is a record in clinvar chr4 55141030 55141030 - GAGGGA 28586 Gastrointestinal_stromal_tumor Human_Phenotype_Ontology:HP:0100723,MONDO:MONDO:0011719,MeSH:D046152,MedGen:C0238198,OMIM:606764,Orphanet:ORPHA44890 no_assertion_criteria_provided Pathogenic
here is the before normalize variant chr4 55141030 . G GGAGGGA vep output is
INFO=
chr4 55141030 . G GGAGGGA 45 f0.02 SAMPLE=FL202109121;TYPE=SNV;DP=710;END=46957784;VD=2;AF=0.0028;BIAS=2:0;REFBIAS=509:199;VARBIAS=2:0;PMEAN=33;PSTD=1;QUAL=45;QSTD=0;SBF=1;ODDRATIO=0;MQ=60;SN=4;HIAF=0.0028;ADJAF=0;SHIFT3=0;MSI=1;MSILEN=1;NM=2.5;HICNT=2;HICOV=710;LSEQ=AAGCCGTAGAAGCAAAGGTA;RSEQ=CACACGAGGTGCCGCCAGGA;DUPRATE=0;SPLITREAD=0;SPANPAIR=0;CSQ=GAGGGA|inframe_insertion|MODERATE|PDGFRA|5156|Transcript|NM_001347828.2|protein_coding|13/24||NM_001347828.2:c.1756_1757insAGAGGG|NP_001334757.1:p.Arg585_Val586insGluArg|1889-1890|1751-1752|584|W/WRE|tgg/tgGAGGGAg|||1||insertion|EntrezGene||YES|||NP_001334757.1|||||||||||||5||||||||||||||||||||| GT:DP:VD:AD:AF:RD:ALD 0/0:710:2:708,2:0.0028:509,199:2,0
annovar output is
Chr Start End Ref Alt Func.refGeneWithVer Gene.refGeneWithVer GeneDetail.refGeneWithVer ExonicFunc.refGeneWithVer AAChange.refGeneWithVer 1000g2015aug_all CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG avsnp150 cosmic70 Otherinfo chr4 55141030 55141030 - GAGGGA exonic PDGFRA . nonframeshift insertion PDGFRA:NM_006206.5:exon12:c.1676_1677insGAGGGA:p.R560_V561insER . 28586 Gastrointestinal_stromal_tumor Human_Phenotype_Ontology:HP:0100723,MONDO:MONDO:0011719,MeSH:D046152,MedGen:C0238198,OMIM:606764,Orphanet:ORPHA44890 no_assertion_criteria_provided Pathogenic rs587776794 . 0 45 710 chr4 55141030 . G GGAGGGA 45 f0.02 SAMPLE=FL202109121;TYPE=SNV;DP=710;END=46957784;VD=2;AF=0.0028;BIAS=2:0;REFBIAS=509:199;VARBIAS=2:0;PMEAN=33;PSTD=1;QUAL=45;QSTD=0;SBF=1;ODDRATIO=0;MQ=60;SN=4;HIAF=0.0028;ADJAF=0;SHIFT3=0;MSI=1;MSILEN=1;NM=2.5;HICNT=2;HICOV=710;LSEQ=AAGCCGTAGAAGCAAAGGTA;RSEQ=CACACGAGGTGCCGCCAGGA;DUPRATE=0;SPLITREAD=0;SPANPAIR=0 GT:DP:VD:AD:AF:RD:ALD 0/0:710:2:708,2:0.0028:509,199:2,0 <> <>
here is the after normalize variant chr4 55141035 . AGAGGG vep output is
INFO=
annovar output is
Chr Start End Ref Alt Func.refGeneWithVer Gene.refGeneWithVer GeneDetail.refGeneWithVer ExonicFunc.refGeneWithVer AAChange.refGeneWithVer 1000g2015aug_all CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG avsnp150 cosmic70 Otherinfo chr4 55141034 55141034 - AGAGGG exonic PDGFRA . nonframeshift insertion PDGFRA:NM_006206.5:exon12:c.1680_1681insAGAGGG:p.R560_V561insRG . . . . . . . . 0 45 710 chr4 55141035 . AGAGGG 45 f0.02 SAMPLE=FL202109121;TYPE=SNV;DP=710;END=46957784;VD=2;AF=0.0028;BIAS=2:0;REFBIAS=509:199;VARBIAS=2:0;PMEAN=33;PSTD=1;QUAL=45;QSTD=0;SBF=1;ODDRATIO=0;MQ=60;SN=4;HIAF=0.0028;ADJAF=0;SHIFT3=0;MSI=1;MSILEN=1;NM=2.5;HICNT=2;HICOV=710;LSEQ=AAGCCGTAGAAGCAAAGGTA;RSEQ=CACACGAGGTGCCGCCAGGA;DUPRATE=0;SPLITREAD=0;SPANPAIR=0 GT:DP:VD:AD:AF:RD:ALD 0/0:710:2:708,2:0.0028:509,199:2,0
any comment about this? thanks a lot
tools like vt or bcftools norm do left norm, just like database clinvar, they do the same behaviour hgvs do 3 rule (right norm) , your example seems to does not have something to do with hgvs
Yes, my example is not related to HGVS. The HGVS shifting shouldn't affect the ClinVar annotation.
We are checking why the variant chr4 55141030 . G GGAGGGA
does not have ClinVar annotation using the custom option. In the meantime, if you are interested in phenotype data it is possible to obtain this data with the VEP plugin Phenotypes.
Could you please share which VEP command you run to annotate the variant chr4 55141030 G GAGGGA
?
thanks a lot
<>
docker run --rm --user id -u
:id -g
-v /home/DATA/kobe:/data docker.io/ensemblorg/ensembl-vep vep --input_file /data/B.vcf --output_file /data/vep_B.vcf --dir_cache /data/database/anno/vep104 --dir_plugins /data/database/anno/vep104/VEP_plugins-release-104 --fasta /data/database/anno/vep104/Homo_sapiens.GRCh37.dna.primary_assembly_chr.fa --offline --cache --force_overwrite --transcript_version --refseq --assembly GRCh37 --format vcf --cache_version 104 --keep_csq --variant_class --vcf --sift b --polyphen b --ccds --hgvs --symbol --numbers --canonical --gene_phenotype --af_1kg --af_esp --af_gnomad --pubmed --var_synonyms --variant_class --fork 14 --check_existing --phased --numbers --xref_refseq --tsl --terms SO
By default, VEP returns allele specific clinical significance. For variant 4 55141030 rs587776794 G GGAGGGA
the alt allele is GGAGGGA however, alt allele from ClinVar is AGAGGG which doesn't match GGAGGGA. If you run VEP with option --clin_sig_allele 0
VEP will provide all known clinical significance values at the given locus without checking the alt allele. The output is the following:
rs587776794 4:55141030-55141031 GAGGGA 5156 NM_001347827.2 Transcript inframe_insertion 1811-1812 1676-1677 559 W/WRE tgg/tgGAGGGAg rs587776794 IMPACT=MODERATE;STRAND=1;VARIANT_CLASS=insertion;SYMBOL=PDGFRA;SYMBOL_SOURCE=EntrezGene;GIVEN_REF=-;USED_REF=-;EXON=12/17;HGVSc=NM_001347827.2:c.1681_1682insAGAGGG;HGVSp=NP_001334756.1:p.Arg560_Val561insGluArg;HGVS_OFFSET=5;CLIN_SIG=pathogenic;PHENO=1;PUBMED=12522257;VAR_SYNONYMS=ClinVar::RCV000014506,VCV000013547--OMIM::173490.0005
Thanks a lot. but if it is not allele specific, it will match many not wanted results, this maybe not a good
I am a little puzzuled.
I tried to normalize the variant, but failed
bgzip -c chr21_y.vcf > chr21_y.vcf.gz
tabix -p vcf chr21_y.vcf.gz
bcftools norm -m -both -o chr21_y_1.vcf chr21_y.vcf.gz
bcftools norm -f genome.fa -o chr21_y_2.vcf chr21_y_1.vcf
the vcf does not change , and I think it should be normalized, so that vep may handle it
chr21_y.vcf.zip
The variant is already normalized chr4 55141030 . G GGAGGGA
The missing ClinVar output is not being caused by the normalization. It's a mismatch of alt alleles, I am looking into it.
Another way to include ClinVar data into VEP is as custom annotation: https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html#custom_example The output for the normalized variant is:
rs587776794 4:55141030-55141031 GAGGGA ENSG00000134853 ENST00000257290.5 Transcript inframe_insertion 2007-2008 1676-1677 559 W/WRE tgg/tgGAGGGAg rs587776794 IMPACT=MODERATE;STRAND=1;VARIANT_CLASS=insertion;SYMBOL=PDGFRA;SYMBOL_SOURCE=HGNC;HGNC_ID=8803;CANONICAL=YES;CCDS=CCDS3495.1;GENE_PHENO=1;EXON=12/23;HGVSc=ENST00000257290.5:c.1681_1682insAGAGGG;HGVSp=ENSP00000257290.5:p.Arg560_Val561insGluArg;HGVS_OFFSET=5;PHENO=1;PUBMED=12522257;VAR_SYNONYMS=ClinVar::RCV000014506,VCV000013547--OMIM::173490.0005;ClinVar=13547;ClinVar_CLNDN=Gastrointestinal_stromal_tumor;ClinVar_CLNREVSTAT=no_assertion_criteria_provided;ClinVar_CLNSIG=Pathogenic;ClinVar_FILTER=.
but if it is normalize, why the G not moved, they have the same base G, I am puzzuled about this, thanks a lot
To normalize a variant the reference allele does not need to be different. rs587776794 is an insertion of AGAGGG at position chr4:55141035 Reference genome: TGGAGGGTC Insertion of AGAGGG: TGGAGGGAGAGGGTC
The most left representation is: TGGAGGGAGAGGGTC The alleles VCF representation is G/GGAGGGA You can read more about vt normalization here.
thanks a lot. I am still puzuled ,you also said The output for the normalized variant is:
rs587776794 4:55141030-55141031 GAGGGA but you also said The alleles VCF representation is G/GGAGGGA why not -/GAGGGA
-/GAGGGA
does not follow the VCF specifications.
From the VCF specification:
REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field)
Thanks a lot. so the ref and alt can not be null in vcf, am I right?
In the context of variant representation, parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0. It is a property describing the nature of the length of a variant's alleles and is defined as follows:
so D is not CA -, the have the same base G, but here alt can not null
is this issue still being in follow? thanks a lot
That is correct. In example D, in VCF representation the variant is represented as G/GCA being G the base before the occurrence.
about the former annotation, is there anything about it thanks a lot
Could you please explain what you mean?
thanks a lot
my question is here
thanks a lot.
it is just a guess,
how vep use the data in clinvar, because the 3 rule will also need to change the position of ref,
for example
<>
On the transcript, c.1_2delAT deletes AT from …AGGATGCG…, resulting in …AGGGCG…. There’s no ambiguity about what sequence was actually deleted.
c.1_3delATG deletes ATG, resulting in …AGGCG…. Note that you could also get this result by deleting GAT. <> so the ref postion in the ref base should also be changed, and clinvar keep the Chr Start End Ref Alt CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG <> when vep normalize a variant and then anno with clinvar, it should match from one record to another or from yes to no or from no to yes, <> it is just my guess, I am not clear about the inner process of vep
so is there a final decison about this
When the variants are normalized, VEP matches the ref and alt alleles with the ones from the ClinVar VCF file as described above (using custom annotation).
I think you are talking about two different ClinVar annotations. The ClinVar columns you shared are not obtained when you run your command.
Columns: Chr Start End Ref Alt CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG
Command: docker run --rm --user id -u:id -g -v /home/DATA/kobe:/data docker.io/ensemblorg/ensembl-vep vep --input_file /data/B.vcf --output_file /data/vep_B.vcf --dir_cache /data/database/anno/vep104 --dir_plugins /data/database/anno/vep104/VEP_plugins-release-104 --fasta /data/database/anno/vep104/Homo_sapiens.GRCh37.dna.primary_assembly_chr.fa --offline --cache --force_overwrite --transcript_version --refseq --assembly GRCh37 --format vcf --cache_version 104 --keep_csq --variant_class --vcf --sift b --polyphen b --ccds --hgvs --symbol --numbers --canonical --gene_phenotype --af_1kg --af_esp --af_gnomad --pubmed --var_synonyms --variant_class --fork 14 --check_existing --phased --numbers --xref_refseq --tsl --terms SO
The ClinVar annotation available in VEP (--check_existing) is not obtained from the ClinVar VCF file. Here you can read how this annotation is done: https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#colocated
thanks a lot so have you run this to check if the result is right?
Could you please explain which 'check' you mean? When you mention 'ClinVar annotation' which annotation do you mean, the default from vep or the custom annotation from the ClinVar VCF file?
Thanks a lot thanks a lot I think it is a bit conmfused. I guess we can go to the origin question,
chr4 55141030 . G GGAGGGA, can you annotate this variant in vep.
here is another variant.
Chr Start End Ref Alt Gene Type cHGVS pHGVS Transcript
chr17 7578471 7578482 GGGCGGGGGTGT - TP53 deletion c.448_459del p.Thr150_Pro153del NM_000546.6
do you think the Start End is right? I think is not, because in hg 19
chr17:7578470-7578485 CGGGCGGGGGTGTGGA it should move 2 base to the 3rule, should be chr17 7578473 7578484
Thanks for reporting the issue. There is a mismatch of alleles between Clinvar and VEP, we are going to work on a fix sometime next year. I'll let you know when there is any update.
Best wishes, Diana
I'm going to close this issue. Feel free to open a new issue if you have more questions.
Best wishes, Diana
Describe the issue
A clear and concise description of what the bug is.
Additional information
Please fill in the following sections to help us find the source of your issue as quickly as possible.
System