Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
456 stars 153 forks source link

will hgvs affect clinvar annotation #1060

Closed asmlgkj closed 2 years ago

asmlgkj commented 3 years ago

Describe the issue

A clear and concise description of what the bug is.

Additional information

Please fill in the following sections to help us find the source of your issue as quickly as possible.

System

dglemos commented 3 years ago

Hi @asmlgkj, The HGVS being shifted should not interfere with the ClinVar results on VEP. Could you please provide an example?

asmlgkj commented 3 years ago

@dglemos thanks a lot. it is just a guess, how vep use the data in clinvar, because the 3 rule will also need to change the position of ref, for example <> On the transcript, c.1_2delAT deletes AT from …AGGATGCG…, resulting in …AGGGCG…. There’s no ambiguity about what sequence was actually deleted.

c.1_3delATG deletes ATG, resulting in …AGGCG…. Note that you could also get this result by deleting GAT. <> so the ref postion in the ref base should also be changed, and clinvar keep the Chr Start End Ref Alt CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG <> when vep normalize a variant and then anno with clinvar, it should match from one record to another or from yes to no or from no to yes, <> it is just my guess, I am not clear about the inner process of vep

dglemos commented 3 years ago

Thank you for the example. The output can have missing ClinVar results when using the ClinVar data (here) as custom annotation. VEP does not left-normalize insertion or deletion variants in repetitive sequence however, the ClinVar VCF is left normalized. As an example, for the VCF input:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
21      37490194        rs1555984102    TAT     T       .       .       .

The VEP output using ClinVar custom annotation doesn't include any ClinVar data. However, this variant is found in the ClinVar VCF file:

21      37490193        520853  TTA     T       .       .       ALLELEID=512486;CLNDISDB=MONDO:MONDO:0013578,MedGen:C3279839,OMIM:614104,Orphanet:ORPHA464306|MONDO:MONDO:0100038,MedGen:CN283328|MeSH:D030342,MedGen:C0950123;CLNDN=Mental_retardation,_autosomal_dominant_7|Complex_neurodevelopmental_disorder|Inborn_genetic_diseases;CLNHGVS=NC_000021.9:g.37490195_37490196del;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Pathogenic;CLNVC=Deletion;CLNVCSO=SO:0000159;CLNVI=GenomeConnect_-_Simons_Searchlight:16886-x1_1;GENEINFO=DYRK1A:1859;MC=SO:0001589|frameshift_variant;ORIGIN=33;RS=1555984102

A workaround is to pre-process the VCF input before running VEP with tools such as is vt normalize. You can read more about this tool here: https://genome.sph.umich.edu/wiki/Vt

The same variant after being processed by vt tool:

21      37490193        rs1555984102    TTA     T       .       .       OLD_VARIANT=21:37490194:TAT/T

The VEP output includes ClinVar data:

rs1555984102    21:37490194-37490195    -       ENSG00000157540 ENST00000338785 Transcript      frameshift_variant      933-934 684-685 228-229 FM/FX   ttTAtg/tttg     rs1555984102    IMPACT=HIGH;STRAND=1;CLIN_SIG=pathogenic;PHENO=1;ClinVar=520853;ClinVar_CLNDN=Mental_retardation,_autosomal_dominant_7|Complex_neurodevelopmental_disorder|Inborn_genetic_diseases;ClinVar_CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;ClinVar_CLNSIG=Pathogenic;ClinVar_FILTER=.
asmlgkj commented 3 years ago

Thank you for the example. I havae 2 questions here. 1
tools like vt or bcftools norm do left norm, just like database clinvar, they do the same behaviour hgvs do 3 rule (right norm) , your example seems to does not have something to do with hgvs <> 2 I make a fake variant in vcf, the first record in my vcf. both vt and bcftools will report the Reference allele mismatch at chr21:37490194 .. REF_SEQ:'ACA' vs VCF:'TAT', so it seems not being able to norm this variant. <> here is the vcf, I zipped it for uploading, you can unzip and will get the vcf. chr21.zip

asmlgkj commented 3 years ago

here is example of a special example, vep missed both clin_sig, but annovar get the right clin_sig of origin one

it is a record in clinvar chr4 55141030 55141030 - GAGGGA 28586 Gastrointestinal_stromal_tumor Human_Phenotype_Ontology:HP:0100723,MONDO:MONDO:0011719,MeSH:D046152,MedGen:C0238198,OMIM:606764,Orphanet:ORPHA44890 no_assertion_criteria_provided Pathogenic

here is the before normalize variant chr4 55141030 . G GGAGGGA vep output is

INFO=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FL202109121

chr4 55141030 . G GGAGGGA 45 f0.02 SAMPLE=FL202109121;TYPE=SNV;DP=710;END=46957784;VD=2;AF=0.0028;BIAS=2:0;REFBIAS=509:199;VARBIAS=2:0;PMEAN=33;PSTD=1;QUAL=45;QSTD=0;SBF=1;ODDRATIO=0;MQ=60;SN=4;HIAF=0.0028;ADJAF=0;SHIFT3=0;MSI=1;MSILEN=1;NM=2.5;HICNT=2;HICOV=710;LSEQ=AAGCCGTAGAAGCAAAGGTA;RSEQ=CACACGAGGTGCCGCCAGGA;DUPRATE=0;SPLITREAD=0;SPANPAIR=0;CSQ=GAGGGA|inframe_insertion|MODERATE|PDGFRA|5156|Transcript|NM_001347828.2|protein_coding|13/24||NM_001347828.2:c.1756_1757insAGAGGG|NP_001334757.1:p.Arg585_Val586insGluArg|1889-1890|1751-1752|584|W/WRE|tgg/tgGAGGGAg|||1||insertion|EntrezGene||YES|||NP_001334757.1|||||||||||||5||||||||||||||||||||| GT:DP:VD:AD:AF:RD:ALD 0/0:710:2:708,2:0.0028:509,199:2,0

annovar output is

Chr Start End Ref Alt Func.refGeneWithVer Gene.refGeneWithVer GeneDetail.refGeneWithVer ExonicFunc.refGeneWithVer AAChange.refGeneWithVer 1000g2015aug_all CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG avsnp150 cosmic70 Otherinfo chr4 55141030 55141030 - GAGGGA exonic PDGFRA . nonframeshift insertion PDGFRA:NM_006206.5:exon12:c.1676_1677insGAGGGA:p.R560_V561insER . 28586 Gastrointestinal_stromal_tumor Human_Phenotype_Ontology:HP:0100723,MONDO:MONDO:0011719,MeSH:D046152,MedGen:C0238198,OMIM:606764,Orphanet:ORPHA44890 no_assertion_criteria_provided Pathogenic rs587776794 . 0 45 710 chr4 55141030 . G GGAGGGA 45 f0.02 SAMPLE=FL202109121;TYPE=SNV;DP=710;END=46957784;VD=2;AF=0.0028;BIAS=2:0;REFBIAS=509:199;VARBIAS=2:0;PMEAN=33;PSTD=1;QUAL=45;QSTD=0;SBF=1;ODDRATIO=0;MQ=60;SN=4;HIAF=0.0028;ADJAF=0;SHIFT3=0;MSI=1;MSILEN=1;NM=2.5;HICNT=2;HICOV=710;LSEQ=AAGCCGTAGAAGCAAAGGTA;RSEQ=CACACGAGGTGCCGCCAGGA;DUPRATE=0;SPLITREAD=0;SPANPAIR=0 GT:DP:VD:AD:AF:RD:ALD 0/0:710:2:708,2:0.0028:509,199:2,0 <> <>

here is the after normalize variant chr4 55141035 . AGAGGG vep output is

INFO= CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FL202109121 chr4 55141035 . AGAGGG 45 f0.02 SAMPLE=FL202109121;TYPE=SNV;DP=710;END=46957784;VD=2;AF=0.0028;BIAS=2:0;REFBIAS=509:199;VARBIAS=2:0;PMEAN=33;PSTD=1;QUAL=45;QSTD=0;SBF=1;ODDRATIO=0;MQ=60;SN=4;HIAF=0.0028;ADJAF=0;SHIFT3=0;MSI=1;MSILEN=1;NM=2.5;HICNT=2;HICOV=710;LSEQ=AAGCCGTAGAAGCAAAGGTA;RSEQ=CACACGAGGTGCCGCCAGGA;DUPRATE=0;SPLITREAD=0;SPANPAIR=0;CSQ=AGAGGG|inframe_insertion|MODERATE|PDGFRA|5156|Transcript|NM_001347828.2|protein_coding|13/24||NM_001347828.2:c.1755_1756insAGAGGG|NP_001334757.1:p.Arg585_Val586insArgGly|1893-1894|1755-1756|585-586|-/RG|-/AGAGGG|||1||sequence_alteration|EntrezGene||YES|||NP_001334757.1|||||||||||||||||||||||||||||||||| GT:DP:VD:AD:AF:RD:ALD 0/0:710:2:708,2:0.0028:509,199:2,0

annovar output is

Chr Start End Ref Alt Func.refGeneWithVer Gene.refGeneWithVer GeneDetail.refGeneWithVer ExonicFunc.refGeneWithVer AAChange.refGeneWithVer 1000g2015aug_all CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG avsnp150 cosmic70 Otherinfo chr4 55141034 55141034 - AGAGGG exonic PDGFRA . nonframeshift insertion PDGFRA:NM_006206.5:exon12:c.1680_1681insAGAGGG:p.R560_V561insRG . . . . . . . . 0 45 710 chr4 55141035 . AGAGGG 45 f0.02 SAMPLE=FL202109121;TYPE=SNV;DP=710;END=46957784;VD=2;AF=0.0028;BIAS=2:0;REFBIAS=509:199;VARBIAS=2:0;PMEAN=33;PSTD=1;QUAL=45;QSTD=0;SBF=1;ODDRATIO=0;MQ=60;SN=4;HIAF=0.0028;ADJAF=0;SHIFT3=0;MSI=1;MSILEN=1;NM=2.5;HICNT=2;HICOV=710;LSEQ=AAGCCGTAGAAGCAAAGGTA;RSEQ=CACACGAGGTGCCGCCAGGA;DUPRATE=0;SPLITREAD=0;SPANPAIR=0 GT:DP:VD:AD:AF:RD:ALD 0/0:710:2:708,2:0.0028:509,199:2,0

asmlgkj commented 3 years ago

any comment about this? thanks a lot

dglemos commented 3 years ago

tools like vt or bcftools norm do left norm, just like database clinvar, they do the same behaviour hgvs do 3 rule (right norm) , your example seems to does not have something to do with hgvs

Yes, my example is not related to HGVS. The HGVS shifting shouldn't affect the ClinVar annotation.

We are checking why the variant chr4 55141030 . G GGAGGGA does not have ClinVar annotation using the custom option. In the meantime, if you are interested in phenotype data it is possible to obtain this data with the VEP plugin Phenotypes.

dglemos commented 3 years ago

Could you please share which VEP command you run to annotate the variant chr4 55141030 G GAGGGA?

asmlgkj commented 3 years ago

thanks a lot <> docker run --rm --user id -u:id -g -v /home/DATA/kobe:/data docker.io/ensemblorg/ensembl-vep vep --input_file /data/B.vcf --output_file /data/vep_B.vcf --dir_cache /data/database/anno/vep104 --dir_plugins /data/database/anno/vep104/VEP_plugins-release-104 --fasta /data/database/anno/vep104/Homo_sapiens.GRCh37.dna.primary_assembly_chr.fa --offline --cache --force_overwrite --transcript_version --refseq --assembly GRCh37 --format vcf --cache_version 104 --keep_csq --variant_class --vcf --sift b --polyphen b --ccds --hgvs --symbol --numbers --canonical --gene_phenotype --af_1kg --af_esp --af_gnomad --pubmed --var_synonyms --variant_class --fork 14 --check_existing --phased --numbers --xref_refseq --tsl --terms SO

dglemos commented 3 years ago

By default, VEP returns allele specific clinical significance. For variant 4 55141030 rs587776794 G GGAGGGA the alt allele is GGAGGGA however, alt allele from ClinVar is AGAGGG which doesn't match GGAGGGA. If you run VEP with option --clin_sig_allele 0 VEP will provide all known clinical significance values at the given locus without checking the alt allele. The output is the following:

rs587776794     4:55141030-55141031     GAGGGA  5156    NM_001347827.2  Transcript      inframe_insertion       1811-1812       1676-1677       559     W/WRE   tgg/tgGAGGGAg   rs587776794     IMPACT=MODERATE;STRAND=1;VARIANT_CLASS=insertion;SYMBOL=PDGFRA;SYMBOL_SOURCE=EntrezGene;GIVEN_REF=-;USED_REF=-;EXON=12/17;HGVSc=NM_001347827.2:c.1681_1682insAGAGGG;HGVSp=NP_001334756.1:p.Arg560_Val561insGluArg;HGVS_OFFSET=5;CLIN_SIG=pathogenic;PHENO=1;PUBMED=12522257;VAR_SYNONYMS=ClinVar::RCV000014506,VCV000013547--OMIM::173490.0005
asmlgkj commented 3 years ago

Thanks a lot. but if it is not allele specific, it will match many not wanted results, this maybe not a good

I am a little puzzuled. I tried to normalize the variant, but failed

bgzip -c chr21_y.vcf > chr21_y.vcf.gz tabix -p vcf chr21_y.vcf.gz bcftools norm -m -both -o chr21_y_1.vcf chr21_y.vcf.gz bcftools norm -f genome.fa -o chr21_y_2.vcf chr21_y_1.vcf
the vcf does not change , and I think it should be normalized, so that vep may handle it chr21_y.vcf.zip

dglemos commented 3 years ago

The variant is already normalized chr4 55141030 . G GGAGGGA The missing ClinVar output is not being caused by the normalization. It's a mismatch of alt alleles, I am looking into it.

Another way to include ClinVar data into VEP is as custom annotation: https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html#custom_example The output for the normalized variant is:

rs587776794     4:55141030-55141031     GAGGGA  ENSG00000134853 ENST00000257290.5       Transcript      inframe_insertion       2007-2008       1676-1677       559     W/WRE   tgg/tgGAGGGAg   rs587776794     IMPACT=MODERATE;STRAND=1;VARIANT_CLASS=insertion;SYMBOL=PDGFRA;SYMBOL_SOURCE=HGNC;HGNC_ID=8803;CANONICAL=YES;CCDS=CCDS3495.1;GENE_PHENO=1;EXON=12/23;HGVSc=ENST00000257290.5:c.1681_1682insAGAGGG;HGVSp=ENSP00000257290.5:p.Arg560_Val561insGluArg;HGVS_OFFSET=5;PHENO=1;PUBMED=12522257;VAR_SYNONYMS=ClinVar::RCV000014506,VCV000013547--OMIM::173490.0005;ClinVar=13547;ClinVar_CLNDN=Gastrointestinal_stromal_tumor;ClinVar_CLNREVSTAT=no_assertion_criteria_provided;ClinVar_CLNSIG=Pathogenic;ClinVar_FILTER=.
asmlgkj commented 3 years ago

but if it is normalize, why the G not moved, they have the same base G, I am puzzuled about this, thanks a lot

dglemos commented 3 years ago

To normalize a variant the reference allele does not need to be different. rs587776794 is an insertion of AGAGGG at position chr4:55141035 Reference genome: TGGAGGGTC Insertion of AGAGGG: TGGAGGGAGAGGGTC

The most left representation is: TGGAGGGAGAGGGTC The alleles VCF representation is G/GGAGGGA You can read more about vt normalization here.

asmlgkj commented 3 years ago

thanks a lot. I am still puzuled ,you also said The output for the normalized variant is:

rs587776794 4:55141030-55141031 GAGGGA but you also said The alleles VCF representation is G/GGAGGGA why not -/GAGGGA image

dglemos commented 3 years ago

-/GAGGGA does not follow the VCF specifications.

From the VCF specification:

REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field)

asmlgkj commented 3 years ago

Thanks a lot. so the ref and alt can not be null in vcf, am I right? In the context of variant representation, parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0. It is a property describing the nature of the length of a variant's alleles and is defined as follows:
so D is not CA -, the have the same base G, but here alt can not null image

asmlgkj commented 3 years ago

is this issue still being in follow? thanks a lot

dglemos commented 3 years ago

That is correct. In example D, in VCF representation the variant is represented as G/GCA being G the base before the occurrence.

asmlgkj commented 3 years ago

about the former annotation, is there anything about it thanks a lot

dglemos commented 3 years ago

Could you please explain what you mean?

asmlgkj commented 3 years ago

thanks a lot my question is here

thanks a lot. it is just a guess, how vep use the data in clinvar, because the 3 rule will also need to change the position of ref, for example <> On the transcript, c.1_2delAT deletes AT from …AGGATGCG…, resulting in …AGGGCG…. There’s no ambiguity about what sequence was actually deleted.

c.1_3delATG deletes ATG, resulting in …AGGCG…. Note that you could also get this result by deleting GAT. <> so the ref postion in the ref base should also be changed, and clinvar keep the Chr Start End Ref Alt CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG <> when vep normalize a variant and then anno with clinvar, it should match from one record to another or from yes to no or from no to yes, <> it is just my guess, I am not clear about the inner process of vep



so is there a final decison about this

dglemos commented 3 years ago

When the variants are normalized, VEP matches the ref and alt alleles with the ones from the ClinVar VCF file as described above (using custom annotation). I think you are talking about two different ClinVar annotations. The ClinVar columns you shared are not obtained when you run your command. Columns: Chr Start End Ref Alt CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG Command: docker run --rm --user id -u:id -g -v /home/DATA/kobe:/data docker.io/ensemblorg/ensembl-vep vep --input_file /data/B.vcf --output_file /data/vep_B.vcf --dir_cache /data/database/anno/vep104 --dir_plugins /data/database/anno/vep104/VEP_plugins-release-104 --fasta /data/database/anno/vep104/Homo_sapiens.GRCh37.dna.primary_assembly_chr.fa --offline --cache --force_overwrite --transcript_version --refseq --assembly GRCh37 --format vcf --cache_version 104 --keep_csq --variant_class --vcf --sift b --polyphen b --ccds --hgvs --symbol --numbers --canonical --gene_phenotype --af_1kg --af_esp --af_gnomad --pubmed --var_synonyms --variant_class --fork 14 --check_existing --phased --numbers --xref_refseq --tsl --terms SO The ClinVar annotation available in VEP (--check_existing) is not obtained from the ClinVar VCF file. Here you can read how this annotation is done: https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#colocated

asmlgkj commented 3 years ago

thanks a lot so have you run this to check if the result is right?

dglemos commented 3 years ago

Could you please explain which 'check' you mean? When you mention 'ClinVar annotation' which annotation do you mean, the default from vep or the custom annotation from the ClinVar VCF file?

asmlgkj commented 3 years ago

Thanks a lot thanks a lot I think it is a bit conmfused. I guess we can go to the origin question,

chr4 55141030 . G GGAGGGA, can you annotate this variant in vep.

here is another variant. Chr Start End Ref Alt Gene Type cHGVS pHGVS Transcript
chr17 7578471 7578482 GGGCGGGGGTGT - TP53 deletion c.448_459del p.Thr150_Pro153del NM_000546.6
do you think the Start End is right? I think is not, because in hg 19

chr17:7578470-7578485 CGGGCGGGGGTGTGGA it should move 2 base to the 3rule, should be chr17 7578473 7578484

dglemos commented 2 years ago

Thanks for reporting the issue. There is a mismatch of alleles between Clinvar and VEP, we are going to work on a fix sometime next year. I'll let you know when there is any update.

Best wishes, Diana

dglemos commented 2 years ago

I'm going to close this issue. Feel free to open a new issue if you have more questions.

Best wishes, Diana