Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

HGVSp and silent changes (there should be no "=" signs in INFO field) #430

Closed: thedam closed this issue 5 years ago

thedam commented 5 years ago

Hi, the VCF specification says:

  1. INFO - additional information: (String, no white-space, semi-colons, or equals-signs permitted; commas are permitted only as delimiters for lists of values),

but the HGVSp nomenclature says:

silent (no change)
NP_003997.1:p.Cys188=
amino acid Cys188 is not changed (DNA level change ..TGC.. to ..TGT..)
NOTE: the description p.= means the entire protein coding region was analysed and no variant was found that changes (or is predicted to change) the protein sequence.

So putting the HGVSp notation of a silent change into ##INFO=<ID=CSQ makes the two specifications conflict :)

Take a look at my example VCF, ct.txt (saved as .txt because GitHub doesn't allow other formats). It has a CSQ tag with an equals sign in the INFO field:

CSQ=C|downstream_gene_variant|MODIFIER|NOC2L|26155|Transcript|NM_015658.3|protein_coding||||||||||rs6672356|1|1752|-1||EntrezGene|||T|T|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01,C|synonymous_variant|LOW|SAMD11|148398|Transcript|NM_152486.2|protein_coding|10/14||NM_152486.2:c.1027T>C|NP_689699.2:p.Arg343=|1107|1027|343|R|Cgg/Cgg|rs6672356|1||1||EntrezGene|||T|C|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01

This later causes problems with tools that follow the VCF specification, such as vcfR. Have a look at the issue I reported here: https://github.com/knausb/vcfR/issues/130
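To illustrate why the extra "=" trips up parsers, here is a minimal Python sketch. It is not how vcfR is implemented; both function names are hypothetical, and the sample INFO string is a shortened stand-in for the CSQ entry above. A parser that takes the specification literally expects at most one "=" per key=value pair, while a tolerant one splits only on the first "=".

```python
def parse_info_strict(info):
    """Parse a VCF INFO string, assuming values never contain '='.

    Hypothetical helper: it mirrors a literal reading of the VCF
    specification and is not taken from any real library.
    """
    result = {}
    for entry in info.split(";"):
        parts = entry.split("=")
        if len(parts) == 1:
            # Flag-style key without a value
            result[parts[0]] = True
        elif len(parts) == 2:
            result[parts[0]] = parts[1]
        else:
            raise ValueError("unexpected '=' inside INFO value: " + entry)
    return result


def parse_info_tolerant(info):
    """Same idea, but split on the first '=' only (hypothetical helper)."""
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        result[key] = value if sep else True
    return result
```

With an unescaped p.Arg343= inside CSQ, the strict version raises an error, while the tolerant version keeps the whole CSQ string as the value.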

To solve it, I propose changing NP_689699.2:p.Arg343= to NP_689699.2:p.Arg343Arg

Cheers Damian

at7 commented 5 years ago

Hello, could you please let me know if you are running vep with --no_escape? The default is to URI escape HGVS strings unless you specify --no_escape. Thanks, Anja

thedam commented 5 years ago

Yes, I run it with --no_escape.

at7 commented 5 years ago

Could you just drop the parameter or is there a reason why you need to run with --no_escape?

thedam commented 5 years ago

Hmm, a long time ago I didn't like notation such as NP_000383.1:p.Phe39%3D, and I learned that --no_escape "repairs" it. OK, now I understand the reason for escaping.

Thanks for the clarification.
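For readers who keep the default escaping but want the readable "=" back in downstream analysis, a minimal Python sketch follows. It assumes the CSQ value has already been extracted from the INFO field, that records are comma-separated and fields pipe-separated as in the example earlier in this thread, and that percent-decoding each field after splitting is acceptable. The function name decode_csq is hypothetical.

```python
from urllib.parse import unquote


def decode_csq(csq_value):
    """Split a CSQ value into records and fields, then percent-decode each field.

    Splitting happens before decoding so that any escaped delimiter
    inside a field cannot be mistaken for a real separator.
    """
    records = []
    for record in csq_value.split(","):
        records.append([unquote(field) for field in record.split("|")])
    return records


fields = decode_csq("C|synonymous_variant|NP_000383.1:p.Phe39%3D")[0]
print(fields[2])  # NP_000383.1:p.Phe39=
```

This keeps the VCF file itself valid for specification-compliant tools and restores the HGVS form only after parsing.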