wentgithub commented 4 years ago

Thanks for this powerful tool. I have a few questions. Some time ago I installed an old version of VEP from a git clone by running perl INSTALL.pl.

Now I want to install a newer version from conda.

1 Will that affect my use of the old version?

2 How can I pin the version of VEP so that it stays the same on different machines? For example: conda install -c bioconda/label/cf201901 ensembl-vep

Where can I find labels like bioconda/label/cf201901?

3 From which version does VEP support the --hgvsg argument? And if I want the VEP annotation to show NM accessions with their minor version instead of ENST identifiers, what should I do?

4 For GRCh37 the database seems very old. Will your team update it later, or will it stay as it is forever?

thanks a lot

wentgithub commented 4 years ago

My most important question: I want the VCF output to show NM accessions including the minor version, like NM123.5. I found two arguments that may be related to this,

--merged and --refseq, but I still cannot find the corresponding files to download.

I installed VEP with conda. After executing vep, it shows:

Possible precedence issue with control flow operator at /data/ngs/softs/conda/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.

----------------------------------

ENSEMBL VARIANT EFFECT PREDICTOR

----------------------------------

Versions:
ensembl : 98.e98e194
ensembl-funcgen : 98.36eef94
ensembl-io : 98.052d23b
ensembl-variation : 98.5f5ffce
ensembl-vep : 98.3

I used the file homo_sapiens_vep_87_GRCh37.tar.gz that I downloaded earlier. The reference genome comes from the GATK bundle.

My command line: vep --dir ensembl-vep/ --cache --offline --cache_version 87 --assembly GRCh37 --fa ucsc.hg19.fasta --force_overwrite --vcf --variant_class --gene_phenotype --vcf_info_field ANN --hgvs --hgvsg --transcript_version -i input.vcf -o output.vep.vcf

There is a directory called homo_sapiens/ in ensembl-vep/, and it contains the unpacked contents of homo_sapiens_vep_87_GRCh37.tar.gz.

In fact, I have forgotten what homo_sapiens_vep_87_GRCh37.tar.gz is. It was downloaded when I ran perl INSTALL.pl a long time ago. If it is a reference genome, what is the difference between ucsc.hg19.fasta and homo_sapiens_vep_87_GRCh37.tar.gz?
And where can I download the corresponding --refseq file? I found the site below; of the two files marked with red arrows, which is better? ftp://ftp.ensembl.org/pub/release-98/variation/indexed_vep_cache/

Thanks a lot

wentgithub commented 4 years ago

MSG: ERROR: Cannot map to LRGs in offline mode

The command line arguments include --offline --cache_version 87 --lrg

But if I connect over the internet, it is too slow. Is there an easy way to use the LRG resource? Thanks a lot.

helensch commented 4 years ago

Hi

Thank you for your queries.

With reference to your initial queries.

If the VEP GitHub installation and the conda installation are in different locations, you should be able to have both versions installed. I recommend updating to the latest version of VEP (v98). While conda/bioconda installations of VEP exist, they are not maintained by us and as such are not fully supported.

The --hgvsg option, which reports genomic HGVS nomenclature, was released in version 88. See https://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#history.

There is also the --hgvs option https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_hgvs

Information on the GRCh37 assembly in Ensembl is at https://www.ensembl.org/info/website/tutorials/grch37.html and in a recent blog post http://www.ensembl.info/2019/09/19/simplifying-our-grch37-services/

The dbSNP version on GRCh37 is currently b151.

Regards Helen

wentgithub commented 4 years ago

Thanks a lot @helensch. 1 Your link contains the sentence "The human assembly GRCh37 (also known as hg19)", so the reference genome from GATK (ucsc.hg19.fasta) can also be used in place of GRCh37_p13.fa, since I used ucsc.hg19.fasta in bwa. Am I right?

2 How can I use LRG? Is there something I need to download?

3 About the VEP annotation, I found two cases that may not be consistent with HGVS (maybe I have misunderstood):

1. No duplicated base:
chr20 31022441 . A AG . PASS DP=2672;ECNT=1;POP_AF=5e-08;P_CONTAM=0;P_GERMLINE=-1.236;TLOD=3707.14;ANN=G|frameshift_variant|HIGH|ASXL1|ENSG00000171456|Transcript|ENST00000306058.5|protein_coding|12/12||ENST00000306058.5:c.1919dup|ENSP00000305119.5:p.Gly641TrpfsTer12|1911-1912|1911-1912|637-638|-/X|-/G|||1||insertion|HGNC|18318|1|8|chr20:g.31022449dup,G|frameshift_variant|HIGH|ASXL1|ENSG00000171456|Transcript|ENST00000375687.4|protein_coding|13/13||ENST00000375687.4:c.1934dup|ENSP00000364839.4:p.Gly646TrpfsTer12|2350-2351|1926-1927|642-643|-/X|-/G|||1||insertion|HGNC|18318|1|8|chr20:g.31022449dup,G|downstream_gene_variant|MODIFIER|ASXL1|ENSG00000171456|Transcript|ENST00000470145.1|processed_transcript|||||||||||4718|1||insertion|HGNC|18318|1||chr20:g.31022449dup,G|downstream_gene_variant|MODIFIER|ASXL1|ENSG00000171456|Transcript|ENST00000553345.1|protein_coding|||||||||||1228|1|cds_end_NF|insertion|HGNC|18318|1||chr20:g.31022449dup,G|downstream_gene_variant|MODIFIER|ASXL1|ENSG00000171456|Transcript|ENST00000555564.1|retained_intron|||||||||||1666|1||insertion|HGNC|18318|1||chr20:g.31022449dup GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:1310,1260:0.491:2570:739,685:571,575:31,29:185,188:60:29:false:false:.:.:49.44:100:0:0.485,0.475,0.49:0.003291,0.019,0.978

As you can see, it gives dup but does not give the duplicated base G.

Perhaps the correct form is c.1934dupG?

2. No deleted base for a single-base deletion:
chr2 148683685 . TA T . PASS DP=2492;ECNT=1;POP_AF=5e-08;P_CONTAM=0;P_GERMLINE=-491.3;TLOD=408.36;ANN=-|frameshift_variant|HIGH|ACVR2A|ENSG00000121989|Transcript|ENST00000241416.7|protein_coding|10/11||ENST00000241416.7:c.1310del|ENSP00000241416.7:p.Lys437ArgfsTer5|1939|1303|435|K/X|Aaa/aa|||1||deletion|HGNC|173|1|7|chr2:g.148683693del,-|downstream_gene_variant|MODIFIER|ORC4|ENSG00000115947|Transcript|ENST00000392857.5|protein_coding|||||||||||4282|-1||deletion|HGNC|8490|1||chr2:g.148683693del,-|frameshift_variant|HIGH|ACVR2A|ENSG00000121989|Transcript|ENST00000404590.1|protein_coding|11/12||ENST00000404590.1:c.1310del|ENSP00000384338.1:p.Lys437ArgfsTer5|1473|1303|435|K/X|Aaa/aa|||1||deletion|HGNC|173|1|7|chr2:g.148683693del,-|non_coding_transcript_exon_variant|MODIFIER|ACVR2A|ENSG00000121989|Transcript|ENST00000495775.1|processed_transcript|1/2||ENST00000495775.1:n.438del||431|||||||1||deletion|HGNC|173|1|7|chr2:g.148683693del,-|frameshift_variant|HIGH|ACVR2A|ENSG00000121989|Transcript|ENST00000535787.1|protein_coding|10/11||ENST00000535787.1:c.986del|ENSP00000439988.1:p.Lys329ArgfsTer5|1375|979|327|K/X|Aaa/aa|||1||deletion|HGNC|173|1|7|chr2:g.148683693del GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:2272,144:0.06:2416:1249,90:1023,54:32,30:187,154:46:30:false:false:.:.:62.59:100:0:0.061,0.051,0.06:0.0009802,0.006608,0.992

Perhaps the correct form is c.1310delA?

ima23 commented 4 years ago

Hi @wentgithub,

To answer your second set of questions, and also the LRG question in your last message:

You can download the merged and refseq cache using INSTALL.pl and selecting the corresponding cache number. Alternatively you can download them manually. More information can be found in the cache section: https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache

The precedence issue notification usually arises from the use of different underlying Perl versions.

homo_sapiens_vep_87_GRCh37.tar.gz is the Ensembl variation VEP cache released with Ensembl release 87. The latest release is 98, and its corresponding caches are the ones that you point to with the red arrows. The ucsc.hg19.fasta file is the FASTA file containing the human reference genome hg19.

The release 87 RefSeq and merged VEP cache files can be found here: ftp://ftp.ensembl.org/pub/release-87/variation/VEP/

We do recommend using the latest data. Is there something present in cache_version 87 (released in 2016) that is missing from 98?

Yes, that is correct: the LRG option currently requires a database connection. We will look into improving our documentation.

Relating to your last set of questions:

fasta: in principle, if the FASTA file represents the same reference sequence then it should not make a difference. To be safe, you could use the one provided with VEP.
LRG: you need a database connection.
HGVS: our current HGVS annotation is compliant with the HGVS nomenclature, which does not require the G to be noted after the dup or the A after the del: https://varnomen.hgvs.org/recommendations/DNA/variant/duplication/ https://varnomen.hgvs.org/recommendations/DNA/variant/deletion/
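For comparing VEP output with tools or reports that still write the base after dup or del (for example c.1934dupG), one option is to normalise those strings before comparing. The following is only a minimal Python sketch: the function name strip_explicit_bases is made up for illustration and is not part of VEP, and it only handles a plain trailing dup or del, not delins or other variant types.

```python
import re

# Matches explicit bases written after a trailing "dup" or "del",
# e.g. the "G" in "c.1934dupG". Strings such as "delinsAT" are left alone
# because the characters after "del" are not all nucleotide letters.
_TRAILING_BASES = re.compile(r"(dup|del)[ACGTUacgtu]+$")


def strip_explicit_bases(hgvs: str) -> str:
    """Return the HGVS string without the bases listed after dup/del.

    Hypothetical helper for illustration only.
    """
    return _TRAILING_BASES.sub(r"\1", hgvs)


print(strip_explicit_bases("ENST00000375687.4:c.1934dupG"))
print(strip_explicit_bases("ENST00000241416.7:c.1310delA"))
print(strip_explicit_bases("ENST00000375687.4:c.1934dup"))
```

With this, c.1934dupG and c.1934dup compare as equal, which matches the point above that the trailing base is not required.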

There is a plugin, FlagLRG, that will add the LRG ID matching either the RefSeq or Ensembl transcript IDs, if this is something you want to do: https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html

Kind regards, Irina

wentgithub commented 4 years ago

thanks a lot @helensch

We do recommend using latest data, is there something present in cache_version 87 (released in 2016) that is missing from 98?

No. About a year ago I used VEP, and I have forgotten exactly why I installed cache_version 87. Maybe at that time I mistakenly thought that GRCh37 would stop at version 87.
1 Although I have run the command line successfully, I am still a little confused. Is homo_sapiens_vep_87_GRCh37.tar.gz a faster replacement for GRCh37_p13.fa or ucsc.hg19.fasta?

2 Must LRG be used in online mode? I have installed the plugin and followed the instructions at https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#flaglrg:
wget ftp://ftp.ebi.ac.uk/pub/databases/lrgex/list_LRGs_transcripts_xrefs.txt

The file has only 1561 rows, so I guess it may be an example file.

I tried to use the --lrg argument: vep --dir ensembl-vep --offline --cache_version 87 --assembly GRCh37 --fa ucsc.hg19.fasta --force_overwrite --vcf --variant_class --gene_phenotype --vcf_info_field ANN --hgvs --hgvsg --transcript_version --lrg list_LRGs_transcripts_xrefs.txt -i vep1.vcf -o test1.vep.vcf

It throws this error:

EXCEPTION --------------------
MSG: ERROR: Cannot map to LRGs in offline mode

How should I revise my command? Thanks a lot.

ima23 commented 4 years ago

Hi @wentgithub,

If there is no particular reason to use cache 87, then we recommend you use the 98 GRCh37 cache, which has the latest data. For example, ClinVar data, one of the sources you highlighted in your first post, is from 2019 in 98 vs 2016 in 87.

No. homo_sapiens_vep_87_GRCh37.tar.gz is not a replacement for GRCh37_p13.fa or ucsc.hg19.fasta. The tar.gz is the cache for variation data, while the .fa/.fasta files hold the reference DNA sequence. For speed, it is best to use both the cache (un-tar the .tar.gz) and a FASTA file (either .fa or .fasta). For example:

./vep -i input.vcf --offline --cache --cache_version 98 --fasta Homo_sapiens.GRCh37.hg19.fasta ....

The ./vep --lrg option is different from the FlagLRG.pm plugin. The VEP --lrg option only works with a database connection. If you want to run only in offline mode but want to have the LRG identifiers in the VEP output as a separate column, then the FlagLRG.pm plugin will do that for you. The FlagLRG plugin is very much dependent on the transcript versions, and they have to be a perfect match to the identifiers in list_LRGs_transcripts_xrefs.txt for the LRG identifiers to be extracted. The list_LRGs_transcripts_xrefs.txt file is based on the 98 release.

./vep -i input.vcf --offline --cache --cache_version 98 --fasta Homo_sapiens.GRCh37.hg19.fasta .... --plugin FlagLRG,list_LRGs_transcripts_xrefs.txt

As the error says, VEP --lrg can't be run in offline mode.

Kind regards, Irina

wentgithub commented 4 years ago

Thanks a lot for your answer @ima23. I have several more questions that I would like to confirm with you.

1 You provide a list of calculated variant consequences at https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html. Is there a BIOTYPE list as detailed as that?
BIOTYPE refers to the following:

INFO=

2 In the calculated variant consequences diagram at https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html, is an arrow pointing to the upper chromosome region missing? (I have added a red arrow.)

ANNOVAR, for example, first determines whether the variant hits exons, intergenic regions, introns, or non-coding RNA genes.

For some unknown reason, some pictures are not shown, so I have split the question and continue in the next comment.

wentgithub commented 4 years ago

It then gives a detailed explanation of these exonic_variant_function annotations.

3 Is the Consequence synonymous_variant exactly the same as the synonymous SNV reported by ANNOVAR? I need to exclude synonymous SNVs from my Tumor Mutational Burden (TMB) calculation.

4 If I want to select only variants in exonic or splicing regions, is that the area I have marked in red? In particular, does it contain start_retained_variant and start_lost, excluding start_lost? (I read the docs and guess so, but I want to be sure.)

5 When I run the command line below (I put homo_sapiens and homo_sapiens_refseq in the directory vep_grch37):
vep --dir vep_grch37 --cache --offline --cache_version 98 --refseq --canonical --biotype --show_ref_allele --assembly GRCh37 --fa ucsc.hg19.fasta --force_overwrite --vcf --variant_class --gene_phenotype --vcf_info_field ANN --hgvs --hgvsg --transcript_version -in1.norm.vcf -o out2.vcf
it prints the following messages:
Possible precedence issue with control flow operator at /data/ngs/softs/conda/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
2019-12-11 20:16:33 - INFO: BAM-edited cache detected, enabling --use_transcript_ref; use --use_given_ref to override this

You have already explained that the Perl message is caused by version differences, and it does not seem to affect the result. But does the second message mean that I forgot to add the --use_transcript_ref argument? What should I do?

6 ANNOVAR provides results in VCF and txt formats. I also looked for an argument in VEP that gives txt-like output, but could not find one. Is there one? (The txt format is easier to read.) Thanks a lot.

ens-emily commented 4 years ago

A list of Biotypes used to describe Ensembl transcripts can be found here: https://www.ensembl.org/info/genome/genebuild/biotypes.html
The box you indicate contains the sequence ontology consequences that cannot be plotted against that gene model. The table below the diagram on the documentation page lists the descriptions of the consequence terms: https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
Synonymous_variant is the official terminology used by the Sequence Ontology, SO:0001819, as shown in the table I linked above. The image you include shows that this is also the SO term that ANNOVAR links to their non-official term, synonymous SNV.
The exonic region includes anything that falls within the boxes in the diagram, both the coloured and non-coloured boxes. This means that the 5' and 3' UTR variants, and anything affecting the start and end, are exonic. What you choose to include in your analyses is up to you and I cannot advise you on this, only on what our data mean.
--use_transcript_ref or --use_given_ref are important when using the RefSeq cache because RefSeq transcripts do not necessarily match the reference genome. This means that if your input data includes a variant at a locus where the RefSeq transcript does not match the reference genome, the VEP has to make a choice about which allele it uses as the reference allele, and which it uses as the alternative. To make it use the allele that was included in the RefSeq transcript, add --use_transcript_ref to your VEP command; to make it use the allele that you gave it, add --use_given_ref.
The output options for VEP are described here: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#defaultout
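As a rough illustration of working with the Consequence terms discussed above, the Python sketch below splits an ANN value like the ones posted in this thread and drops the entries annotated as synonymous_variant. It assumes that Consequence is the second pipe-separated field and the transcript ID the seventh, as in the records above; the real field order should be read from the INFO header line of your own VCF. The function names are hypothetical, the example entries are shortened, and the third entry is invented for the example. Whether dropping these entries is right for a given analysis is a separate decision.

```python
def parse_annotations(info_value):
    """Split an ANN/CSQ value into one list of fields per annotated transcript."""
    return [entry.split("|") for entry in info_value.split(",")]


def drop_synonymous(info_value, consequence_index=1):
    """Keep only entries whose consequence terms do not include synonymous_variant.

    A single entry can carry several terms joined with "&".
    consequence_index is an assumption based on the records in this thread.
    """
    kept = []
    for fields in parse_annotations(info_value):
        terms = fields[consequence_index].split("&")
        if "synonymous_variant" not in terms:
            kept.append(fields)
    return kept


# Shortened entries; the last one is made up purely for illustration.
EXAMPLE_ANN = (
    "-|frameshift_variant|HIGH|ACVR2A|ENSG00000121989|Transcript|ENST00000241416.7|protein_coding,"
    "-|downstream_gene_variant|MODIFIER|ORC4|ENSG00000115947|Transcript|ENST00000392857.5|protein_coding,"
    "T|synonymous_variant|LOW|GENE_X|ENSG_X|Transcript|ENST_X.1|protein_coding"
)

for fields in drop_synonymous(EXAMPLE_ANN):
    # Print the transcript ID and its consequence term(s).
    print(fields[6], fields[1])
```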

wentgithub commented 4 years ago

thanks a lot. vep team is really a great team.

--use_transcript_ref or --use_given_ref are important when using the RefSeq cache because RefSeq transcripts do not necessarily match the reference genome. This means that if your input data includes a variant at a locus where the RefSeq transcript does not match the reference genome, the VEP has to make a choice about which allele it uses as the reference allele, and which it uses as the alternative. To make it use the allele that was included in the RefSeq transcript, add --use_transcript_ref to your VEP command; to make it use the allele that you gave it, add --use_given_ref.

So which one is recommended? Thanks a lot.

ens-emily commented 4 years ago

It really depends what is more important to your downstream analysis. If it is important that the annotation reflects the change to the RefSeq sequence as it is, --use_transcript_ref. If it is important that the annotation reflects the change to the reference genome sequence, --use_given_ref.

wentgithub commented 4 years ago

I am a little torn. Both seem important. I am mostly concerned with which one is closer to HGVS. Is there a recommended one? It may be hard for you to say too, but I think you are more experienced. Thanks.

ens-emily commented 4 years ago

We cannot make recommendations on this, only explain the options.

wentgithub commented 4 years ago

Are there two columns in the VEP result like ANNOVAR's "funcRefgene" and "exonicfuncRefgene"? It seems that there is only Consequence, which corresponds to "funcRefgene", but nothing that corresponds to "exonicfuncRefgene".

The exonic region includes anything that falls within the boxes in the diagram, both the coloured and non-coloured boxes. This means that the 5' and 3' UTR variants, and anything affected the start and end are exonic.

I would still like to confirm with you which regions are exonic. I think they are exactly the following 12 terms that I have coloured (1, 2, 3). Is that right?

thanks @ens-emily @dzerbino @helensch @ima23

worker000000 commented 4 years ago

Many annotations contain %3D. I tried to find out what this means, but failed. Can you help me figure out what it means? Thanks a lot.

worker000000 commented 4 years ago

For the variant "chr12 25398285 . C G", ANNOVAR gives NM_004985.5, and the reference answer is also NM_004985.5.
But VEP version 98 (both the vep command version 98 and the caches homo_sapiens_refseq_vep_98_GRCh37.tar.gz and homo_sapiens_vep_98_GRCh37.tar.gz) gives:
ANN=G|missense_variant|MODERATE|KRAS|3845|Transcript|NM_004985.3|protein_coding|2/5||NM_004985.3:c.34G>C|NP_004976.2:p.Gly12Arg|215|34|12|G/R|Ggt/Cgt|||-1||SNV|EntrezGene|||C|C||||chr12:g.25398285C>G,G|missense_variant|MODERATE|KRAS|3845|Transcript|NM_004985.4|protein_coding|2/5||NM_004985.4:c.34G>C|NP_004976.2:p.Gly12Arg|226|34|12|G/R|Ggt/Cgt|||-1||SNV|EntrezGene||rseq_mrna_nonmatch&rseq_3p_mismatch|C|C|OK|||chr12:g.25398285C>G,G|missense_variant|MODERATE|KRAS|3845|Transcript|NM_033360.2|protein_coding|2/6||NM_033360.2:c.34G>C|NP_203524.1:p.Gly12Arg|215|34|12|G/R|Ggt/Cgt|||-1||SNV|EntrezGene|||C|C||||chr12:g.25398285C>G,G|missense_variant|MODERATE|KRAS|3845|Transcript|NM_033360.3|protein_coding|2/6||NM_033360.3:c.34G>C|NP_203524.1:p.Gly12Arg|226|34|12|G/R|Ggt/Cgt|||-1||SNV|EntrezGene||rseq_mrna_nonmatch&rseq_3p_mismatch|C|C|OK|||chr12:g.25398285C>G,G|missense_variant|MODERATE|KRAS|3845|Transcript|XM_005253365.1|protein_coding|2/5||XM_005253365.1:c.34G>C|XP_005253422.1:p.Gly12Arg|231|34|12|G/R|Ggt/Cgt|||-1||SNV|EntrezGene|||C|C||||chr12:g.25398285C>G

thanks a lot

ens-emily commented 4 years ago

@wentgithub As I said earlier, all of the terms listed in the boxes that point to a rectangle, either a coloured rectangle or an empty rectangle are exonic. That means everything in the three boxes you have highlighted plus 5' UTR and 3' UTR.

ens-emily commented 4 years ago

@2236529177

"%3D" is the URL-encoded form of "=" (3D is the hexadecimal ASCII code for "="), see http://www.asciitable.com/. Somehow the application you've opened this in has converted this. It means the variants are synonymous.
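If the escaped form gets in the way downstream, the value can be decoded after reading it from the VCF. A minimal Python sketch, using the standard library's urllib.parse.unquote; the helper name and the example HGVS string are made up for illustration. The VEP options page also describes a --no_escape flag for HGVS strings, which may be worth checking for your version, bearing in mind that "=" has a reserved meaning inside a VCF INFO field.

```python
from urllib.parse import unquote


def decode_hgvs(value: str) -> str:
    """Decode percent-escaped characters such as %3D in an HGVS string.

    Hypothetical helper; it simply wraps urllib.parse.unquote.
    """
    return unquote(value)


# Illustrative value only, not taken from a real annotation.
print(decode_hgvs("ENSP00000000000.1:p.Gly12%3D"))
```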

The VEP will give you every effect on every transcript a variant hits, which is why you see multiple results. The GRCh37 database is out of date, so only gives you data for the older versions of that RefSeq transcript: NM_004985.3 and NM_004985.4. To get up-to-date data we recommend you update your data to the newer GRCh38 genome and access the current database.
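To see at a glance which versions of each transcript a given cache has annotated, the transcript IDs can be pulled out of the ANN value. This is only a sketch in Python: it assumes the transcript ID is the seventh pipe-separated field, as in the KRAS record posted above (the real order is given in the VCF header), the function name is hypothetical, and the example entries are shortened from that record.

```python
from collections import defaultdict


def transcript_versions(info_value, feature_index=6):
    """Map each transcript accession to the list of versions seen in an ANN value.

    feature_index is an assumption based on the records in this thread.
    """
    versions = defaultdict(list)
    for entry in info_value.split(","):
        feature = entry.split("|")[feature_index]
        # "NM_004985.3" -> accession "NM_004985", version "3"
        accession, _, version = feature.partition(".")
        versions[accession].append(version)
    return dict(versions)


# Entries shortened from the KRAS record in this thread.
KRAS_ANN = (
    "G|missense_variant|MODERATE|KRAS|3845|Transcript|NM_004985.3|protein_coding,"
    "G|missense_variant|MODERATE|KRAS|3845|Transcript|NM_004985.4|protein_coding,"
    "G|missense_variant|MODERATE|KRAS|3845|Transcript|NM_033360.2|protein_coding"
)

found = transcript_versions(KRAS_ANN)
print(found)
print("NM_004985.5 annotated:", "5" in found.get("NM_004985", []))
```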

worker000000 commented 4 years ago

Yes, I see that. I also do not know why it is %3D rather than =. Whether I open the file in Linux, Visual Studio, or Office, it shows %3D.
As you know, ANNOVAR also has GRCh37 and GRCh38 versions, but it gives the right transcript version. In clinical gene testing GRCh37 is the most widely used, so I have to use 37 instead of 38. I guess there may be some other reason why VEP misses that particular transcript.

thanks a lot

ima23 commented 4 years ago

I will close this ticket now as the questions have been answered. Please feel free to open another ticket or send an email to helpdesk@ensembl.org if you have any more questions. Kind regards, Irina

Ensembl / ensembl-vep

can I install different version of vep in different directory? #659
