Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Output of VEP (variant_effect_output.txt) for covid-19 has only 1 variation consequence while tool snpEff gives 12 variation consequences #725

Closed etapanari closed 4 years ago

etapanari commented 4 years ago

Hi there,

I have run VEP for covid-19 using 1) a GFF file, 2) a genome reference file and 3) a variation file in Ensembl format,

and I get back a very poor variation annotation.

Is it something that I am doing wrong? Is it maybe that VEP doesn't work properly for viral genomes?

A colleague tried another variation annotation tool snpEff and it gave her a much richer annotation for the variations.

I have the feeling that VEP ignores the GFF file I provide, because I don't even see transcript or gene annotations in the results.

aparton commented 4 years ago

Hi,

VEP does support the use of GFF files for custom annotations with the --custom flag. You can find more information about this here: https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html

If you're still having trouble, please send me a copy of your GFF file and your input and I can take a closer look.

Kind Regards, Andrew

etapanari commented 4 years ago

Thanks Andrew, I run VEP like this:

vep -i covid_19_variation_vep.txt -gff covid_19.gff.gz --fasta NC_045512v2.fa.masked.gz --verbose --species covid-19

Do I need to add the flag --custom ?

aparton commented 4 years ago

Hi Electra,

Your current command looks fine. The --gff flag uses the --custom functionality.

I’m happy to take a closer look if you’re able to send me a sample of your input files which I can use to reproduce the issue.

Kind Regards, Andrew

etapanari commented 4 years ago

Hi Andrew,

Thanks so much for the prompt reply. This is the head of the GFF file, which I have sorted, compressed and tabix-indexed:

zcat covid_19.gff.gz | head
NC_045512v2 RefSeq five_prime_UTR 1 265 . + . ID=id-NC_045512v2:1..265;gbkey=5'UTR
NC_045512v2 RefSeq region 1 29903 . + . ID=NC_045512v2:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens
NC_045512v2 RefSeq CDS 266 13468 . + 0 ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;locus_tag=GU280_gp01;product=orf1ab polyprotein;protein_id=YP_009724389.1
NC_045512v2 RefSeq CDS 266 13483 . + 0 ID=cds-YP_009725295.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009725295.1,GeneID:43740578;Name=YP_009725295.1;Note=pp1a;gbkey=CDS;gene=orf1ab;locus_tag=GU280_gp01;product=orf1a polyprotein;protein_id=YP_009725295.1
NC_045512v2 RefSeq gene 266 21555 . + . ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01
NC_045512v2 RefSeq CDS 13468 21555 . + 0 ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;locus_tag=GU280_gp01;product=orf1ab polyprotein;protein_id=YP_009724389.1
NC_045512v2 RefSeq CDS 21563 25384 . + 0 ID=cds-YP_009724390.1;Parent=gene-GU280_gp02;Dbxref=Genbank:YP_009724390.1,GeneID:43740568;Name=YP_009724390.1;Note=structural protein%3B spike protein;gbkey=CDS;gene=S;locus_tag=GU280_gp02;product=surface glycoprotein;protein_id=YP_009724390.1
NC_045512v2 RefSeq gene 21563 25384 . + . ID=gene-GU280_gp02;Dbxref=GeneID:43740568;Name=S;gbkey=Gene;gene=S;gene_biotype=protein_coding;locus_tag=GU280_gp02
NC_045512v2 RefSeq CDS 25393 26220 . + 0 ID=cds-YP_009724391.1;Parent=gene-GU280_gp03;Dbxref=Genbank:YP_009724391.1,GeneID:43740569;Name=YP_009724391.1;gbkey=CDS;gene=ORF3a;locus_tag=GU280_gp03;product=ORF3a protein;protein_id=YP_009724391.1
NC_045512v2 RefSeq gene 25393 26220 . + . ID=gene-GU280_gp03;Dbxref=GeneID:43740569;Name=ORF3a;gbkey=Gene;gene=ORF3a;gene_biotype=protein_coding;locus_tag=GU280_gp03

aparton commented 4 years ago

Hi,

If it would be possible, could you please send the files to helpdesk@ensembl.org and I can pick them up from there?

Thanks, Andrew

etapanari commented 4 years ago

sure!

aparton commented 4 years ago

Hi,

Apologies for the delay in getting back to you. I've taken a look at these differences this morning, and it seems as if the issue is with the GFF file - VEP is expecting lines of type 'transcript' and 'exon' to allow it to construct the transcript model required to annotate your variants.

You can see an example of the gene, transcript, exon and CDS model format that VEP expects within GFF files here: https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gfftypes

If you have any further questions, please let us know.

Kind Regards, Andrew
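
For readers hitting the same problem, a quick way to see whether a GFF file lacks the feature types mentioned in the reply above is to count the values in its third column. The following is only a minimal sketch: the function names are made up for illustration, and the list of expected types is taken from this thread rather than from VEP's code, whose documentation may accept further transcript-like types.

```python
from collections import Counter

# Feature types named in the reply above. Treat this tuple as an assumption:
# VEP's documentation may list additional accepted transcript-like types.
EXPECTED_TYPES = ("gene", "transcript", "exon", "CDS")


def count_feature_types(gff_lines):
    """Count the column-3 feature types, skipping comments and blank lines."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def missing_types(gff_lines, expected=EXPECTED_TYPES):
    """Return the expected feature types that never occur in the file."""
    counts = count_feature_types(gff_lines)
    return [ftype for ftype in expected if counts[ftype] == 0]


# Two tab-separated records shortened from the file head posted in this thread.
example = [
    "NC_045512v2\tRefSeq\tgene\t21563\t25384\t.\t+\t.\tID=gene-GU280_gp02;Name=S",
    "NC_045512v2\tRefSeq\tCDS\t21563\t25384\t.\t+\t0\tID=cds-YP_009724390.1;Parent=gene-GU280_gp02",
]
print(missing_types(example))
```

On a RefSeq-style file like the one posted earlier, this reports that 'transcript' and 'exon' are absent, which matches the diagnosis above.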

etapanari commented 4 years ago

Hi Andrew,

Thanks a lot for your help! I have edited the GFF to include transcript and exon lines and now it works!

Best regards, Electra
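
The edited GFF file was not posted in this thread. The sketch below shows one possible way to add the missing lines, not necessarily what was done here. It assumes a RefSeq-style file in which each CDS points directly at a gene, and it invents one transcript and one exon per distinct CDS ID, spanning all segments of that CDS. The `transcript-` and `exon-` ID prefixes, the `biotype=protein_coding` attribute and the single-exon model are all assumptions for illustration, and ribosomal slippage in orf1ab gets no special handling.

```python
def parse_attrs(text):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict."""
    return dict(item.split("=", 1) for item in text.split(";") if "=" in item)


def format_attrs(attrs):
    """Serialise an attribute dict back into GFF3 column-9 syntax."""
    return ";".join(f"{key}={value}" for key, value in attrs.items())


def add_transcripts_and_exons(gff_lines, biotype="protein_coding"):
    """Insert a transcript and an exon for every CDS whose Parent is a gene."""
    records = [
        line.rstrip("\n").split("\t")
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    ]
    gene_ids = {parse_attrs(r[8]).get("ID") for r in records if r[2] == "gene"}
    gene_ids.discard(None)

    # Overall span (min start, max end) and parent gene of each such CDS ID.
    spans = {}
    for r in records:
        attrs = parse_attrs(r[8])
        cds_id = attrs.get("ID")
        if r[2] != "CDS" or cds_id is None or attrs.get("Parent") not in gene_ids:
            continue
        start, end = int(r[3]), int(r[4])
        if cds_id in spans:
            spans[cds_id][0] = min(spans[cds_id][0], start)
            spans[cds_id][1] = max(spans[cds_id][1], end)
        else:
            spans[cds_id] = [start, end, attrs["Parent"]]

    out, emitted = [], set()
    for r in records:
        attrs = parse_attrs(r[8])
        cds_id = attrs.get("ID")
        if r[2] == "CDS" and cds_id in spans:
            tx_id = f"transcript-{cds_id}"  # hypothetical naming scheme
            if cds_id not in emitted:
                start, end, gene_id = spans[cds_id]
                coords = [str(start), str(end), ".", r[6], "."]
                tx_attrs = f"ID={tx_id};Parent={gene_id};biotype={biotype}"
                exon_attrs = f"ID=exon-{cds_id};Parent={tx_id}"
                out.append("\t".join([r[0], r[1], "transcript"] + coords + [tx_attrs]))
                out.append("\t".join([r[0], r[1], "exon"] + coords + [exon_attrs]))
                emitted.add(cds_id)
            # Re-parent the CDS from the gene to the new transcript.
            attrs["Parent"] = tx_id
            r = r[:8] + [format_attrs(attrs)]
        out.append("\t".join(r))
    return out


# Spike gene records shortened from the file head posted in this thread.
example = [
    "NC_045512v2\tRefSeq\tCDS\t21563\t25384\t.\t+\t0\tID=cds-YP_009724390.1;Parent=gene-GU280_gp02",
    "NC_045512v2\tRefSeq\tgene\t21563\t25384\t.\t+\t.\tID=gene-GU280_gp02;Name=S",
]
for line in add_transcripts_and_exons(example):
    print(line)
```

The output is not re-sorted, so the file would need to be sorted, compressed and tabix-indexed again before use with VEP.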

stefanches7 commented 4 years ago

Hello @etapanari,

can you share this updated annotation with "transcript" and "exon" lines?

Thanks in advance and best regards, Stefan

aparton commented 4 years ago

Hi @stefanches7,

Just in case you're interested, we now have an Ensembl COVID-19 site where you can find a GFF file that is supported in VEP - https://covid-19.ensembl.org/info/data/ftp/index.html

Kind Regards, Andrew

stefanches7 commented 4 years ago

Thanks @aparton, this is quite useful! The file was hard to find at first, because I tried to "Export data via the website" but found no "Export data" button. https://covid-19.ensembl.org/downloads.html also seems to point only to the help page, not to the actual data.

The other point is the subproteins within the pp1ab polyprotein. In my group, for instance, we are interested in variant consequences at the protein level, so it might be useful if the annotation also contained the respective fields for the non-structural proteins. For now I've done this manually, by fetching the UCSC uniProtCov table and converting the information to GFF3 format.

Best regards, Stefan
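
The conversion script was not shared in this thread. As a rough illustration of the kind of conversion described, the sketch below turns BED-like rows into GFF3 lines. It assumes the table was exported with standard BED columns (chrom, 0-based chromStart, chromEnd, name, score, strand). The real uniProtCov table may have more or different fields. The `source` and `feature_type` defaults are placeholders chosen for this example, and the sample row is illustrative.

```python
from urllib.parse import quote


def bed_to_gff3(bed_lines, source="UCSC", feature_type="mature_protein_region"):
    """Convert BED-like rows to GFF3 lines (0-based half-open to 1-based closed)."""
    out = ["##gff-version 3"]
    for line in bed_lines:
        # Skip blanks, comments and UCSC track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start0, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start0 + 1}-{end}"
        strand = fields[5] if len(fields) > 5 else "."
        # Percent-encode characters that are reserved in GFF3 attribute values.
        safe_name = quote(name, safe=" ")
        out.append("\t".join([
            chrom, source, feature_type,
            str(start0 + 1),  # BED start is 0-based, GFF3 start is 1-based
            str(end),         # BED end is exclusive, so it equals the GFF3 end
            ".", strand, ".",
            f"ID={safe_name};Name={safe_name}",
        ]))
    return out


# Illustrative row only; check coordinates against the actual table.
example = ["NC_045512v2\t265\t805\tnsp1\t0\t+"]
for line in bed_to_gff3(example):
    print(line)
```

Features produced this way are not linked to any transcript. How VEP treats them is outside the scope of this sketch.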

aparton commented 4 years ago

Hi @stefanches7,

Thank you for your feedback, I've passed it on to the appropriate people.

I'm going to close this ticket now. If you have any further questions, please feel free to reopen it or open a new one.

Kind Regards, Andrew