Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Cannot annotate dbSNP (rs identifiers) and failure to annotate when --fields specified #1015

Closed AndreaG5 closed 3 years ago

AndreaG5 commented 3 years ago

Describe the issue

Hello everyone, I am using the VEP Docker image. I am able to annotate my VCF, but it fails to annotate rs identifiers (I understood they are included in the cache). I manually downloaded the cache, some plugins and custom annotations. It seems to me that dbSNP is included in the cache, since I didn't find it among either the plugins or the custom databases. I would like to know whether I am missing something or whether there is an error.

The second issue is related to the --fields option. VEP annotates the VCF correctly when I don't specify --fields. When I try to "reorder" the columns according to my list, it fails to annotate every field within the INFO column.

Additional information

No Errors, No Warnings when running.

System

Full VEP command line

./vep -i ./prova_input_1.vcf -o ./prova_output_1.vcf --force_overwrite --symbol --hgvs --merged --offline --plugin ExAC,./ExAC.0.3.GRCh37.vcf.gz -dir_plugins ./Plugins --dir_cache ./vep/ -a GRCh37 --custom ./clinvar.vcf.gz,clinvar,vcf,exact,CLNSIG,CLNREVSTAT,CLNDN --tab --fields "Uploaded_variation,Location,Allele,GIVEN_REF,Symbol,Gene,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,Exac_AF,Exac_AF_AFR,Exac_AF_AMR,Exac_AF_EAS,Exac_AF_FIN,Exac_AF_NFE,Exac_AF_SAS,Exac_AF_OTH,Exac_AF_Adj,clinvar,clinvar_CLNREVSTAT,clinvar_CLNDN"

Thank you!

dglemos commented 3 years ago

Hi @AndreaG5, dbSNP identifiers are included in the cache. Can you try running vep with the option --check_existing? Could you please share your output header fields?

AndreaG5 commented 3 years ago

Hi @dglemos, thank you so much! Using --check_existing solved my issue with the dbSNP identifiers!

Here are the header fields (trying to solve the second issue, i.e. annotating with the --fields option):

Uploaded_variation | Location | Allele | Gene | feature | Feature_type | Consequence | cDNA_position | CDS_position | Protein_position | Amino_acids | Codons | Existing_variation | Extra (IMPACT,STRAND,SYMBOL,SYMBOL_SOURCE,HGNC_ID,HGVSC,HGVSP,CLIN_SIG,EXAC_FREQ,clinvar,clinvar_CLNREVSTAT)

dglemos commented 3 years ago

The option --check_existing adds extra fields to the output. I can't see the fields SOMATIC and PHENO in your header. Is this the header after you ran vep with --check_existing?

Related to the missing fields in the output, you could try using --fields with just a few headers and check if the output includes those, for example start with: --fields "Uploaded_variation,Location,Allele"

AndreaG5 commented 3 years ago

My bad, SOMATIC and PHENO were present!

When I use the --fields option it performs the "reorder", but it is unable to annotate any information. Every column in INFO (or Extra) is split and ordered according to my list, but they are all filled with "-". So, for example, no frequencies are available, no gene symbol, etc.

example:

Here is the result omitting the --fields option: [screenshot: Cattura1]

Here is the result with the --fields option: [screenshot: Cattura]

(I am sorry for the snapshots, but it wasn't easy to fit all the columns in a single picture.)

Thanks

dglemos commented 3 years ago

Thanks for the images, they make it easier to understand what's going on. --fields only works with tab or VCF format output. In your vep command line you are using --tab, but your output is not in tab format; it seems to be the VEP default output. Can you run vep and make sure you are using --tab?

AndreaG5 commented 3 years ago

Yes, I am sorry, I posted different outputs without specifying which was which. I tried both --vcf and --tab. I used --fields along with --tab (the second snapshot refers to that). The output is the same in both cases and is the one shown in the second picture.

dglemos commented 3 years ago

You don't have Exac_AF (...) in your output header. Your Extra columns are only IMPACT,STRAND,SYMBOL,SYMBOL_SOURCE,HGNC_ID,HGVSC,HGVSP,CLIN_SIG,EXAC_FREQ,clinvar,clinvar_CLNREVSTAT,SOMATIC,PHENO. Only these columns are going to have an annotation in your output file.

AndreaG5 commented 3 years ago

YES I KNOW. It's just because the "Extra" field is very long (the ExAC information is present). To get an idea, just look at SYMBOL: while it is present and correctly annotated in the first snapshot, it is missing in the second.

dglemos commented 3 years ago

I didn't notice the symbol was missing, sorry about that. In --fields you have to use exactly the same header names as in the VEP output. In the VEP output it is SYMBOL (capital letters), but in your --fields list it is Symbol. The same applies to the other headers.
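A quick way to catch this kind of mismatch before running VEP is to compare the requested --fields names against the header names from a run without --fields. The snippet below is only a rough sketch: check_fields is a hypothetical helper, not part of VEP, and the header list is copied from the names posted earlier in this thread.

```python
def check_fields(requested, available):
    """Sort requested --fields names into exact matches, names that differ
    only by letter case, and names not present in the header at all.

    Hypothetical helper for illustration; not part of VEP.
    """
    available = list(available)
    available_set = set(available)
    # Map lower-cased header names back to their real spelling.
    by_lower = {name.lower(): name for name in available}

    exact, case_mismatch, unknown = [], {}, []
    for name in requested:
        if name in available_set:
            exact.append(name)
        elif name.lower() in by_lower:
            case_mismatch[name] = by_lower[name.lower()]
        else:
            unknown.append(name)
    return exact, case_mismatch, unknown


# Header names as posted in this thread (main columns plus Extra keys).
HEADER = [
    "Uploaded_variation", "Location", "Allele", "Gene", "Feature_type",
    "Consequence", "cDNA_position", "CDS_position", "Protein_position",
    "Amino_acids", "Codons", "Existing_variation",
    "IMPACT", "STRAND", "SYMBOL", "SYMBOL_SOURCE", "HGNC_ID", "HGVSC",
    "HGVSP", "CLIN_SIG", "EXAC_FREQ", "clinvar", "clinvar_CLNREVSTAT",
    "SOMATIC", "PHENO",
]

# A shortened version of the --fields list from the original command.
REQUESTED = "Uploaded_variation,Location,Allele,Symbol,Gene,Exac_AF".split(",")

exact, case_mismatch, unknown = check_fields(REQUESTED, HEADER)
print("exact:", exact)
print("case mismatch (requested -> header):", case_mismatch)
print("not in header:", unknown)
```

Names reported as a case mismatch only need their spelling corrected; names reported as not in the header will stay empty in the output regardless of how they are written.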

AndreaG5 commented 3 years ago

Oh I am sorry, I was pretty sure I used the correct header. Thank you so much for your quick and great response!

P.S. Can I go a little off-topic and ask whether there is a way to split the Location field into Chromosome and Position (two different columns) directly from the launch command?

Again, thank you so much!

dglemos commented 3 years ago

I'm glad the issue is sorted out. Unfortunately, there is no option to split the column but if you use VCF format output the chromosome and position are in two different columns.
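If the tab output is needed anyway, the split can be done as a post-processing step. This is a minimal sketch, assuming the tab output has metadata lines starting with "##", a single header line starting with "#", and Location values written as chr:position or chr:start-end. The function names are made up for this example and are not part of VEP.

```python
def split_location(location):
    """Split 'chr:pos' or 'chr:start-end' at the last colon."""
    chrom, _, pos = location.rpartition(":")
    return chrom, pos


def split_location_column(lines):
    """Replace the Location column with Chromosome and Position columns.

    Lines starting with '##' are passed through unchanged; the first
    other line is treated as the column header.
    """
    out = []
    loc_idx = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            out.append(line)
            continue
        cols = line.split("\t")
        if loc_idx is None:
            # Header line: remember where Location sits, keep the leading '#'.
            prefix = "#" if cols[0].startswith("#") else ""
            names = [c.lstrip("#") for c in cols]
            loc_idx = names.index("Location")
            names[loc_idx:loc_idx + 1] = ["Chromosome", "Position"]
            out.append(prefix + "\t".join(names))
        else:
            cols[loc_idx:loc_idx + 1] = split_location(cols[loc_idx])
            out.append("\t".join(cols))
    return out


# Made-up rows, only to show the shape of the transformation.
example = [
    "## ENSEMBL VARIANT EFFECT PREDICTOR",
    "#Uploaded_variation\tLocation\tAllele",
    "var1\t1:12345\tA",
    "var2\t2:100-105\t-",
]
for row in split_location_column(example):
    print(row)
```

Splitting at the last colon keeps sequence names that themselves contain a colon intact, since the coordinate part never does.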

AndreaG5 commented 3 years ago

Ok thank you so much, Have a good day!