Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
453 stars 152 forks source link

dbNSFP35a is not working #258

Closed JIBINJV closed 5 years ago

JIBINJV commented 6 years ago

Hi, I downloaded dbNSFPv3.5a and processed it as mentioned in "https://github.com/Ensembl/VEP_plugins/blob/release/93/dbNSFP.pm".
However I am getting following error, please help me to solve it

Jibin

Version: ensembl-vep:release_93

Error message

Use of uninitialized value in concatenation (.) or string at /home/jj/.vep/Plugins/dbNSFP.pm line 230. Use of uninitialized value in concatenation (.) or string at /home/jj/.vep/Plugins/dbNSFP.pm line 230. Argument "." isn't numeric in numeric eq (==) at /home/jj/.vep/Plugins/dbNSFP.pm line 283. Argument "." isn't numeric in numeric eq (==) at /home/jj/.vep/Plugins/dbNSFP.pm line 283.

Command Used for processing

wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFPv3.5a.zip unzip dbNSFPv3.5a.zip head -n1 dbNSFP3.5a_variant.chr1 > h cat dbNSFP3.5a_variant.chr* | grep -v ^#chr | awk '$8 != "."' | sort -k8,8 -k9,9n - | cat h - | bgzip -c > dbNSFP_hg19.gz tabix -s 8 -b 9 -e 9 dbNSFP_hg19.gz

at7 commented 6 years ago

Hi Jibin, can you please confirm if you are working on GRCh37? If yes, could you please share the command you run? Thanks, Anja

JIBINJV commented 6 years ago

Hi Anja, Thank you for the reply,
Yes I am working with GRCh37

Command I used given below

/home/jj/Software/ensembl-vep/./vep -i MNS01-chr-1_3.recode.vcf --cache -sift b --polyphen b --symbol --numbers --biotype --total_length -o MNS01-chr-1_3.vep --vcf --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE --port 3337 --plugin dbNSFP,/home/jibin/Software/dbNSFP_hg19.gz,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Uniprot_acc_Polyphen2,Uniprot_id_Polyphen2,Uniprot_aapos_Polyphen2,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,LRT_score,LRT_converted_rankscore,LRT_pred,LRT_Omega,MutationTaster_score,MutationTaster_converted_rankscore,MutationTaster_pred,MutationTaster_model,MutationTaster_AAE,MutationAssessor_UniprotID,MutationAssessor_variant,MutationAssessor_score,MutationAssessor_score_rankscore,MutationAssessor_pred,FATHMM_score,FATHMM_converted_rankscore,FATHMM_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,Transcript_id_VEST3,Transcript_var_VEST3,VEST3_score,VEST3_rankscore,MetaSVM_score,MetaSVM_rankscore,MetaSVM_pred,MetaLR_score,MetaLR_rankscore,MetaLR_pred,Reliability_index,M-CAP_score,M-CAP_rankscore,M-CAP_pred,REVEL_score,REVEL_rankscore,MutPred_score,MutPred_rankscore,MutPred_protID,MutPred_AAchange,MutPred_Top5features,CADD_raw,CADD_raw_rankscore,CADD_phred,DANN_score,DANN_rankscore,fathmm-MKL_coding_score --force_overwrite

at7 commented 6 years ago

Is it possible that some of your input variants have a . as their start position? Can you also test your dbNSFP file by looking at a region with: tabix dbNSFP_hg19.gz 2:45686-45686?

JIBINJV commented 6 years ago

I checked the vcf file but it doesn't have "." in the start position.

However I was able to run dbNSFP3.5a and produce the annotated file when I didn't use "GM12878_fitCons_score_rankscore, H1-hESC_fitCons_score_rankscore, HUVEC_fitCons_score_rankscore, integrated_fitCons_score_rankscore and MutationAssessor_score_rankscore" options with dbnsfp

at7 commented 6 years ago

Thank you for letting me know. We do have a fix for the first warning message: https://github.com/Ensembl/VEP_plugins/pull/88 I merged it also to release/93 branch. If you update your checkout the message should go away.

I tried to replicate the second type of warning message (Argument "." isn't numeric in numeric..) I was running your command including all the dbNSFP headers you are using on a VCF file which contains variants on all the chromosomes. I only got the the first warning.

Can you please provide an example input line for which you got the second type of warning message?

at7 commented 5 years ago

I'm closing the issue. But please reopen it if you have an further related questions or open a new issue thread.

PedroBarbosa commented 5 years ago

@at7 I'm having an issue with dbNSFP plugin as well. I decided not to open a new issue, but let me know if it is more appropriate:

I apply VEP using my local cache (ensembl 94), using the GRCh37 version. I downloaded the dbNSFPv3.5a file and proceeded with the instructions in the perl plugin to get a final file to use with hg19 coordinates. It runs smoothly, but constantly I find a lot of unpredictable cases, even when I'm dealing with well known misense mutations. Since I also have CADD scores stored locally, I decided to compare the CADD scores obtained directly using CADD files vs CADD scores that resulted from dbNSFP . As you can see below, for a small set of well known pathogenic mutations, there is an huge discrepancy between the outputs, where the first column represents the scores from CADD files, whereas the second stands for dbNFSP. Why such differences? I'm afraid the same rationale applies for all the other scores that dbNSFP stores, so I may be overlooking a great number of variants just because there is some problem with dbNSFP.

This is the command I used: vep -i $variants -o $output_vcf --compress_output bgzip --stats_text --hgvs --hgvsg --pubmed --cac he --dir /media/cache --offline --vcf --numbers --regulatory --sift b --polyphen b --force_overwrite -a GRCh37 --variant_class --vcf_info_field ANN --plugin CADD,/media/custom_data/cadd/whole_genome_SNVs.tsv.gz,/media/custom_data/cadd/InDels.tsv.gz --plugin dbNSFP,/media/custom_data/dbNSFP/hg19/dbNSFP_hg19.gz,GERP++_RS,phyloP20way_mammalian,SiPhy_29way_logOdds,phastCons20way_mammalian,MutationAssessor_score,SIFT_score,Polyphen2_HDIV_ score,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_pred,MutationTaster_score,MutationTaster_pred,MutationTaster_model,MutationTaster_AAE,FATHMM_score,FATHMM_pred,fathmm-MKL_coding_score,fathmm-MKL_coding_pred,PROVEAN_score,PROVEAN_pred,CADD_phred,DANN_score,Eigen-phred,Eigen-PC-phred,MetaSVM_score,MetaSVM_pred,MetaLR_score,MetaLR_pred,REVEL_score,GM12878_fitCons_score,H1-hESC_fitCons_score,HUVEC_fitCons_score_rankscore,LRT_score,LRT_pred,MutPred_score,VEST3_score,M-CAP_score

And these are the CADD scores:

CADD_PHRED      CADD_phred
28.2    28.2
14.76   None
13.15   None
25.9    25.9
35      35
14.70   None
38      38
None    None
20.7    None
26.4    26.4
None    None
25.7    None
28.3    None
None    None
25.1    None
25.4    None
23.2    None
24.8    None
25.9    25.9
23.2    None
27.5    None
27.5    None
24.0    None
27.6    None
22.8    None
None    None
35      35
29.9    29.9
25.8    None
22.7    None
23.9    None
23.9    None
28.6    None
27.3    None
23.9    None
25.5    None
None    None
26.5    None
38      38
None    None
10.23   10.23
25.2    None
23.2    None
27.8    27.8
37      37
12.84   None
12.59   None
33      33
43      43
46      46
31      31
15.91   15.91
35      35
34      34
26.7    26.7
33      33
15.36   None
25.7    None
25.7    25.7
28.3    28.3
14.66   None
41      41
24.9    24.9
32      32
27.1    27.1
35      None
35      None
23.4    None
35      None
27.3    None
29.9    None
34      None
None    None
12.50   None
34      None

Do you have any explanation for this behavior? Thanks, Pedro Barbosa

at7 commented 5 years ago

Hi Pedro, I will take a look and get back to you soon. Anja

at7 commented 5 years ago

I annotated a few variants with the CADD and dbNSFP plugins and always got results from both files. Can you please share some of your input variants for which you only get results from the CADD plugin? Thank you.

PedroBarbosa commented 5 years ago

Please find attached the small set of variants I'm testing. variants.vep.input.txt

Yesterday, I updated my VEP installation (to include Refseq annotations), which also included the update of all VEP plugins. Suprinsingly, now I can't reproduce the same list as yesterday, as only 3 variants are annotated using the local CADD (even less then using dbNSFP)

CADD_PHRED      CADD_phred
None    28.2
None    None
None    None
None    25.9
None    35
None    None
None    38
None    None
None    None
None    26.4
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    25.9
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    35
None    29.9
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    38
None    10.23
None    None
None    None
None    None
None    27.8
None    37
None    None
None    None
None    33
None    43
None    46
None    31
None    15.91
None    35
None    34
None    26.7
None    33
None    None
None    None
None    25.7
None    28.3
None    None
None    41
24.9    24.9
32      32
27.1    27.1
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    None
None    None
at7 commented 5 years ago

Can you please check that the strands of your input variants are set correctly? A lot of the variants I tested are already on the forward strand but in the input file the strand is given as reverse. Let me know if you are still missing annotations after you have corrected the strands.

PedroBarbosa commented 5 years ago

I see, the alleles were representative of the positive strand, while the strand field displayed the strand of the gene itself, thanks.

Sill, I corrected the strands, and the differences remain (CADD vs dbNSFP)

at7 commented 5 years ago

Can you please send me the updated input file? It is also worth mentioning that the dbNSFP plugin only returns values for variants of the following consequence types: missense_variant stop_lost stop_gained start_lost.

PedroBarbosa commented 5 years ago

It is the problem indeed. The ones missing in dbNSFP are not missense_variant stop_lost stop_gained start_lost.

Out of curiosity, why does the VEP plugin discards the other types ? As far as I'm aware, dbNSFP file reports splicing related consequences.

at7 commented 5 years ago

That is a good point. We initially chose to filter by consequence type for speed purpose. But we need to revise this and allow for more consequence types to be included. I will run some tests before I submit a change to the plugin code. If you want to already remove the filter just remove this line from the plugin. I'm aiming to get this improved for the next ensembl release. I will also post any updates here. Thank you.

PedroBarbosa commented 5 years ago

Well by removing that line I mostly get None values.

I'll be waiting for the update. Thank you

MartaRus commented 5 years ago

@at7 I'm hav

@at7 I'm having an issue with dbNSFP plugin as well. I decided not to open a new issue, but let me know if it is more appropriate:

I apply VEP using my local cache (ensembl 94), using the GRCh37 version. I downloaded the dbNSFPv3.5a file and proceeded with the instructions in the perl plugin to get a final file to use with hg19 coordinates. It runs smoothly, but constantly I find a lot of unpredictable cases, even when I'm dealing with well known misense mutations. Since I also have CADD scores stored locally, I decided to compare the CADD scores obtained directly using CADD files vs CADD scores that resulted from dbNSFP . As you can see below, for a small set of well known pathogenic mutations, there is an huge discrepancy between the outputs, where the first column represents the scores from CADD files, whereas the second stands for dbNFSP. Why such differences? I'm afraid the same rationale applies for all the other scores that dbNSFP stores, so I may be overlooking a great number of variants just because there is some problem with dbNSFP.

This is the command I used: vep -i $variants -o $output_vcf --compress_output bgzip --stats_text --hgvs --hgvsg --pubmed --cac he --dir /media/cache --offline --vcf --numbers --regulatory --sift b --polyphen b --force_overwrite -a GRCh37 --variant_class --vcf_info_field ANN --plugin CADD,/media/custom_data/cadd/whole_genome_SNVs.tsv.gz,/media/custom_data/cadd/InDels.tsv.gz --plugin dbNSFP,/media/custom_data/dbNSFP/hg19/dbNSFP_hg19.gz,GERP++_RS,phyloP20way_mammalian,SiPhy_29way_logOdds,phastCons20way_mammalian,MutationAssessor_score,SIFT_score,Polyphen2_HDIV_ score,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_pred,MutationTaster_score,MutationTaster_pred,MutationTaster_model,MutationTaster_AAE,FATHMM_score,FATHMM_pred,fathmm-MKL_coding_score,fathmm-MKL_coding_pred,PROVEAN_score,PROVEAN_pred,CADD_phred,DANN_score,Eigen-phred,Eigen-PC-phred,MetaSVM_score,MetaSVM_pred,MetaLR_score,MetaLR_pred,REVEL_score,GM12878_fitCons_score,H1-hESC_fitCons_score,HUVEC_fitCons_score_rankscore,LRT_score,LRT_pred,MutPred_score,VEST3_score,M-CAP_score

And these are the CADD scores:

CADD_PHRED      CADD_phred
28.2    28.2
14.76   None
13.15   None
25.9    25.9
35      35
14.70   None
38      38
None    None
20.7    None
26.4    26.4
None    None
25.7    None
28.3    None
None    None
25.1    None
25.4    None
23.2    None
24.8    None
25.9    25.9
23.2    None
27.5    None
27.5    None
24.0    None
27.6    None
22.8    None
None    None
35      35
29.9    29.9
25.8    None
22.7    None
23.9    None
23.9    None
28.6    None
27.3    None
23.9    None
25.5    None
None    None
26.5    None
38      38
None    None
10.23   10.23
25.2    None
23.2    None
27.8    27.8
37      37
12.84   None
12.59   None
33      33
43      43
46      46
31      31
15.91   15.91
35      35
34      34
26.7    26.7
33      33
15.36   None
25.7    None
25.7    25.7
28.3    28.3
14.66   None
41      41
24.9    24.9
32      32
27.1    27.1
35      None
35      None
23.4    None
35      None
27.3    None
29.9    None
34      None
None    None
12.50   None
34      None

Do you have any explanation for this behavior? Thanks, Pedro Barbosa

Hi @PedroBarbosa I would like to ask you an help in order to sort the dbNSFP3.5 database with Hg19 coordinates. I run the following comand: cat dbNSFP3.5a_variant.chr* \ | /path/to//scripts/dbNSFP_sort.pl 7 8 \

dbNSFP3.5a.hg19.txt bgzip dbNSFP3.5a.hg19.txt tabix -s 1 -b 2 -e 2 dbNSFP3.5a.hg19.txt.gz

but if I open one of the dbNSFP_variant.chr* file, I see that the Hg19 coordinates are in the columns 8 and 9 (#chr pos(1-based) ref alt aaref aaalt rs_dbSNP150 hg19_chr hg19_pos(1-based) hg18_chr hg18_pos(1-based) genename cds_strand refcodon codonpos codon_degeneracy ) Do I need to change the comand launched?

thank you

MartaRus commented 5 years ago

Hi @at7 or @PedroBarbosa can you help me with the dbNSFP3.5? I'm using the database with the Hg19 coordinates and I'm trying to annotate a vcf but there is something wrong: if I annotate with the 3.5 version I obtained only a few annotations, indeed with the 2.9 version I annotated all the variants. Here an example: grep SIFT_pred myfile_2.9.vcf | wc -l 9416 grep SIFT_pred myfile_3.5.vcf | wc -l 376

Why?

thank you in advance

at7 commented 5 years ago

@MartaRus, could you please send me a few examples for which you get a result for 2.9 but not for 3.5? Thank you

MartaRus commented 5 years ago

@at7 thank you for your reply! Meanwhile I understood where is the problem but I am not able to find a solution! Maybe you can help me! The script is:

cat dbNSFP3.5a_variant.chr* \ | /mnt/storage/FreeNGS/Prog/Resources/Bin/snpEff_latest/scripts/dbNSFP_sort.pl 7 8 \

dbNSFP3.5a.hg19_2.txt bgzip dbNSFP3.5a.hg19_2.txt tabix -s 1 -b 2 -e 2 dbNSFP3.5a.hg19.txt_2.gz

but I cannot understand why it stops in the middle of the fisrt step (cat dbNSFP3.5a_variant.chr). The process well works for a few chromosomes, then it stops with the writings "killed". I cannot understand why. It is to note that if I tryed to launch the only first step (cat dbNSFP3.5a_variant.chr > total_var) it works well and all the chromosomes are processed, but if I sequentially launch the second part of the script (/path/dbNSFP_sort.pl 7 8 dbNSFP3.5_var_tot > dbNSFP3.5_var_tot_sortHg19 it douesn't work (the error: Usage cat dbNSFP*.txt | dbNSFP_sort.pl chrCol posCol in attachment the sort.pl script

thank you dbNSFP_sort.txt

at7 commented 5 years ago

Hi @MartaRus, the script your are using isn't written by Ensembl. Please contact the author for any questions. Are you trying to tabix dbNSFP files for GRCh37? We have written some documentation for the required steps. Let me know if you have any questions.