Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
449 stars 151 forks source link

dbNSFP plugin not getting results for specified columns #1649

Closed aruta321 closed 5 months ago

aruta321 commented 6 months ago

I am trying to run VEP with the dbNSFP plugin. I downloaded v4.3 of the dbNSFP database and extracted the hg19 information as described in the vep plugins page. I split my inputs by chromosome and I'm running as a slurm array to decrease processing time.

Here is the code that I am using to run VEP:

#!/bin/bash
#SBATCH -t 4-0:0
#SBATCH -c 4
#SBATCH --mem-per-cpu 4G
#SBATCH --array=1-24
#SBATCH --output=Array.%A_%a.out

file_chr=${SLURM_ARRAY_TASK_ID}
if [ "$file_chr" -eq "23" ]; then
        file_chr="X"
fi

if [ "$file_chr" -eq "24" ]; then
        file_chr="Y"
fi

vep -i "all_variants_${file_chr}.vcf" -o "all_variants_vep_${file_chr}.tsv" --format "vcf" --tab --database --/numbers --species "human" --grch37 --force_overwrite --verbose --fork 4 --uniprot --symbol --hgvs --protein --mane --show_ref_allele --gene_phenotype --clin_sig_allele 1 -af --exclude_null_alleles --no_check_alleles --dir_plugins "vep/ensembl-vep/cache/Plugins/" --plugin dbNSFP,"dbNSFP4.3a/dbNSFPv4.3a_hg19.gz",SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,MetaSVM_score,REVEL_score,MVP_score,MPC_score,PrimateAI_score,ClinPred_score,CADD_phred_hg19,GERP++_RS,phyloP100way_vertebrate

VEP version is 107.

This is a portion of the output I get for chromosome 1.

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

Gene | Feature | Feature_type | Consequence | cDNA_position | CDS_position | Protein_position | Amino_acids | Codons | Existing_variation | REF_ALLELE | IMPACT | DISTANCE | STRAND | FLAGS | SYMBOL | SYMBOL_SOURCE | HGNC_ID | MANE_SELECT | MANE_PLUS_CLINICAL | ENSP | SWISSPROT | TREMBL | UNIPARC | UNIPROT_ISOFORM | GENE_PHENO | EXON | INTRON | HGVSc | HGVSp | HGVS_OFFSET | AF | CLIN_SIG | SOMATIC | PHENO | CADD_phred_hg19 | ClinPred_score | GERP++_RS | MPC_score | MVP_score | MetaSVM_score | Polyphen2_HDIV_score | PrimateAI_score | REVEL_score | SIFT4G_score | SIFT_score | phyloP100way_vertebrate -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- ENSG00000188157 | ENST00000379370 | Transcript | missense_variant | 55 | 5 | 2 | A/D | gCc/gAc | rs776698665 | C | MODERATE | - | 1 | - | AGRN | HGNC | 329 | - | - | ENSP00000368678 | AGRIN_HUMAN | Q5XG79_HUMAN | UPI00001D7C8B | - | 1 | Jan-36 | - | ENST00000379370.2:c.5C>A | ENSP00000368678.2:p.Ala2Asp | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000379370 | Transcript | missense_variant | 2155 | 2105 | 702 | P/L | cCg/cTg | rs149814455 | C | MODERATE | - | 1 | - | AGRN | HGNC | 329 | - | - | ENSP00000368678 | AGRIN_HUMAN | Q5XG79_HUMAN | UPI00001D7C8B | - | 1 | Nov-36 | - | ENST00000379370.2:c.2105C>T | ENSP00000368678.2:p.Pro702Leu | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000466223 | Transcript | upstream_gene_variant | - | - | - | - | - | rs149814455 | C | MODIFIER | 2987 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000469403 | Transcript | downstream_gene_variant | - | - | - | - | - | rs149814455 | C | MODIFIER | 2817 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000477585 | Transcript | downstream_gene_variant | - | - | - | - | - | rs149814455 | C | MODIFIER | 3489 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000478677 | Transcript | upstream_gene_variant | - | - | - | - | - | rs149814455 | C | MODIFIER | 3261 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000479707 | Transcript | upstream_gene_variant | - | - | - | - | - | rs149814455 | C | MODIFIER | 1185 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000492947 | Transcript | upstream_gene_variant | - | - | - | - | - | rs149814455 | C | MODIFIER | 4315 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000379370 | Transcript | missense_variant | 3582 | 3532 | 1178 | R/W | Cgg/Tgg | rs149268246 | C | MODERATE | - | 1 | - | AGRN | HGNC | 329 | - | - | ENSP00000368678 | AGRIN_HUMAN | Q5XG79_HUMAN | UPI00001D7C8B | - | 1 | 21/36 | - | ENST00000379370.2:c.3532C>T | ENSP00000368678.2:p.Arg1178Trp | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000419249 | Transcript | upstream_gene_variant | - | - | - | - | - | rs149268246 | C | MODIFIER | 2968 | 1 | cds_start_NF,cds_end_NF | AGRN | HGNC | 329 | - | - | ENSP00000400771 | - | - | UPI000059CF46 | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000461111 | Transcript | upstream_gene_variant | - | - | - | - | - | rs149268246 | C | MODIFIER | 4398 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000466223 | Transcript | non_coding_transcript_exon_variant | 270 | - | - | - | - | rs149268246 | C | MODIFIER | - | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | 3-Feb | - | ENST00000466223.1:n.270C>T | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000478677 | Transcript | non_coding_transcript_exon_variant | 114 | - | - | - | - | rs149268246 | C | MODIFIER | - | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | 2-Jan | - | ENST00000478677.1:n.114C>T | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000479707 | Transcript | downstream_gene_variant | - | - | - | - | - | rs149268246 | C | MODIFIER | 1239 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000492947 | Transcript | upstream_gene_variant | - | - | - | - | - | rs149268246 | C | MODIFIER | 941 | 1 | - | AGRN | HGNC | 329 | - | - | - | - | - | - | - | 1 | - | - | - | - | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - ENSG00000188157 | ENST00000379370 | Transcript | missense_variant | 3768 | 3718 | 1240 | P/S | Ccg/Tcg | rs768688185 | C | MODERATE | - | 1 | - | AGRN | HGNC | 329 | - | - | ENSP00000368678 | AGRIN_HUMAN | Q5XG79_HUMAN | UPI00001D7C8B | - | 1 | 22/36 | - | ENST00000379370.2:c.3718C>T | ENSP00000368678.2:p.Pro1240Ser | - | - | uncertain_significance | - | 1 | - | - | - | - | - | - | - | - | - | - | - | -

All of the columns I specify for dbNSFP are empty in the output, but have contents in the dbNSFP file.

I searched for a couple of these variants using: tabix dbNSFP4.3a/dbNSFPv4.3a_hg19.gz 1:955556-955557 | awk -F'\t' 'BEGIN{OFS=FS} {print $8, $9, $15 $123, $106, $156, $92, $90, $70, $44, $94, $83, $41, $38, $158}' -

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

hg19_chr | hg19_pos(1-based) | Ensembl_transcriptid | CADD_phred_hg19 | ClinPred_score | GERP++_RS | MPC_score | MVP_score | MetaSVM_score | Polyphen2_HDIV_score | PrimateAI_score | REVEL_score | SIFT4G_score | SIFT_score | phyloP100way_vertebrate -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 1 | 955556 | ENST00000379370 | 11.89 | 0.367794 | 1.06 | 0.415845 | 0.668191 | -0.8207 | . | 0.856078 | 0.236 | 0 | 0 | 0.358 1 | 955556 | ENST00000379370 | 13.72 | 0.290411 | 1.06 | 0.561226 | 0.718105 | -0.7761 | . | 0.87069 | 0.264 | 0 | 0 | 0.358 1 | 955556 | ENST00000379370 | 10.28 | 0.506939 | 1.06 | 0.39126 | 0.670494 | -0.8484 | . | 0.839636 | 0.21 | 0 | 0 | 0.358 1 | 955557 | ENST00000379370 | 12.86 | 0.078814 | 1.03 | 0.650004 | 0.750006 | -0.7943 | . | 0.884505 | 0.165 | 0 | 0 | 0.091 1 | 955557 | ENST00000379370 | 12.43 | 0.175077 | 1.03 | 0.433029 | 0.681611 | -0.8573 | . | 0.847095 | 0.213 | 0 | 0 | 0.091 1 | 955557 | ENST00000379370 | 13.09 | 0.426675 | 1.03 | 0.42837 | 0.703744 | -0.8858 | . | 0.859409 | 0.141 | 0 | 0 | 0.091 1 | 979593 | ENST00000379370;ENST00000620552 | 21.8 | 0.847185 | 3.65 | 0.546086258435;. | 0.678861936705;0.678861936705 | -1.0463 | .;. | 0.355294 | 0.098;. | 0.574;0.652 | 0.191;. | 3.911 1 | 979593 | ENST00000379370;ENST00000620552 | 21.9 | 0.559879 | 3.65 | 0.480733888797;. | 0.695163590071;0.695163590071 | -1.0625 | .;. | 0.337997 | 0.066;. | 0.284;0.314 | 0.192;. | 3.911 1 | 979593 | ENST00000379370;ENST00000620552 | 22.3 | 0.62114 | 3.65 | 0.518356865667;. | 0.694585858326;0.694585858326 | -1.064 | .;. | 0.330193 | 0.071;. | 0.503;0.566 | 0.225;. | 3.911 1 | 979594 | ENST00000379370;ENST00000620552 | 22.6 | 0.227134 | 3.63 | 0.540742379595;. | 0.700170531623;0.700170531623 | -1.037 | .;. | 0.33509 | 0.117;. | 0.219;0.241 | 0.123;. | 4.085 1 | 979594 | ENST00000379370;ENST00000620552 | 16.75 | 0.161269 | 3.63 | 0.577386553836;. | 0.573986203122;0.573986203122 | -1.0335 | .;. | 0.375559 | 0.052;. | 0.821;0.846 | 0.382;. | 4.085 1 | 979594 | ENST00000379370;ENST00000620552 | 19.8 | 0.167297 | 3.63 | 0.539180951264;. | 0.613412721671;0.613412721671 | -1.0792 | .;. | 0.377995 | 0.118;. | 0.101;0.112 | 0.222;. | 4.085

This is true for most of the chromosomes, but some, like chromosome 9, have accurate results.

Here is the error log for my runs: Use of uninitialized value $taxonomy_db in concatenation (.) or string at .../VEP/107-GCC-11.3.0/modules/api/Bio/EnsEMBL/Registry.pm line 2355. Will only load v107 databases Species 'homo_sapiens' loaded from database 'homo_sapiens_core_107_37' Species 'homo_sapiens' loaded from database 'homo_sapiens_cdna_107_37' Species 'homo_sapiens' loaded from database 'homo_sapiens_otherfeatures_107_37' Species 'homo_sapiens' loaded from database 'homo_sapiens_rnaseq_107_37' homo_sapiens_variation_107_37 loaded homo_sapiens_funcgen_107_37 loaded Bio::EnsEMBL::Compara::DBSQL::DBAdaptor not found so the following compara databases will be ignored: ensembl_compara_107 No ancestral database found ensembl_ontology_107 loaded ensembl_taxonomy API not found - ignoring No ensembl_metadata database found No production database or adaptor found ensembl_stable_ids_107 loaded Use of uninitialized value $ref in join or string at .../VEP/107-GCC-11.3.0/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 235. Use of uninitialized value in split at .../VEP/107-GCC-11.3.0/modules/api/Bio/EnsEMBL/IO/Parser/BaseVCF4.pm line 446. Use of uninitialized value $ref in subtraction (-) at .../VEP/107-GCC-11.3.0/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 252. Use of uninitialized value in string eq at .../VEP/107-GCC-11.3.0/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 257. Use of uninitialized value $ref in concatenation (.) or string at .../VEP/107-GCC-11.3.0/modules/Bio/EnsEMBL/VEP/Parser/VCF.pm line 337. Use of uninitialized value in sprintf at ...//VEP/107-GCC-11.3.0/modules/Bio/EnsEMBL/VEP/Parser.pm line 526. Use of uninitialized value in concatenation (.) or string at .../VEP/107-GCC-11.3.0/modules/Bio/EnsEMBL/VEP/Parser.pm line 535. WARNING: Start or end -1 coordinate invalid on line 1

dglemos commented 6 months ago

Hi @aruta321, Can you please send a sample of your input VCF? There are a few warnings probably caused by the input file format.

aruta321 commented 6 months ago

Hi @dglemos,

I found a couple problems in my input VCF file. Pandas was adding tabs onto the end of the header and added a couple extra quotes to a line. I reran VEP after correcting those problems and got the same output. Most of the errors and warnings are gone now. Only "Use of uninitialized value $taxonomy_db in concatenation (.) or string at .../VEP/107-GCC-11.3.0/modules/api/Bio/EnsEMBL/Registry.pm line 2355." remains. I've attached an example of the VCF after fixing the issues above. example_vcf.txt

dglemos commented 6 months ago

Thanks for sending the file. I tried to reproduce the issue but unfortunately I couldn't. This is my output for the variant 1 955557 C A:

1_955557_C/A 1:955557 A ENSG00000188157 ENST00000379370 Transcript missense_variant 55 5 2 A/D gCc/gAc - IMPACT=MODERATE;STRAND=1;GENE_PHENO=1;CADD_phred_hg19=12.86;ClinPred_score=0.0788140874515326;GERP++_RS=1.03;MPC_score=0.650003792899;MVP_score=0.750005655083;MetaSVM_score=-0.7943;PrimateAI_score=0.88450473547;REVEL_score=0.165;SIFT4G_score=0.0;SIFT_score=0.0;phyloP100way_vertebrate=0.091000

Can you please run VEP only for this variant with the following options: vep -i input_variant.vcf -o output.tsv --format "vcf" --database --grch37 --force_overwrite --verbose --dir_plugins "vep/ensembl-vep/cache/Plugins/" --plugin dbNSFP,"dbNSFP4.3a/dbNSFPv4.3a_hg19.gz",SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,MetaSVM_score,REVEL_score,MVP_score,MPC_score,PrimateAI_score,ClinPred_score,CADD_phred_hg19,GERP++_RS,phyloP100way_vertebrate

aruta321 commented 6 months ago

Here is the output for the single variant. It doesn't even have the columns for dbNSFP data. mock_vep_tsv.txt

dglemos commented 5 months ago

Can you print the result of the command tabix dbNSFP4.3a/dbNSFPv4.3a_hg19.gz 1:955556-955557

aruta321 commented 5 months ago

Here is the output for that command. tabix_output.txt

dglemos commented 5 months ago

The file looks correct. I can't find a reason why the output doesn't include the dbNSFP values. You mentioned the results are correct for some chromosomes. Are the results consistent? If you run the same jobs several times, the missing values are always for the same chromosomes?

aruta321 commented 5 months ago

I ran VEP 3 times and had a colleague run it with the same script once. They produced the same files in each instance. The CLIN_SIG column would sometimes switch the order of the classifications like "benign, pathogenic" becoming "pathogenic, benign." There weren't any content differences between the runs.

Here are two partial results from the latest run for chromosomes 1 and 19. Chromosome 1 still has no dbNSFP information. Chromosome 19 has some records but is missing other information present in the database. Chromosome 9, 11, 18, and 19 have partial results while the rest have no results.

mock_chr1_tsv.txt mock_chr19_tsv.txt

dglemos commented 5 months ago

Try again running vep only for the variant 1 955557 C A with the following command:

vep -i input_variant.vcf -o output.tsv --format "vcf" --database --grch37 --force_overwrite --verbose --safe --dir_plugins "vep/ensembl-vep/cache/Plugins/" --plugin dbNSFP,"dbNSFP4.3a/dbNSFPv4.3a_hg19.gz",pep_match=0,SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,MetaSVM_score,REVEL_score,MVP_score,MPC_score,PrimateAI_score,ClinPred_score,CADD_phred_hg19,GERP++_RS,phyloP100way_vertebrate

aruta321 commented 5 months ago

The command you gave resulted in this error "MSG: Failed to instantiate plugin dbNSFP: ERROR: Column pep_match=0 not found in header for file." I checked the dbNSFP file and it doesnt have a pep_match column. Then I tried taking out "pep_match=0" and running VEP again. It gave the same result as the mock_vep_tsv.txt file above.

dglemos commented 5 months ago

The pep_match is an option to match only on the first position and the alt allele (documentation) This option is available in release 107 which is the version you are using. Is it correct? If so, can you confirm your code is the same as this file dbNSFP.pm

aruta321 commented 5 months ago

This was the issue. The plugin version I was using was prerelease 90 I believe. I only downloaded the dbNSFP database, so I didn't realize the discrepancy. Is there a way to identify the plugin version from the file or by running VEP?

dglemos commented 5 months ago

I'm glad you found the root of the issue! The only place where you could check the version is in VEP_plugins/plugin_config.txt. The plugin_url includes the release version.

aruta321 commented 5 months ago

Thanks for your help!