Closed Alexander-Stuckey closed 1 year ago
Hi @Alexander-Stuckey,
Thanks for your query. The dbNSFP plugin can use '&' as a separator when multiple values are being returned. Can you please send me an input variant and the VEP command where you are seeing this behaviour, so that I can give you a more specific answer? Additionally, can you please tell me which version of dbNSFP you're using?
Kind Regards, Andrew
Hi @aparton
Here are a few variants that show this behaviour
Chr   pos      ref  alt  LoF  SIFT_score
chr1  1082927  G    T    HC   0.0&0.146&0.0&0.146
chr1  1179270  C    T    HC   .&.
chr1  1179270  C    T    HC   .&.
chr1  1179270  C    T    HC   .&.
chr1  1184981  A    T    HC   .&.&.
Using dbNSFP version: dbNSFP4.0a
The VEP command used to annotate these is pretty long, so I've copied all the flags below:
"--assembly GRCh38",
"--dir_cache /resources/data/vep.caches/helix/99",
"--cache_version 99",
"--verbose",
"--no_stats",
"--fasta /public_data_resources/reference/GRCh38/GRCh38Decoy_no_alt.fa",
"--ccds",
"--uniprot",
"--hgvs",
"--symbol",
"--numbers",
"--domains",
"--regulatory",
"--canonical",
"--protein",
"--biotype",
"--tsl",
"--appris",
"--gene_phenotype",
"--af",
"--af_1kg",
"--af_esp",
"--max_af",
"--pubmed",
"--variant_class",
"--mane",
"--overlaps",
"--plugin dbNSFP,/tools/apps/restricted_academic/software/bio/dbNSFP/dbNSFP4.0a.txt.gz,LRT_score,MutationTaster_score,SIFT_score,SIFT_converted_rankscore,SIFT_pred,SIFT4G_score,SIFT4G_converted_rankscore,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,REVEL_score,REVEL_rankscore,MutPred_score,MutPred_rankscore,MutPred_protID,PrimateAI_pred",
"--plugin CADD,/public_data_resources/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz,/public_data_resources/CADD/v1.5/GRCh38/InDels.tsv.gz",
"--plugin SpliceAI,snv=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.snv.hg38.vcf.gz,indel=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.indel.hg38.vcf.gz",
"--plugin SpliceRegion",
"--plugin LoF,loftee_path:/resources/tools/apps/software/bio/VEP/99.1-foss-2019a-Perl-5.28.1/Plugins/loftee-GRCh38,human_ancestor_fa:/public_data_resources/vep_resources/Build-38/human_ancestor.fa.gz,gerp_bigwig:/public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,conservation_file:/public_data_resources/vep_resources/Build-38/loftee.sql",
"--custom /public_data_resources/gnomad/v3/gnomad.genomes.r3.0.sites.vcf.bgz,gnomADg,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_sas,AF_fin,AF_nfe,AF_oth,AF_ami,AF_male,AF_female",
"--custom /public_data_resources/phylop100way/hg38.phyloP100way.bw,PhyloP,bigwig",
"--custom /public_data_resources/TOPMed/allele_frequencies/bravo-dbsnp-all.vcf.gz,topmedg,vcf,exact,0,AF,SVM",
"--custom /public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,GERP,bigwig",
"--fork 4",
"--compress_output bgzip"
Hi @Alexander-Stuckey,
Multiple scores correspond to the different Ensembl transcript IDs used by dbNSFP. You can check the transcript IDs by adding the Ensembl_transcriptid column to the plugin arguments: --plugin dbNSFP,/tools/apps/restricted_academic/software/bio/dbNSFP/dbNSFP4.0a.txt.gz,(...),PrimateAI_pred,Ensembl_transcriptid
Each score corresponds to a transcript ID; if there is more than one, the output will include one score for each transcript, separated by &. At the moment, there is no option in VEP to filter the scores. If you want to filter the results, you could do so after VEP annotation using your own criteria.
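As a rough illustration of the post-VEP filtering suggested above, the sketch below pairs each '&'-separated score with its transcript ID. It assumes that Ensembl_transcriptid and the score column contain the same number of '&'-separated entries in the same order, which should be checked against your own output. The function names and the transcript IDs in the usage note are hypothetical; they are not part of VEP or dbNSFP.

```python
from typing import Dict, List, Optional


def split_scores(value: str) -> List[Optional[float]]:
    """Split a '&'-joined dbNSFP field into floats; '.' (missing) becomes None."""
    return [None if v in (".", "") else float(v) for v in value.split("&")]


def scores_by_transcript(transcript_ids: str, scores: str) -> Dict[str, Optional[float]]:
    """Map each transcript ID to its score.

    Assumption: both fields list their entries in the same order and have
    the same length. A mismatch raises instead of silently mispairing.
    """
    ids = transcript_ids.split("&")
    vals = split_scores(scores)
    if len(ids) != len(vals):
        raise ValueError(f"{len(ids)} transcript IDs but {len(vals)} scores")
    return dict(zip(ids, vals))
```

For example, `scores_by_transcript("ENST_A&ENST_B&ENST_C&ENST_D", "0.0&0.146&0.0&0.146")` (placeholder IDs, SIFT_score taken from the first variant above) gives one float per transcript, and you can then keep only the entry for the transcript reported on that VEP output line.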
Hi, this is somewhat related to issue https://github.com/Ensembl/ensembl-vep/issues/1023
Hi @Alexander-Stuckey,
We added a new option to the dbNSFP plugin called transcript_match, which is available in the current release. This new option returns scores only for the matched Ensembl transcript ID. You could use it to get specific scores for each transcript ID.
Best wishes, Diana
Hi,
I had encountered the same issue, and "transcript_match" proved to be quite helpful for me as well, providing a single result for each transcript. However, this didn't seem to affect the results from the MutationTaster tool; they still appear to be multiple. I'm not sure what I can do to solve this problem.
Hi @mrymkdnz, As described here, MutationTaster entries are keyed on a different set of transcript IDs. Using the transcript_match flag with MutationTaster will return all results.
Information on corresponding transcript(s) for MutationTaster fields can be found using http://www.mutationtaster.org/ChrPos.html
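Where the transcript behind each value cannot be recovered, as with the MutationTaster fields, one option is to collapse the '&'-separated values into a single number after annotation. The sketch below is only one way to do that; the function name `collapse` is hypothetical, and which reducer makes sense (minimum, maximum, or something else) depends on the score and on your own criteria.

```python
from typing import Callable, Iterable, Optional


def collapse(value: str,
             reducer: Callable[[Iterable[float]], float] = max) -> Optional[float]:
    """Reduce a '&'-joined score field to a single float.

    Missing entries ('.') are ignored. Returns None when every entry is
    missing. The default reducer (max) is an arbitrary choice for this sketch.
    """
    nums = [float(v) for v in value.split("&") if v not in (".", "")]
    return reducer(nums) if nums else None
```

For instance, `collapse("0.0&0.146&0.0&0.146", min)` keeps the lowest value, while `collapse(".&.")` returns `None` so that variants with no score can be handled explicitly.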
"Looking up one variant at a time through a web app" isn't really workable at scale, and kind of defeats the purpose of using VEP for this at all. Unfortunately, the core issue here is on the dbNSFP side: the .tsv data file they provide doesn't distinguish which transcript each MutationTaster entry comes from.
Fortunately, I emailed Dr. Liu this morning and he said he's planning to try to make the MutationTaster scores transcript-specific in the next release of dbNSFP; so when that comes out, the issue should either be straightforwardly solvable or completely solve itself.
Hi,
I've noticed that when using dbNSFP to output prediction scores for PolyPhen and SIFT, it will output multiple '&'-separated scores (e.g. 1.0&1.0 or 1.0&.) for a single variant. This makes post-processing and filtering a bit harder.
Is there any general advice on how to handle this? Can I just discard one of the scores? Is there a flag I can specify to get the plugin to output floats instead of '&'-separated strings for a score?
Cheers, Alex