Ensembl / VEP_plugins

Plugins for the Ensembl Variant Effect Predictor (VEP)
Apache License 2.0

dbNSFP plugin outputs multiple '&' separated scores for (e.g.) SIFT. #421

Closed. Alexander-Stuckey closed this issue 1 year ago

Alexander-Stuckey commented 3 years ago

Hi,

I've noticed that when using dbNSFP to output prediction scores for PolyPhen and SIFT, it outputs multiple '&'-separated scores (e.g. 1.0&1.0 or 1.0&.) for a single variant. This makes post-processing and filtering a bit harder.

Is there any general advice on how to handle this? Can I just discard one of the scores? Is there a flag I can specify to make the plugin output a single float instead of an '&'-separated string for a score?

Cheers, Alex

aparton commented 3 years ago

Hi @Alexander-Stuckey,

Thanks for your query. The dbNSFP plugin can use '&' as a separator when multiple values are being returned. Can you please send me an input variant and the VEP command where you are seeing this behaviour, so I can give you a more specific answer? Additionally, can you please tell me which version of dbNSFP you're using?

Kind Regards, Andrew

Alexander-Stuckey commented 3 years ago

Hi @aparton

Here are a few variants that show this behaviour

Chr   pos      ref  alt  LoF  SIFT_score
chr1  1082927  G    T    HC   0.0&0.146&0.0&0.146
chr1  1179270  C    T    HC   .&.
chr1  1179270  C    T    HC   .&.
chr1  1179270  C    T    HC   .&.
chr1  1184981  A    T    HC   .&.&.

Using dbNSFP version: dbNSFP4.0a

The VEP command used to annotate these is pretty long, so I've copied all the flags below:

"--assembly GRCh38",
"--dir_cache /resources/data/vep.caches/helix/99",
"--cache_version 99",
"--verbose",
"--no_stats",
"--fasta /public_data_resources/reference/GRCh38/GRCh38Decoy_no_alt.fa",
"--ccds",
"--uniprot",
"--hgvs",
"--symbol",
"--numbers",
"--domains",
"--regulatory",
"--canonical",
"--protein",
"--biotype",
"--tsl",
"--appris",
"--gene_phenotype",
"--af",
"--af_1kg",
"--af_esp",
"--max_af",
"--pubmed",
"--variant_class",
"--mane",
"--overlaps",
"--plugin dbNSFP,/tools/apps/restricted_academic/software/bio/dbNSFP/dbNSFP4.0a.txt.gz,LRT_score,MutationTaster_score,SIFT_score,SIFT_converted_rankscore,SIFT_pred,SIFT4G_score,SIFT4G_converted_rankscore,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,REVEL_score,REVEL_rankscore,MutPred_score,MutPred_rankscore,MutPred_protID,PrimateAI_pred",
"--plugin CADD,/public_data_resources/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz,/public_data_resources/CADD/v1.5/GRCh38/InDels.tsv.gz",
"--plugin SpliceAI,snv=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.snv.hg38.vcf.gz,indel=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.indel.hg38.vcf.gz",
"--plugin SpliceRegion",
"--plugin LoF,loftee_path:/resources/tools/apps/software/bio/VEP/99.1-foss-2019a-Perl-5.28.1/Plugins/loftee-GRCh38,human_ancestor_fa:/public_data_resources/vep_resources/Build-38/human_ancestor.fa.gz,gerp_bigwig:/public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,conservation_file:/public_data_resources/vep_resources/Build-38/loftee.sql",
"--custom /public_data_resources/gnomad/v3/gnomad.genomes.r3.0.sites.vcf.bgz,gnomADg,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_sas,AF_fin,AF_nfe,AF_oth,AF_ami,AF_male,AF_female",
"--custom /public_data_resources/phylop100way/hg38.phyloP100way.bw,PhyloP,bigwig",
"--custom /public_data_resources/TOPMed/allele_frequencies/bravo-dbsnp-all.vcf.gz,topmedg,vcf,exact,0,AF,SVM",
"--custom /public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,GERP,bigwig",
"--fork 4",
"--compress_output bgzip"

dglemos commented 3 years ago

Hi @Alexander-Stuckey, Multiple scores correspond to the different Ensembl transcript IDs used by dbNSFP. You can check the transcript IDs by adding the Ensembl_transcriptid field to the plugin's column list: --plugin dbNSFP,/tools/apps/restricted_academic/software/bio/dbNSFP/dbNSFP4.0a.txt.gz,(...),PrimateAI_pred,Ensembl_transcriptid

Each score corresponds to a transcript ID; if there is more than one, the output includes one score per transcript, separated by '&'. At the moment, there is no option in VEP to filter the scores. If you want to filter the results, you could do so after VEP annotation using your own criteria.
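
For anyone post-processing such output, here is a minimal Python sketch of the "own criteria" approach described above. It is not part of VEP or the dbNSFP plugin. The function names `parse_scores` and `collapse` are made up for illustration, and the default of keeping the minimum is only an assumption that suits SIFT (where lower scores are more damaging); for a PolyPhen score you would more likely pass `max`.

```python
from typing import Callable, List, Optional


def parse_scores(field: str, sep: str = "&") -> List[Optional[float]]:
    """Split a field such as '0.0&0.146&.' into floats; '.' (missing) becomes None."""
    values: List[Optional[float]] = []
    for token in field.split(sep):
        token = token.strip()
        values.append(None if token in ("", ".") else float(token))
    return values


def collapse(field: str,
             pick: Callable[[List[float]], float] = min) -> Optional[float]:
    """Reduce a multi-valued field to one number using a caller-chosen criterion.

    Returns None when every entry is missing. `pick` is an assumption of this
    sketch: min for SIFT-like scores, max for PolyPhen-like scores.
    """
    present = [v for v in parse_scores(field) if v is not None]
    return pick(present) if present else None
```

With the values from this thread, `collapse("0.0&0.146&0.0&0.146")` keeps 0.0 and `collapse(".&.")` returns None. Note that collapsing discards the link between a score and its transcript, so it only makes sense if a per-variant summary is what you need.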

davmlaw commented 2 years ago

Hi, this is somewhat related to issue https://github.com/Ensembl/ensembl-vep/issues/1023

dglemos commented 1 year ago

Hi @Alexander-Stuckey, We added a new option to the dbNSFP plugin called transcript_match which is available in the current release. This new option returns scores only for the matched Ensembl transcript ID. You could use it to get specific scores for each transcript ID.

Best wishes, Diana
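
For files that were already annotated without transcript_match, a similar per-transcript selection can be approximated afterwards, provided Ensembl_transcriptid was requested alongside the score. The sketch below is not the plugin's implementation. It assumes the two fields list their entries in the same order (as described earlier in this thread), that the transcript ID of the VEP consequence line is available, and that any version suffix such as ".5" should be ignored. The helper names and the transcript IDs in the usage note are hypothetical.

```python
from typing import Dict, Optional


def scores_by_transcript(transcripts: str, scores: str,
                         sep: str = "&") -> Dict[str, Optional[float]]:
    """Pair each '&'-separated score with the transcript ID at the same position."""
    ids = transcripts.split(sep)
    vals = scores.split(sep)
    if len(ids) != len(vals):
        # Fields keyed on a different transcript set cannot be paired this way.
        raise ValueError(f"{len(ids)} transcript IDs but {len(vals)} scores")
    return {tid: (None if v in ("", ".") else float(v))
            for tid, v in zip(ids, vals)}


def score_for(feature: str, transcripts: str, scores: str) -> Optional[float]:
    """Return the score for one transcript, or None if absent or missing."""
    key = feature.split(".")[0]  # assumption: drop a version suffix like '.5'
    return scores_by_transcript(transcripts, scores).get(key)
```

For example, `score_for("ENST00000000002", "ENST00000000001&ENST00000000002", "0.0&0.146")` selects 0.146 (the IDs are placeholders). The length check is deliberate: a field whose entries do not line up with Ensembl_transcriptid raises an error instead of being silently mispaired.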

mrymkdnz commented 1 year ago

Hi,

I encountered the same issue, and "transcript_match" proved quite helpful for me as well, providing a single result for each transcript. However, it didn't seem to affect the results from the MutationTaster tool, which still contain multiple values. I'm not sure what I can do to solve this problem.

dglemos commented 1 year ago

Hi @mrymkdnz, As described here, MutationTaster entries are keyed on a different set of transcript IDs, so using the transcript_match option with MutationTaster fields will still return all results.

dvg-p4 commented 2 months ago

Information on corresponding transcript(s) for MutationTaster fields can be found using http://www.mutationtaster.org/ChrPos.html

"Looking up one variant at a time through a web app" isn't really workable at scale, and kind of defeats the purpose of using VEP for this at all. Unfortunately, the core issue here is on the dbNSFP side: the .tsv data file they provide doesn't distinguish which transcript each MutationTaster entry comes from.

Fortunately, I emailed Dr. Liu this morning and he said he's planning to try to make the MutationTaster scores transcript-specific in the next release of dbNSFP; so when that comes out, the issue should either be straightforwardly solvable or completely solve itself.