Open cgpreston opened 1 year ago
@cgpreston, there doesn't seem to be a simple way to get the REVEL score data from the Ensembl VEP API. We are going to ask Baylor if they can get the data for us, or if they can give me access so I can write an API endpoint into the LDH.
We discussed this in our team meeting. I am going to investigate myvariant.info's dbNSFP parser code and update https://github.com/biothings/myvariant.info/issues/179.
I spent some time this afternoon investigating myvariant.info's dbNSFP parser code.
I forked myvariant.info's GitHub repo, and based on src/hub/dataload/sources/dbnsfp/dbnsfp_43a.py I concluded they are using version 4.3a of dbNSFP. (There is a newer version of dbNSFP. It was released on November 5th.) I read the code that parses REVEL score data. Nothing seemed obviously wrong, so I downloaded dbNSFP to my computer. I wanted to figure out if their parsing code was written correctly.
Here are the decompressed contents of the dbNSFP archive:
.
├── LICENSE.txt
├── dbNSFP4.3_gene.complete.gz
├── dbNSFP4.3_gene.gz
├── dbNSFP4.3a.readme.txt
├── dbNSFP4.3a_variant.chr1.gz
├── dbNSFP4.3a_variant.chr10.gz
├── dbNSFP4.3a_variant.chr11.gz
├── dbNSFP4.3a_variant.chr12.gz
├── dbNSFP4.3a_variant.chr13.gz
├── dbNSFP4.3a_variant.chr14.gz
├── dbNSFP4.3a_variant.chr15.gz
├── dbNSFP4.3a_variant.chr16.gz
├── dbNSFP4.3a_variant.chr17.gz
├── dbNSFP4.3a_variant.chr18.gz
├── dbNSFP4.3a_variant.chr19.gz
├── dbNSFP4.3a_variant.chr2.gz
├── dbNSFP4.3a_variant.chr20.gz
├── dbNSFP4.3a_variant.chr21.gz
├── dbNSFP4.3a_variant.chr22.gz
├── dbNSFP4.3a_variant.chr3.gz
├── dbNSFP4.3a_variant.chr4.gz
├── dbNSFP4.3a_variant.chr5.gz
├── dbNSFP4.3a_variant.chr6.gz
├── dbNSFP4.3a_variant.chr7.gz
├── dbNSFP4.3a_variant.chr8.gz
├── dbNSFP4.3a_variant.chr9.gz
├── dbNSFP4.3a_variant.chrM.gz
├── dbNSFP4.3a_variant.chrX.gz
├── dbNSFP4.3a_variant.chrY.gz
├── search_dbNSFP43a.class
├── search_dbNSFP43a.jar
├── search_dbNSFP43a.readme.pdf
├── try.vcf
├── tryhg18.in
├── tryhg19.in
└── tryhg38.in
1 directory, 36 files
Each of the dbNSFP4.3a_variant.chr#.gz files is a gzip-compressed, tab-separated values file. I decompressed dbNSFP4.3a_variant.chrX.gz so I could search through it for hg_19pos 152959399. Based on the screenshot @cgpreston provided in https://github.com/biothings/myvariant.info/issues/179, hg_19pos 152959399 should have multiple REVEL scores associated with it.
Here's what I found:
> rg --count "152959399" dbNSFP4.3a_variant.chrX
5
So there are five lines in the file that have "152959399" in them. I searched the output of the initial search for the REVEL scores @cgpreston shows in her screenshot:
> rg "152959399" dbNSFP4.3a_variant.chrX | rg --count "0\.173"
1
> rg "152959399" dbNSFP4.3a_variant.chrX | rg --count "0\.109"
> rg "152959399" dbNSFP4.3a_variant.chrX | rg --count "0\.653"
2
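The ripgrep pipeline above can also be approximated in Python without decompressing the archive first. This is a minimal sketch, not something from the thread: the function name `count_matching_lines` is hypothetical, and like `rg` it does plain substring matching, so a needle such as "152959399" could also match digits in an unrelated column.

```python
import gzip


def count_matching_lines(path, *needles):
    """Count lines in a gzip-compressed text file that contain every needle.

    Hypothetical helper: plain substring matching, similar in spirit to
    piping one `rg` search into `rg --count`.
    """
    count = 0
    # "rt" opens the gzip stream in text mode so we can iterate line by line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if all(needle in line for needle in needles):
                count += 1
    return count
```

Calling `count_matching_lines("dbNSFP4.3a_variant.chrX.gz", "152959399", "0.173")` would then play the role of the second command above. Unlike `rg --count`, which prints nothing when there are no matches, this returns 0.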
Some observations:
I wrote a script that prints columns instead of rows. Here are the REVEL_score and REVEL_rankscore values for hg_19pos 152959399:
('REVEL_score', '0.173', '0.653;0.653', '0.177', '0.653;0.653', '0.606;0.606')
('REVEL_rankscore', '0.43840', '0.86936', '0.44549', '0.86936', '0.84506')
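The script itself was not posted, so the following is only a guess at its shape: a sketch that keeps the rows for one position and prints the chosen columns as tuples, in the same layout as the output above. The function names are hypothetical, and the position column name `hg19_pos(1-based)` is an assumption about the file's header line that should be checked against dbNSFP4.3a.readme.txt. Treating "." as a missing value in `split_scores` is likewise an assumption.

```python
import csv


def columns_for_position(lines, position, wanted, pos_column="hg19_pos(1-based)"):
    """Return one tuple per wanted column: (name, value_row1, value_row2, ...).

    `lines` is any iterable of tab-separated text lines whose first line is
    the header. `pos_column` is an assumed header name, not confirmed here.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    pos_index = header.index(pos_column)
    # Exact match on the position column, rather than a substring search.
    rows = [row for row in reader if row[pos_index] == position]
    result = []
    for name in wanted:
        idx = header.index(name)
        result.append((name, *(row[idx] for row in rows)))
    return result


def split_scores(value):
    """Split a semicolon-separated score field into floats.

    Assumption: "." marks a missing value and is skipped.
    """
    return [float(part) for part in value.split(";") if part not in ("", ".")]
```

With the decompressed file, something like `for column in columns_for_position(open("dbNSFP4.3a_variant.chrX"), "152959399", ["REVEL_score", "REVEL_rankscore"]): print(column)` would print one tuple per column. Matching on the position column exactly avoids the false positives a whole-line substring search can produce.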
I initially planned on using https://github.com/ClinGen/gci-vci-aws/issues/1378 to track the work for this issue, but I think it makes more sense to track the work here.
Okay, I've updated https://github.com/biothings/myvariant.info/issues/179#issuecomment-1843628916. Hopefully it is useful to them. I am going to move on to other tickets for now.
@liammulh: when you're back next week, can we chat about a new approach to this project? Thanks
Now that we're bringing the data in via the LDH, switch this project to Bryan so he can look into these data in the LDH.
There is an issue with the REVEL scores in the VCI (see #348 for details; as of Nov 2023 we display a warning for curators).
Going forward we should attempt the following:
Mockups: https://docs.google.com/presentation/d/1qYZzqlJmdWvsRrYILwnJw4u1zbKFtRgIeZ6oWPox-fY/edit#slide=id.g2dd3f09f5b8_0_0
We also need to update the footer with the provenance info - I've reached out to Neethu about the language.
SP ticket: https://broadinstitute.atlassian.net/browse/CGSP-654