ClinGen / gene-and-variant-curation-tools

ClinGen's gene and variant curation interfaces (GCI & VCI). Developed by Stanford ClinGen team.
https://curation.clinicalgenome.org/
MIT License

Update REVEL scores in VCI #350

Open cgpreston opened 1 year ago

cgpreston commented 1 year ago

There is an issue with the REVEL scores in the VCI (see #348 for details; as of Nov 2023 we display a warning for curators).

Going forward we should attempt the following:

  1. Work on a REVEL data plan for the LDH team

Mockups: https://docs.google.com/presentation/d/1qYZzqlJmdWvsRrYILwnJw4u1zbKFtRgIeZ6oWPox-fY/edit#slide=id.g2dd3f09f5b8_0_0

We also need to update the footer with the provenance info - I've reached out to Neethu about the language.

SP ticket: https://broadinstitute.atlassian.net/browse/CGSP-654

liammulh commented 11 months ago

@cgpreston, there doesn't seem to be a simple way to get the REVEL score data from the Ensembl VEP API. We are going to ask Baylor if they can get the data for us, or if they can give me access so I can write an API endpoint into the LDH.

liammulh commented 11 months ago

We discussed this in the team meeting. I am going to investigate myvariant.info's dbNSFP parser code and update https://github.com/biothings/myvariant.info/issues/179.

liammulh commented 11 months ago

I spent some time this afternoon investigating myvariant.info's dbNSFP parser code.

I forked myvariant.info's GitHub repo, and based on src/hub/dataload/sources/dbnsfp/dbnsfp_43a.py I concluded they are using version 4.3a of dbNSFP. (There is a newer version of dbNSFP. It was released on November 5th.) I read the code that parses REVEL score data. Nothing seemed obviously wrong, so I downloaded dbNSFP to my computer. I wanted to figure out if their parsing code was written correctly.

Here are the decompressed contents of the dbNSFP archive:

.
├── LICENSE.txt
├── dbNSFP4.3_gene.complete.gz
├── dbNSFP4.3_gene.gz
├── dbNSFP4.3a.readme.txt
├── dbNSFP4.3a_variant.chr1.gz
├── dbNSFP4.3a_variant.chr10.gz
├── dbNSFP4.3a_variant.chr11.gz
├── dbNSFP4.3a_variant.chr12.gz
├── dbNSFP4.3a_variant.chr13.gz
├── dbNSFP4.3a_variant.chr14.gz
├── dbNSFP4.3a_variant.chr15.gz
├── dbNSFP4.3a_variant.chr16.gz
├── dbNSFP4.3a_variant.chr17.gz
├── dbNSFP4.3a_variant.chr18.gz
├── dbNSFP4.3a_variant.chr19.gz
├── dbNSFP4.3a_variant.chr2.gz
├── dbNSFP4.3a_variant.chr20.gz
├── dbNSFP4.3a_variant.chr21.gz
├── dbNSFP4.3a_variant.chr22.gz
├── dbNSFP4.3a_variant.chr3.gz
├── dbNSFP4.3a_variant.chr4.gz
├── dbNSFP4.3a_variant.chr5.gz
├── dbNSFP4.3a_variant.chr6.gz
├── dbNSFP4.3a_variant.chr7.gz
├── dbNSFP4.3a_variant.chr8.gz
├── dbNSFP4.3a_variant.chr9.gz
├── dbNSFP4.3a_variant.chrM.gz
├── dbNSFP4.3a_variant.chrX.gz
├── dbNSFP4.3a_variant.chrY.gz
├── search_dbNSFP43a.class
├── search_dbNSFP43a.jar
├── search_dbNSFP43a.readme.pdf
├── try.vcf
├── tryhg18.in
├── tryhg19.in
└── tryhg38.in

1 directory, 36 files

Each of the dbNSFP4.3a_variant.chr#.gz files is a tab-separated values file. I decompressed dbNSFP4.3a_variant.chrX.gz so I could search it for hg19_pos 152959399. Based on the screenshot @cgpreston provided in https://github.com/biothings/myvariant.info/issues/179, hg19_pos 152959399 should have multiple REVEL scores associated with it:

[screenshot: scores]

Here's what I found:

> rg --count "152959399" dbNSFP4.3a_variant.chrX
5

So there are five lines in the file that have "152959399" in them. I searched the output of the initial search for the REVEL scores @cgpreston shows in her screenshot:

> rg "152959399" dbNSFP4.3a_variant.chrX | rg --count "0\.173"
1
> rg "152959399" dbNSFP4.3a_variant.chrX | rg --count "0\.109"
> rg "152959399" dbNSFP4.3a_variant.chrX | rg --count "0\.653"
2
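
For readers without ripgrep, the two-stage search above can be approximated in Python. This is only a sketch, not the commands that were actually run: `count_matches` and the three sample lines are made up for illustration, and a real run would read the decompressed chrX file instead. Like the `rg` pipeline, it matches plain substrings, so digits in unrelated columns can also be counted.

```python
def count_matches(lines, position, score):
    """Count lines containing `position`, and how many of those also contain `score`.

    Mirrors `rg POSITION file | rg --count SCORE` using plain substring tests.
    """
    position_hits = [line for line in lines if position in line]
    score_hits = [line for line in position_hits if score in line]
    return len(position_hits), len(score_hits)


# Made-up sample lines standing in for the decompressed chrX file (assumption).
SAMPLE_LINES = [
    "X\t152959399\t0.173\n",
    "X\t152959399\t0.653;0.653\n",
    "X\t152959400\t0.109\n",
]

if __name__ == "__main__":
    for score in ("0.173", "0.109", "0.653"):
        total, with_score = count_matches(SAMPLE_LINES, "152959399", score)
        print(f"{score}: {with_score} of {total} matching lines")
```

Comparing one named column per row, rather than searching the whole line, would avoid the substring false positives.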

Some observations:

[image: search]

liammulh commented 11 months ago

I wrote a script that prints columns instead of rows. Here are the REVEL_score and REVEL_rankscore values for hg19_pos 152959399:

('REVEL_score', '0.173', '0.653;0.653', '0.177', '0.653;0.653', '0.606;0.606')
('REVEL_rankscore', '0.43840', '0.86936', '0.44549', '0.86936', '0.84506')
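
The script itself was not posted. A minimal sketch of one way to print columns instead of rows might look like the following. The function name `columns_for_position` is hypothetical, the position column is assumed to be named `hg19_pos(1-based)`, and the sample header is cut down to three columns for illustration.

```python
import csv
import io


def columns_for_position(tsv_file, position, pos_column="hg19_pos(1-based)"):
    """Return one tuple per column: (column name, value in row 1, value in row 2, ...).

    Only rows whose `pos_column` field equals `position` exactly are kept.
    """
    # QUOTE_NONE keeps any quote characters in the data as literal text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    pos_index = header.index(pos_column)
    rows = [row for row in reader if row[pos_index] == position]
    # zip(header, *rows) transposes the matching rows into per-column tuples.
    return list(zip(header, *rows))


# Cut-down sample; the real files have many more columns (assumption).
SAMPLE_TSV = (
    "hg19_pos(1-based)\tREVEL_score\tREVEL_rankscore\n"
    "152959399\t0.173\t0.43840\n"
    "152959400\t0.500\t0.70000\n"
    "152959399\t0.653;0.653\t0.86936\n"
)

if __name__ == "__main__":
    for column in columns_for_position(io.StringIO(SAMPLE_TSV), "152959399"):
        if column[0].startswith("REVEL"):
            print(column)
```

For a real chromosome file, the `io.StringIO` object could be swapped for `gzip.open(path, "rt")`, which reads the `.gz` file as text without decompressing it to disk first.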

liammulh commented 11 months ago

I initially planned on using https://github.com/ClinGen/gci-vci-aws/issues/1378 to track the work for this issue, but I think it makes more sense to track the work here.

liammulh commented 11 months ago

Okay, I've updated https://github.com/biothings/myvariant.info/issues/179#issuecomment-1843628916. Hopefully it is useful to them. I am going to move on to other tickets for now.

cgpreston commented 11 months ago

@liammulh: when you're back next week, can we chat about a new approach to this project? Thanks

wrightmw commented 7 months ago

Now that we're bringing these data in via the LDH, switch this project to Bryan so he can look into them there.