OATML / EVE

Official repository for the paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning". Joint collaboration between the Marks lab and the OATML group.
http://evemodel.org/
MIT License

Why is EVE's score of a protein missing a large fragment? Like the HCN4 protein #9

Open Licko0909 opened 1 year ago

Licko0909 commented 1 year ago

Hello, author! EVE is very good work. Thank you so much for your contributions to the community.

I recently ran into some problems while using EVE to score genetic variants, and I am very much looking forward to your reply.

I would like to ask the following three questions:

1) What are "_ASM" and "_BPU"? Is there documentation that describes each column? When the two results differ, which one should be used? For example, see the CSV file for PTEN.

2) Which transcripts do the 3,000+ proteins on EVE's website refer to? I ask because the amino acid change corresponding to a given variant differs between transcripts. The reference I am using is the MANE project (http://tark.ensembl.org/web/mane_project/):

The Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration that aims to converge on human gene annotation and to produce a genome wide transcript set that includes pairs of RefSeq (NM) and Ensembl/GENCODE (ENST) transcripts that are 100% identical.

3) Why is a large fragment missing from EVE's scores for some proteins, such as HCN4 (https://evemodel.org/proteins/HCN4_HUMAN)?

I am looking forward to your reply very much!

Kind regards, Licko

pascalnotin commented 1 year ago

Hi @Licko0909,

Thank you for the kind words! To answer your questions:

  1. To compute mutation scores with EVE (i.e., to estimate the delta ELBO quantity), we sample many times from the VAE's approximate posterior. When developing EVE, we tested how much the number of samples affects final prediction performance: our default setting is 20k samples, and we tested up to 200k samples. Since computing scores with 200k samples is computationally intensive, we only performed that analysis on the mutations with entries in ClinVar (i.e., BPU, for "Benign, Pathogenic or Uncertain" labels; that is where the "_BPU" suffix comes from). We observed only a slight performance increase going from 20k to 200k samples, so we kept 20k samples as our default when generating scores for all possible single mutants ("_ASM" stands for "All single mutants"). The number of samples is the only difference between the two sets of numbers: the _BPU version should be marginally better but is only available for a subset of variants (those with an entry in ClinVar).
  2. @jonnyfrazer -- could you please clarify re: transcript used?
  3. In the original EVE architecture (used in our paper), we only leveraged positions that were sufficiently covered in the MSA. Consequently, we were only able to score mutations at these well-covered positions. I have not checked these particular positions, but the fragments with missing scores in your example most likely correspond to positions that were not sufficiently covered in the corresponding MSA. That said, there is no strict requirement to model only well-covered positions. We have observed in practice that also including the poorly covered positions in the EVE models 1) is not detrimental to predictive performance when scoring mutations at well-covered positions only, and 2) makes it possible to score mutations at poorly covered positions as well. We are working on updated versions of our models that will address this limitation.
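
To make the coverage argument in point 3 concrete, here is a minimal sketch of how poorly covered MSA columns could be located. It is an illustration only, not EVE's actual preprocessing code: the helper names `covered_positions` and `uncovered_ranges`, the toy alignment, and the 0.3 gap-fraction threshold are all assumptions made for this example.

```python
from typing import List, Tuple

GAP_CHARS = {"-", "."}  # characters treated as gaps (an assumption for this sketch)


def covered_positions(msa: List[str], max_gap_fraction: float = 0.3) -> List[bool]:
    """Return one flag per alignment column: True if the column is well covered.

    A column counts as well covered when its fraction of gap characters is at
    most max_gap_fraction. The default of 0.3 is illustrative, not a value
    confirmed in this thread.
    """
    if not msa:
        raise ValueError("MSA is empty")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    flags = []
    for col in range(length):
        gaps = sum(1 for seq in msa if seq[col] in GAP_CHARS)
        flags.append(gaps / len(msa) <= max_gap_fraction)
    return flags


def uncovered_ranges(flags: List[bool]) -> List[Tuple[int, int]]:
    """Collapse per-column flags into 1-based inclusive ranges of uncovered columns."""
    ranges = []
    start = None
    for i, ok in enumerate(flags, start=1):
        if not ok and start is None:
            start = i  # an uncovered stretch begins here
        elif ok and start is not None:
            ranges.append((start, i - 1))  # the stretch ended at the previous column
            start = None
    if start is not None:
        ranges.append((start, len(flags)))  # stretch runs to the end of the alignment
    return ranges


# Toy alignment (hypothetical): columns 5 and 6 are gaps in 3 of 4 sequences.
toy_msa = [
    "MKTAYIAK",
    "MKTA--AK",
    "MKT---AK",
    "-KTA--AK",
]
print(uncovered_ranges(covered_positions(toy_msa)))  # [(5, 6)]
```

Running a check like this on the alignment used for a given protein would show whether a stretch of missing scores lines up with a stretch of poorly covered columns.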

Kind regards, Pascal

Licko0909 commented 1 year ago

Thank you so much! Glad to hear from you! Your answer has cleared up many of my doubts. I am also looking forward to the reply to the second question.