NREL / EvoProtGrad

Directed evolution of proteins in sequence space with gradients
https://nrel.github.io/EvoProtGrad/
BSD 3-Clause "New" or "Revised" License

Is it possible to get the importance score of a protein sequence? #3

Closed anonimoustt closed 2 hours ago

anonimoustt commented 5 months ago

I was wondering whether it is possible to get an importance score for a protein sequence using the EvoProtGrad model. For instance, the https://huggingface.co/datasets/waylandy/phosformer_curated dataset contains kinase enzymes, and I want to rank them by importance score.

Furthermore, I found in https://colab.research.google.com/drive/1e8WjYEbWiikRQg3g4YHQJJcpvTIWVAjp?usp=sharing that scores are generated for different variants of a protein sequence. But what is the score of the original protein sequence? If the score of the original sequence can be measured, can it be compared with the scores of the variants?

pemami4911 commented 5 months ago

Hi, sorry for the delay in getting back to you!

The score of the original protein sequence (i.e., the wild type sequence specified via the wt_fasta or wt_protein arguments of the DirectedEvolution sampler class) is stored in the wt_score attribute of each expert. Each expert uses this wt_score to compute the relative score of a variant with respect to the wild type.

As for getting importance scores for each variant, the DirectedEvolution sampler returns both the list of variants and their corresponding scores as a tuple. As the demo notebook shows, when the output argument is set to "all", the scores tensor has shape [parallel_chains, n_steps], and it's up to you to decide whether to grab the last score for each chain (scores[:,-1]), the best, etc.
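As a rough illustration of that choice, here is a minimal sketch in plain Python. The nested list is only a stand-in for the scores tensor, and the numbers are made up for the example; nothing here calls EvoProtGrad itself.

```python
# Toy stand-in for the scores returned with output="all":
# one row per chain, one column per sampling step (values are made up).
scores = [
    [0.0, 0.4, 1.2, 0.9],   # chain 0
    [0.0, -0.3, 0.5, 1.7],  # chain 1
]

# Last score of each chain (the analogue of scores[:, -1] on a tensor).
last_scores = [chain[-1] for chain in scores]

# Best score of each chain, and the step at which it was reached,
# so the matching variant can be looked up at the same position.
best_scores = [max(chain) for chain in scores]
best_steps = [chain.index(max(chain)) for chain in scores]

print(last_scores)  # [0.9, 1.7]
print(best_scores)  # [1.2, 1.7]
print(best_steps)   # [2, 3]
```

The step index matters because the variants come back in the same layout as the scores, so the best score's position also identifies the variant that produced it.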

anonimoustt commented 5 months ago

It is not clear. Specifically, from the code:

```python
variants, scores = evo_prot_grad.DirectedEvolution(
    wt_protein = wildtype_sequence,
    output = 'best',            # return best, last, all variants
    experts = [expert],         # list of experts to compose
    parallel_chains = 2,        # number of parallel chains to run
    n_steps = 100,              # number of MCMC steps per chain
    max_mutations = -1,         # maximum number of mutations per variant
    preserved_regions = None,   # List of regions (start,end) to preserve
    verbose = False             # print debug info to command line
)()

wtseq = ' '.join(wildtype_sequence.strip())

for v, s in zip(variants, scores):
    evo_prot_grad.common.utils.print_variant_in_color(v, wtseq)
    print(s)
```

If I set output = 'all', will I get the original sequence and its score along with the variants?

pemami4911 commented 5 months ago

No, scores will only contain a score for each variant, even if output is set to all. Here, all refers to returning the intermediate scores of the variants at each sampling step. In this example, scores would have shape [2,100] since parallel_chains = 2 and n_steps = 100. If having the wildtype sequence's score returned alongside the scores of each variant is useful, I can add that.

anonimoustt commented 5 months ago

Hi, yes, it would be helpful if the score of the original sequence could be returned. I did not understand why scores would have shape [2,100]; I see the scores as floating-point numbers. Does parallel_chains = 2 mean that the top two variants by score are returned? Would you please clarify?

Also, how is the score computed? Are you taking embeddings, for example using the ESM-2 model to compute the embeddings of the original sequence and its variants, and then computing the cosine similarity?

pemami4911 commented 4 months ago

I think it could help to spend a little time reading the documentation about what scores are in EvoProtGrad and how they are estimated: https://nrel.github.io/EvoProtGrad/getting_started/experts/#what-is-a-product-of-experts

The score in EvoProtGrad is an unnormalized log probability. In practice, however, we subtract the wild type sequence's log prob from the variant's log prob, so the score is actually a difference between log probs.
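To make the "difference between log probs" concrete, here is a small sketch in plain Python. The function name relative_score, the variant names, and all log-prob values are hypothetical and chosen only for illustration; this is not EvoProtGrad's internal code.

```python
def relative_score(variant_log_prob: float, wt_log_prob: float) -> float:
    """Score of a variant relative to the wild type (difference of log probs)."""
    return variant_log_prob - wt_log_prob


# Hypothetical unnormalized log probs (made-up numbers).
wt_log_prob = -120.0
variant_log_probs = {"variant_a": -118.5, "variant_b": -123.0}

rel = {
    name: relative_score(lp, wt_log_prob)
    for name, lp in variant_log_probs.items()
}
# Under this definition the wild type scores 0 relative to itself.
rel["wild_type"] = relative_score(wt_log_prob, wt_log_prob)

# Rank from highest to lowest relative score.
ranked = sorted(rel, key=rel.get, reverse=True)
print(rel)     # {'variant_a': 1.5, 'variant_b': -3.0, 'wild_type': 0.0}
print(ranked)  # ['variant_a', 'wild_type', 'variant_b']
```

Read this way, a positive score means the expert assigns the variant a higher log prob than the wild type, and a negative score means a lower one; the score is not a measure of sequence similarity to the wild type.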

The shape of the scores tensor will vary depending on what you set the argument output to. If output = best or output = last, that means for each of the parallel_chains Markov chains, either the best/last (respectively) variants will be returned. Hence, scores has shape [parallel_chains]. When output = all, this means every variant produced by each Markov chain at each step 1..n_steps will be returned, hence scores has shape [parallel_chains, n_steps]. This is useful when entire distributions of "good" variants are desired instead of just point estimates of "good" variants.
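The rule above can be written down as a tiny helper. This is only a sketch that encodes the description in this comment; expected_scores_shape is a hypothetical name, not a function in EvoProtGrad.

```python
def expected_scores_shape(output: str, parallel_chains: int, n_steps: int) -> tuple:
    """Shape of the scores tensor for a given `output` setting, as described above."""
    if output in ("best", "last"):
        # One variant (and one score) per Markov chain.
        return (parallel_chains,)
    if output == "all":
        # One variant (and one score) per chain per sampling step.
        return (parallel_chains, n_steps)
    raise ValueError(f"unknown output setting: {output!r}")


print(expected_scores_shape("best", 2, 100))  # (2,)
print(expected_scores_shape("last", 2, 100))  # (2,)
print(expected_scores_shape("all", 2, 100))   # (2, 100)
```

So with parallel_chains = 2 and output = 'best', the two floats printed in the loop are one score per chain, not the top two variants of a single run.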

anonimoustt commented 4 months ago

Thanks. EvoProtGrad is really interesting. I am working on kinase domain sequences (https://huggingface.co/datasets/waylandy/phosformer_curated/raw/main/curated/phosphosites_11mer_kinase_specific.tsv). EvoProtGrad might be an interesting tool for generating variants of a kinase sequence for analysis.

anonimoustt commented 4 months ago

Hi, one more query: can EvoProtGrad be used to detect a significant connection between two protein sequences? Let us say I have two sequences, protein 1 and protein 2. Using EvoProtGrad, I get the top 3 variants of protein 1 and the top 3 variants of protein 2. If I then compute similarity scores between the variants, is it possible to get the relational significance of protein 1 and protein 2?

anonimoustt commented 4 months ago

Hi,

I see that if parallel_chains = 5, I get 5 variants and their corresponding scores. Does a higher score mean the variant is closer to the original sequence?

pemami4911 commented 2 hours ago

Accessing a particular expert's score for a variant sequence is now easier in v0.2 (https://github.com/NREL/EvoProtGrad/releases/tag/v0.2). You can now call get_model_output with an expert to get that expert's score. API reference: https://nrel.github.io/EvoProtGrad/api/experts/