Closed anonimoustt closed 2 hours ago
Hi, sorry for the delay in getting back to you!
The score of the original protein sequence (i.e., the wild type sequence specified via the wt_fasta or wt_protein arguments of the DirectedEvolution sampler class) is stored in the wt_score attribute within each expert. Each expert uses this wt_score to compute the relative score of a variant with respect to the wild type.
As for getting importance scores for each variant, the DirectedEvolution sampler returns both the list of variants and their corresponding scores as a tuple. You can see this in the demo notebook: when the output argument is set to "all", the scores tensor will have shape [parallel_chains, steps], and it's up to you to decide whether to grab the last score for each variant (scores[:,-1]) or the best, etc.
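To make the "last versus best" choice concrete, here is a minimal sketch in plain Python. It does not call EvoProtGrad; the nested list stands in for a scores tensor of shape [parallel_chains, steps], and every value in it is made up for illustration.

```python
# Hypothetical scores for parallel_chains = 2 and 4 steps (made-up values).
# Row i holds the score of chain i's variant after each sampling step.
scores = [
    [0.1, 0.4, 0.3, 0.2],
    [0.0, -0.1, 0.5, 0.6],
]

# Score of the final variant in each chain (what scores[:, -1] selects on a tensor).
last_scores = [chain[-1] for chain in scores]

# Highest score seen in each chain, and the step index at which it occurred.
best_scores = [max(chain) for chain in scores]
best_steps = [chain.index(max(chain)) for chain in scores]

print("last:", last_scores)
print("best:", best_scores, "at steps", best_steps)
```

In the first chain the last and best scores differ, which is why the choice is left to the user.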
It is not clear. Specifically, from the code

variants, scores = evo_prot_grad.DirectedEvolution(
    wt_protein = wildtype_sequence,
    output = 'best',            # return best, last, all variants
    experts = [expert],         # list of experts to compose
    parallel_chains = 2,        # number of parallel chains to run
    n_steps = 100,              # number of MCMC steps per chain
    max_mutations = -1,         # maximum number of mutations per variant
    preserved_regions = None,   # List of regions (start,end) to preserve
    verbose = False             # print debug info to command line
)()
wtseq = ' '.join(wildtype_sequence.strip())
for v, s in zip(variants, scores):
    evo_prot_grad.common.utils.print_variant_in_color(v, wtseq)
    print(s)

If I set output = 'all', will I get the original sequence and its score along with the variants?
No, scores will only contain a score for each variant, even if output is set to all. Here, all refers to returning the intermediate scores of the variants at each sampling step. In this example, scores would have shape [2,100] since parallel_chains = 2 and n_steps = 100.
If having the wildtype sequence's score returned alongside the scores of each variant is useful, I can add that.
Hi, yes, it would be helpful if the score of the original sequence could be returned. I did not understand why scores would have shape [2,100]; I see each score as a single float. Does parallel_chains = 2 mean that the top two variants by score are returned? Would you please clarify?
Also, how is the score computed? For example, using the ESM-2 model, do you compute embeddings of the original sequence and its variants and then take the cosine similarity between them?
I think it could help to spend a little time reading the documentation (https://nrel.github.io/EvoProtGrad/getting_started/experts/#what-is-a-product-of-experts) about what scores are in EvoProtGrad and how they are estimated. The score in EvoProtGrad is an unnormalized log probability. However, in practice we subtract the wild type sequence's log prob from the variant's log prob, so the score is actually a difference between log probs.
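The subtraction described above can be sketched as follows. This is plain arithmetic, not EvoProtGrad's implementation; the function name relative_score and all log-probability values are hypothetical and chosen only for illustration.

```python
def relative_score(variant_log_prob: float, wt_log_prob: float) -> float:
    """Difference between a variant's and the wild type's unnormalized log probs."""
    return variant_log_prob - wt_log_prob


# Made-up unnormalized log probabilities, as one expert might assign them.
wt_log_prob = -120.0
variant_log_probs = {"variant_a": -118.5, "variant_b": -123.0}

relative = {
    name: relative_score(log_prob, wt_log_prob)
    for name, log_prob in variant_log_probs.items()
}

# Scored against itself, the wild type has a relative score of zero by construction.
wt_relative = relative_score(wt_log_prob, wt_log_prob)

print(relative)
print(wt_relative)
```

Under this convention, a positive value means the expert assigns the variant a higher log probability than the wild type, and a negative value means a lower one.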
The shape of the scores tensor varies depending on what you set the output argument to. If output = best or output = last, then for each of the parallel_chains Markov chains, the best or last variant (respectively) is returned. Hence, scores has shape [parallel_chains]. When output = all, every variant produced by each Markov chain at each step 1..n_steps is returned, hence scores has shape [parallel_chains, n_steps]. This is useful when entire distributions of "good" variants are desired instead of just point estimates of "good" variants.
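The shape rule above can be written down as a small helper. This is only a restatement of the description in this thread, not code from the library; expected_scores_shape is a hypothetical name.

```python
def expected_scores_shape(output: str, parallel_chains: int, n_steps: int) -> tuple:
    """Shape of the scores tensor for each output mode, as described in this thread."""
    if output in ("best", "last"):
        # One score per Markov chain.
        return (parallel_chains,)
    if output == "all":
        # One score per chain per sampling step.
        return (parallel_chains, n_steps)
    raise ValueError(f"unknown output mode: {output!r}")


# The settings from the example earlier in the thread.
print(expected_scores_shape("best", parallel_chains=2, n_steps=100))
print(expected_scores_shape("all", parallel_chains=2, n_steps=100))
```

With output = 'best' and parallel_chains = 2, iterating over scores yields two individual values, which matches seeing plain floats when printing them one at a time.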
Thanks. EvoProtGrad is really interesting. I am working on kinase domain sequences (https://huggingface.co/datasets/waylandy/phosformer_curated/raw/main/curated/phosphosites_11mer_kinase_specific.tsv). EvoProtGrad might be an interesting tool for generating variants of a kinase sequence for analysis.
Hi, one more query: can EvoProtGrad be used to detect a significant connection between two protein sequences? Let us say I have two sequences, protein 1 and protein 2. Using EvoProtGrad, I get the top 3 variants of protein 1 and the top 3 variants of protein 2. If I then compute similarity scores between the variants, is it possible to get the relational significance of protein 1 and protein 2?
Hi,
I see that if parallel_chains = 5, I get 5 variants and their corresponding scores. Does a higher score mean the variant is closer to the original sequence?
Accessing a particular expert's score for a variant sequence is now easier in v0.2 (https://github.com/NREL/EvoProtGrad/releases/tag/v0.2). You can now call get_model_output with an expert to get this particular expert's score (https://nrel.github.io/EvoProtGrad/api/experts/).
I was just wondering: is it possible to get an importance score for a protein sequence using an EvoProtGrad model? For instance, the https://huggingface.co/datasets/waylandy/phosformer_curated data contains kinase enzymes. I want to rank the kinase enzymes based on importance scores.
Furthermore, I found in https://colab.research.google.com/drive/1e8WjYEbWiikRQg3g4YHQJJcpvTIWVAjp?usp=sharing that scores are generated for the different variants of a protein sequence. But what is the score of the original protein sequence? If the score of the original sequence can be measured, can it then be compared with the scores of the variants?