NREL / EvoProtGrad

Directed evolution of proteins in sequence space with gradients
https://nrel.github.io/EvoProtGrad/
BSD 3-Clause "New" or "Revised" License

Using Masked Marginal Score for ESM-2 as a Scoring Method #2

Closed Amelie-Schreiber closed 2 weeks ago

Amelie-Schreiber commented 8 months ago

In the paper "Language models enable zero-shot prediction of the effects of mutations on protein function", the ESM authors introduce the "Masked Marginal Scoring" method for computing the effects of mutations on function and show that it performs significantly better than the log-likelihood ratio (LLR) method. If I am not mistaken, EvoProtGrad currently uses LLR. Could the masked marginal scoring code from the ESM GitHub repository (where they use ESM-1v) be adapted to work with ESM-2 and used in EvoProtGrad as a scoring method? The masked marginal score is defined as

$$ \sum_{i \in M} \left[ \log p(x_i = x_i^{mt} \mid x_{-M}) - \log p(x_i = x_i^{wt} \mid x_{-M}) \right] $$

in Appendix A of the paper above (bottom of page 18), where $x_{-M}$ denotes the sequence with masks at every position in $M$, the set of positions where mutations occur. That is, they mask all mutated positions at once and score each mutation by its log-probability relative to that of the wild-type amino acid. This might significantly improve the scoring and could be a nice alternative scoring strategy.
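To make the definition concrete, here is a minimal sketch of the sum above in plain Python. It assumes the per-position log-probabilities have already been obtained from a model that was fed the sequence with every mutated position masked; the function names, the three-letter alphabet, and the toy probabilities are all made up for illustration and are not taken from the ESM or EvoProtGrad code.

```python
import math


def mask_mutated_positions(wt_seq, mt_seq, mask_token="<mask>"):
    """Build the model input x_{-M}: mask every position where wt and mt differ."""
    return [mask_token if wt != mt else wt for wt, mt in zip(wt_seq, mt_seq)]


def masked_marginal_score(log_probs, wt_seq, mt_seq, alphabet):
    """Sum over mutated positions of log p(mt) - log p(wt).

    log_probs[i][a] is assumed to be the model's log-probability of residue
    alphabet[a] at position i, computed from the masked input x_{-M}.
    """
    if len(wt_seq) != len(mt_seq):
        raise ValueError("wild-type and mutant sequences must have equal length")
    index = {aa: k for k, aa in enumerate(alphabet)}
    score = 0.0
    for i, (wt, mt) in enumerate(zip(wt_seq, mt_seq)):
        if wt != mt:  # only positions in M contribute
            score += log_probs[i][index[mt]] - log_probs[i][index[wt]]
    return score


# Toy example: made-up probabilities standing in for real model output.
alphabet = "ACD"
toy_log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.3), math.log(0.6)],
    [math.log(0.2), math.log(0.4), math.log(0.4)],
]
wild_type, mutant = "ACA", "ADC"

masked_input = mask_mutated_positions(wild_type, mutant)
score = masked_marginal_score(toy_log_probs, wild_type, mutant, alphabet)
print(masked_input, score)
```

In this toy case positions 1 and 2 are mutated, so both are masked in a single model input and each contributes one log-ratio to the score.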

pemami4911 commented 7 months ago

Hi, this could definitely be supported in EvoProtGrad with some minor modifications to the code in the HuggingFaceExpert class.

The masked marginal score uses the masked mutated protein sequence as input to the protein language model (PLM) and uses the (unmasked) mutations to compute the score. This could be supported by extending this __call__ function https://github.com/NREL/EvoProtGrad/blob/fda1d39d2106c252cb529400c0cdd790ac7b62df/evo_prot_grad/experts/base_experts.py#L117 so that it accepts auxiliary information about what the mutations are at the masked locations in the input (e.g., passing the unmasked sequence).

Allowing the HuggingFaceExpert's score function to be selected via a constructor argument seems reasonable as well; that way the __call__ function can switch between different score functions on the fly.
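As a rough sketch of what selecting the score function in the constructor could look like: the class below is a hypothetical stand-in, not the actual HuggingFaceExpert, and `ToyExpert`, `scoring_strategy`, `toy_predict_log_probs`, and the strategy names are assumptions made for illustration. The toy "model" simply gives probability 0.5 to whatever residue it sees at each position.

```python
import math

ALPHABET = "ACD"  # toy three-letter alphabet, for illustration only


def toy_predict_log_probs(context):
    """Hypothetical stand-in for a PLM: the residue present in the context
    gets probability 0.5 at its position, the other two get 0.25 each."""
    return [
        [math.log(0.5 if aa == seen else 0.25) for aa in ALPHABET]
        for seen in context
    ]


class ToyExpert:
    """Picks which sequence the model conditions on from a constructor argument."""

    def __init__(self, predict_log_probs, scoring_strategy="mutant_marginal"):
        context_pickers = {
            "wildtype_marginal": lambda wt, mt: wt,  # condition on the wild type
            "mutant_marginal": lambda wt, mt: mt,    # condition on the variant
        }
        if scoring_strategy not in context_pickers:
            raise ValueError(f"unknown scoring_strategy: {scoring_strategy!r}")
        self.predict_log_probs = predict_log_probs
        self.pick_context = context_pickers[scoring_strategy]

    def __call__(self, wt_seq, mt_seq):
        log_probs = self.predict_log_probs(self.pick_context(wt_seq, mt_seq))
        score = 0.0
        for i, (wt, mt) in enumerate(zip(wt_seq, mt_seq)):
            if wt != mt:  # sum log-ratios over mutated positions only
                score += (log_probs[i][ALPHABET.index(mt)]
                          - log_probs[i][ALPHABET.index(wt)])
        return score


wt_expert = ToyExpert(toy_predict_log_probs, scoring_strategy="wildtype_marginal")
mt_expert = ToyExpert(toy_predict_log_probs, scoring_strategy="mutant_marginal")
wt_score = wt_expert("ACA", "ADA")
mt_score = mt_expert("ACA", "ADA")
print(wt_score, mt_score)
```

The same `__call__` body serves both strategies; only the conditioning sequence changes, which is why a lookup table keyed by the constructor argument is enough here.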

Amelie-Schreiber commented 7 months ago

Would you be up for working on this with me? If so, what is the best way to get in touch?

pemami4911 commented 7 months ago

Definitely; I will be attending NeurIPS all next week so it's unlikely I'll have time to get to this until after that. In general, I am happy to review and accept pull requests too! 🙂

Feel free to send me an email at Patrick.Emami@nrel.gov to get in touch offline.

pemami4911 commented 2 weeks ago

Flexible variant scoring has been added in the latest v0.2 release (at long last)! https://github.com/NREL/EvoProtGrad/releases/tag/v0.2.

N.B. There is support for mutant_marginal scoring but not masked_marginal, since adding mask tokens to the variant messes with the gradient computation with respect to the (unmasked) one-hot-encoded variant sequence.
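One possible reading of the gradient issue, shown on a toy model rather than on EvoProtGrad itself: if a masked position's one-hot row is replaced by a fixed mask input, the model output no longer depends on that row through the input, so that part of the gradient is zero. Everything below (the linear `toy_score`, `WEIGHTS`, `MASK_CONTRIBUTION`, the finite-difference helper) is an assumption made for illustration and is not a statement about the library's internals.

```python
WEIGHTS = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy per-position weights
MASK_CONTRIBUTION = 0.0  # constant contribution of a masked position


def toy_score(one_hot, masked_positions=()):
    """Linear toy score; masked positions ignore their one-hot row entirely."""
    total = 0.0
    for i, row in enumerate(one_hot):
        if i in masked_positions:
            total += MASK_CONTRIBUTION  # row is replaced by a fixed mask input
        else:
            total += sum(w * x for w, x in zip(WEIGHTS[i], row))
    return total


def finite_difference_grad(score_fn, one_hot, eps=1e-6):
    """Forward-difference estimate of d score / d one_hot[i][a]."""
    base = score_fn(one_hot)
    grad = []
    for i, row in enumerate(one_hot):
        grad_row = []
        for a in range(len(row)):
            bumped = [list(r) for r in one_hot]
            bumped[i][a] += eps
            grad_row.append((score_fn(bumped) - base) / eps)
        grad.append(grad_row)
    return grad


variant = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad_unmasked = finite_difference_grad(toy_score, variant)
grad_masked = finite_difference_grad(
    lambda x: toy_score(x, masked_positions=(1,)), variant
)
print(grad_unmasked[1], grad_masked[1])
```

Without masking, the gradient at position 1 recovers that position's weights; with position 1 masked, the same entries are zero, so a gradient-guided sampler would get no signal through the input at exactly the positions being scored.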