Closed. imSeaton closed this issue 4 months ago.
Thank you for your question, and I apologize for the late response.
First, regarding the result of 11.796: we have updated it to 9.5 in our latest manuscript. This adjustment reflects a change in our evaluation approach, in which we use a second oracle model instead of the optimization model (i.e., ESM-2 with 35M parameters) to assess performance. Second, concerning the reliability of this result, please note that these are in-silico findings, which still need to be validated through wet-lab experiments; that is the only way to verify the effectiveness of our method, or of any other.
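To make that evaluation protocol concrete, here is a minimal sketch of the two-model idea, with stand-in scikit-learn regressors rather than the paper's ESM-2-based models (all names and data here are hypothetical): candidates are optimized against one model, but the reported fitness comes from an independently trained oracle.

```python
# Illustrative sketch (not the authors' exact pipeline): score candidate
# sequences with an independent oracle model rather than the model that
# guided the optimization.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq: str) -> np.ndarray:
    """Flatten a sequence into a one-hot feature vector."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

# Toy training data standing in for the avGFP fitness dataset.
rng = np.random.default_rng(0)
train_seqs = ["".join(rng.choice(list(AMINO_ACIDS), size=10)) for _ in range(200)]
train_fitness = rng.uniform(1.28, 4.12, size=200)  # label range cited in this issue

X = np.stack([one_hot(s) for s in train_seqs])

# Two separately seeded models: one guides optimization, one is held out as the oracle.
optimization_model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, train_fitness)
oracle_model = RandomForestRegressor(n_estimators=100, random_state=2).fit(X, train_fitness)

# Candidates produced by directed evolution would then be scored like this:
candidates = ["".join(rng.choice(list(AMINO_ACIDS), size=10)) for _ in range(5)]
Xc = np.stack([one_hot(s) for s in candidates])
reported_fitness = oracle_model.predict(Xc)  # report the oracle's score, not the optimizer's
print(reported_fitness)
```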
Thank you for your work. I think integrating protein language models into Directed Evolution is a good idea.
However, I have a question. In your paper, Directed Evolution significantly increases the fitness of the mutated sequences, but these fitness values appear to be predictions from models trained on the training dataset, and the final predicted fitness values exceed the range of the training labels. Is this reasonable? For example, I analyzed the fitness range of the avGFP training data you provided, which is 1.28 to 4.12. I believe the model may not predict fitness values above 4.12 accurately (due to the lack of training data in that range and the model's disregard for out-of-distribution inputs), yet your Directed Evolution method improves avGFP's fitness to an out-of-distribution value of 11.796, which seems unreasonable. What are your thoughts on this?
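As a minimal illustration of this concern (hypothetical code, not from the repository): a simple sanity check can flag any predicted fitness that falls outside the observed range of the training labels, where the regressor has no data to support its extrapolation.

```python
# Hypothetical sanity check: flag predicted fitness values that fall outside
# the range of the training labels (e.g., 1.28-4.12 for avGFP).
import numpy as np

def flag_out_of_distribution(predictions, train_labels, margin=0.0):
    """Return a boolean mask marking predictions outside the training label range."""
    lo = np.min(train_labels) - margin
    hi = np.max(train_labels) + margin
    preds = np.asarray(predictions)
    return (preds < lo) | (preds > hi)

train_labels = np.array([1.28, 2.5, 3.3, 4.12])  # stand-in for the avGFP labels
predictions = np.array([3.9, 4.5, 11.796])       # values like those discussed above

for p, ood in zip(predictions, flag_out_of_distribution(predictions, train_labels)):
    print(f"predicted fitness {p:.3f} -> {'outside' if ood else 'within'} training range")
```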
Lastly, I want to say again that guiding directed evolution with the residue probability distributions of protein language models is a good idea. Thank you for your work and for the open-source code.
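For readers unfamiliar with that idea, here is a rough sketch of how a language model's per-residue probabilities can propose mutations (hypothetical code with a random stand-in distribution; a real run would take the probabilities from a masked language model such as ESM-2):

```python
# Minimal sketch of PLM-guided mutation proposal (illustrative only).
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutation(seq: str, residue_probs: np.ndarray, rng: np.random.Generator) -> str:
    """Sample a position uniformly, then sample the new residue from the
    model's probability distribution at that position."""
    pos = rng.integers(len(seq))
    new_aa = rng.choice(list(AMINO_ACIDS), p=residue_probs[pos])
    return seq[:pos] + new_aa + seq[pos + 1:]

rng = np.random.default_rng(0)
seq = "MSKGEELFTG"  # first ten residues of avGFP, for illustration
# Stand-in for the PLM's per-position distribution over the 20 amino acids.
probs = rng.dirichlet(np.ones(len(AMINO_ACIDS)), size=len(seq))
print(propose_mutation(seq, probs, rng))
```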