HySonLab / Directed_Evolution

Protein Design by Machine Learning guided Directed Evolution
https://www.biorxiv.org/content/10.1101/2023.11.28.568945v1
GNU General Public License v3.0

Can the model's predictions of values outside the range of the training data labels be trusted? #1

Closed: imSeaton closed this issue 4 months ago

imSeaton commented 8 months ago

Thank you for your work. I think integrating protein language models into Directed Evolution is a good idea.

However, I have a question. In your paper, Directed Evolution significantly increases the fitness of the mutated sequences, but those fitness values appear to be predictions from models trained on the training dataset, and the final predicted values exceed the range of the training labels. Is this reasonable? For example, I analyzed the avGFP training data you provided, and its fitness values range from 1.28 to 4.12. I suspect the model is not accurate when predicting fitness above 4.12, because it has no training data in that region and does not account for out-of-distribution data. Yet your Directed Evolution method improves avGFP's fitness to an out-of-distribution value of 11.796, which seems unreasonable. What are your thoughts on this?
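
To make the concern concrete, here is a minimal sketch of the range check described above. Only the label range (1.28 to 4.12) and the reported values 11.796 and 9.5 come from this thread; the function name `flag_extrapolated`, the intermediate training labels, and the in-range prediction are hypothetical placeholders, not part of the repository.

```python
from typing import List, Sequence, Tuple


def flag_extrapolated(
    predictions: Sequence[float], train_labels: Sequence[float]
) -> List[Tuple[float, bool, float]]:
    """Flag predictions that fall outside the training label range.

    Returns (prediction, is_out_of_range, excess) tuples, where excess is
    the distance beyond the nearest bound, in units of the label range.
    """
    lo, hi = min(train_labels), max(train_labels)
    span = hi - lo
    results = []
    for p in predictions:
        if p > hi:
            excess = (p - hi) / span
        elif p < lo:
            excess = (lo - p) / span
        else:
            excess = 0.0
        results.append((p, excess > 0.0, excess))
    return results


# Assumption: only the min (1.28) and max (4.12) are taken from the thread;
# the values in between are made up for illustration.
train_labels = [1.28, 2.50, 3.90, 4.12]

# 3.70 is a made-up in-range prediction; 9.5 and 11.796 are the reported values.
predictions = [3.70, 9.5, 11.796]

flags = flag_extrapolated(predictions, train_labels)
for pred, out_of_range, excess in flags:
    status = "extrapolated" if out_of_range else "in range"
    print(f"{pred:7.3f}  {status:12s}  {excess:.2f} x label range beyond bound")
```

A check like this does not say whether an extrapolated prediction is wrong. It only marks which predictions the training labels cannot support directly.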

Lastly, I want to say again that guiding directed evolution with the residue probability distribution of protein language models is a good idea. Thank you for your work and the open-source code.

thanhtvt commented 4 months ago

Thank you for your question, and I apologize for the late response.

First, regarding the result of 11.796: we have updated it to 9.5 in our latest manuscript. The change reflects a revised evaluation approach, in which we used the second oracle model instead of the optimization model (i.e., ESM-2 with 35M parameters) to assess performance. Second, regarding the reliability of this result: please note that these are in-silico findings, which need to be validated through wet-lab experiments. That is the only way to verify the effectiveness of our method, or of any other.
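
The general idea of scoring candidates with a model other than the one that guided the search can be sketched as follows. This is a toy illustration, not the evaluation code from the repository: `optimizer_score` and `oracle_score` are hypothetical stand-ins for the two fitness predictors, and the sequences are made up.

```python
from typing import Callable, Dict, List, Sequence


def evaluate_with_oracle(
    candidates: Sequence[str],
    optimizer_score: Callable[[str], float],
    oracle_score: Callable[[str], float],
) -> List[Dict[str, float]]:
    """Score each candidate with both models and report the gap.

    A large positive gap suggests the search exploited quirks of the
    model that guided it rather than finding a real improvement.
    """
    report = []
    for seq in candidates:
        opt = optimizer_score(seq)
        orc = oracle_score(seq)
        report.append({"optimizer": opt, "oracle": orc, "gap": opt - orc})
    return report


# Toy stand-ins (assumptions for illustration only): the "optimizer" model
# rewards every 'A', while the independent "oracle" saturates at 3.
def optimizer_score(seq: str) -> float:
    return float(seq.count("A"))


def oracle_score(seq: str) -> float:
    return float(min(seq.count("A"), 3))


candidates = ["AAG", "AAAAA"]
report = evaluate_with_oracle(candidates, optimizer_score, oracle_score)
for seq, row in zip(candidates, report):
    print(seq, row)
```

An independent oracle reduces the optimism that comes from evaluating sequences with the model that selected them, but both scores remain in-silico predictions.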