datamol-io / safe

A single model for all your molecular design tasks
https://safe-docs.datamol.io/
Apache License 2.0
86 stars 9 forks

Inquiry Regarding Reverse Molecular Design and Comparison of Models #20

Closed YanChen32 closed 11 months ago

YanChen32 commented 11 months ago

Hi Emmanuel Noutahi,

I trust this message finds you well. I recently came across your article on the impressive performance of the representation SAFE in reverse molecular design. I have a few questions and would greatly appreciate your insights.

Firstly, in your comparison of the performance of different large pretrained models on molecules, I noticed the absence of MOLGPT, which is known for its exceptional performance. Given MOLGPT's ability to conduct conditional generation on targeted fragments or properties, my first question is about the performance comparison between SAFE and MOLGPT (e.g., Table 2).

Secondly, could you shed some light on the comparison between SAFE and MOLGPT in terms of their capabilities for conditional generation on targeted fragments or properties?

Lastly, I am curious about the choice not to employ conditional generation, as seen in MOLGPT, and instead adopt Proximal Policy Optimization (PPO) for goal-directed generative tasks. Additionally, it appears that the PPO-related programs are not open-sourced. Could you provide some insights into the rationale behind this choice?

Thank you in advance for your time and consideration. I look forward to hearing from you soon.

Best,

Yan Chen

maclandrol commented 11 months ago

Hi @YanChen32,

Thanks for your interest in SAFE. Please find detailed answers below:

Firstly, in your comparison of the performance of different large pretrained models on molecules, I noticed the absence of MOLGPT, which is known for its exceptional performance. Given MOLGPT's ability to conduct conditional generation on targeted fragments or properties, my first question is about the performance comparison between SAFE and MOLGPT (e.g., Table 2).

MolGPT is a notable work and one of the few investigating scaffold-constrained generation, which we acknowledged in our paper. In our comparative analysis for pure de novo design (specifically in Table 2), MolGPT is listed as LigGPT.

In the initial paper, MolGPT was referred to as LigGPT by its authors (see https://chemrxiv.org/engage/api-gateway/chemrxiv/assets/orp/resource/item/60c7588e469df48597f456ae/original/lig-gpt-molecular-generation-using-a-transformer-decoder-model.pdf). It does seem that the authors updated their performance report on MOSES between the first ChemRxiv version and the version published in ACS (https://pubs.acs.org/doi/epdf/10.1021/acs.jcim.1c00600). Nevertheless, our results stand in terms of performance comparison. For your convenience, I have added a screenshot of Table 1 from the ACS version of MolGPT here:

[Screenshot: Table 1 from the ACS version of MolGPT]

Please note that a direct comparison between SAFE and MolGPT is challenging due to their differing training methods. SAFE-GPT is a purely unconditional generative model that leverages the SAFE representation to streamline fragment-based molecule design for a wide range of applications. MolGPT was trained with a conditional generation setup, and thus its generative capabilities cannot really be compared directly to those of pure generative models.

Secondly, could you shed some light on the comparison between SAFE and MOLGPT in terms of their capabilities for conditional generation on targeted fragments or properties?

First, I would like to highlight that it is possible to replace the SMILES strings used in MolGPT with SAFE strings and achieve the same setup. However, what we argue is that the SAFE representation inherently eliminates the need for scaffold conditioning to generate scaffold-aware molecules.
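To make the distinction concrete, here is a toy sketch (not the safe library API; all function names, tokens, and the example SAFE fragment are illustrative) of how the two approaches differ at prompting time: a MolGPT-style model needs an explicit conditioning block, whereas an unconditional SAFE model can simply continue the scaffold's own SAFE string after a fragment separator.

```python
# Illustrative sketch only: contrasts scaffold conditioning (MolGPT-style)
# with scaffold prompting via the SAFE representation. The token names and
# the SAFE fragment below are hypothetical, chosen for demonstration.

def molgpt_style_prompt(scaffold_smiles: str) -> str:
    # MolGPT-style conditioning: the scaffold is injected through a
    # dedicated conditioning block that the model was trained to expect.
    return f"<scaffold>{scaffold_smiles}</scaffold><sos>"

def safe_style_prompt(scaffold_safe: str) -> str:
    # SAFE-style prompting: SAFE strings are dot-separated fragment
    # blocks, so an unconditional model can continue generating new
    # fragments right after the scaffold -- no conditioning mechanism.
    return scaffold_safe + "."

print(molgpt_style_prompt("c1ccccc1"))
print(safe_style_prompt("c1ccc2ccccc2c1"))  # hypothetical SAFE fragment
```

The point of the sketch is that scaffold constraints become an ordinary prefix-completion problem under SAFE, rather than a property the model must be explicitly conditioned on.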

Furthermore, unlike SAFE-GPT:

However, unlike MolGPT:

Integrating property-conditioned generation with SAFE-GPT could create a model proficient in both property- and structure-conditioned generation.

Lastly, I am curious about the choice not to employ conditional generation, as seen in MOLGPT, and instead adopt Proximal Policy Optimization (PPO) for goal-directed generative tasks. Additionally, it appears that the PPO-related programs are not open-sourced. Could you provide some insights into the rationale behind this choice?

Conditional generation, despite its merits, has inherent limitations, especially in terms of downstream applicability. Pre-training models on parameters like QED and SAS (heuristics with limited use in drug discovery) can narrow their broader applicability. Therefore, we prioritized unconditional pre-training, allowing users the flexibility to either fine-tune in a conditional setup or use goal-directed optimization algorithms.

PPO is just one example among many goal-directed optimization algorithms suitable for sequence-based models. You could consider any other RL algorithm (e.g., the one used in REINVENT), GFlowNets, greedy hill climbing, CMA-ES, or Bayesian optimization. We mainly chose PPO because the trl repo works out of the box with transformers models.
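As a minimal illustration of the simplest of those alternatives, here is a self-contained greedy hill-climbing sketch over toy strings. Everything in it is a stand-in: in a real setup the mutation step would be resampling part of a sequence from a generative model such as SAFE-GPT, and the reward would be a property predictor or docking score rather than the toy objective used here.

```python
# Toy goal-directed optimization via greedy hill climbing (stdlib only).
# All components are stand-ins for a real molecular design loop.
import random

def reward(s: str) -> float:
    # Stand-in objective: fraction of 'C' characters. A real reward would
    # be QED, a docking score, or a custom predictive model.
    return s.count("C") / len(s)

def mutate(s: str, alphabet: str = "CNO") -> str:
    # Stand-in for resampling part of a sequence from a generative model.
    i = random.randrange(len(s))
    return s[:i] + random.choice(alphabet) + s[i + 1:]

def hill_climb(seed: str, steps: int = 200) -> str:
    # Keep a candidate only when it strictly improves the reward.
    random.seed(0)  # fixed seed for reproducibility of the toy run
    best, best_r = seed, reward(seed)
    for _ in range(steps):
        cand = mutate(best)
        r = reward(cand)
        if r > best_r:
            best, best_r = cand, r
    return best

print(hill_climb("NNNNNNNN"))  # drifts toward 'C'-rich strings
```

The same accept-if-better loop structure underlies fancier optimizers; PPO, GFlowNets, and CMA-ES differ mainly in how candidates are proposed and how reward feedback is propagated back into the proposal distribution.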

Our primary focus in this repository is on the representation and the general-purpose SAFE-GPT model. There is a wide range of packages available for optimizing transformer-based sequence models, but I would be happy to share our PPO setup. As a side note, I'm developing a comprehensive, open-source package for molecular design. This package will feature various optimization algorithms, requiring users only to plug in their predictive models and select a base generative model. Stay tuned.

Emmanuel

YanChen32 commented 11 months ago

Hi @maclandrol. I appreciate your clarification and detailed response. I'm particularly excited about the prospects of your integrated package featuring optimization algorithms, and I eagerly await its release.