MolecularAI / ReinventCommunity

MIT License

Project workflows #29

Closed WoutVanEynde closed 1 year ago

WoutVanEynde commented 2 years ago

Hello,

I had some questions regarding setting up my own project using the Reinforcement_Learning notebook:

I hope these questions are clear and not too much of a problem!

Thank you for your time and warm regards, Wout Van Eynde

GuoJeff commented 2 years ago

Hi @WoutVanEynde,

  1. The Model_Building_Demo notebook goes through an example of constructing your own QSAR model. The Create_Model_Demo notebook creates a new prior/agent.
  2. Yes, the Create_Model_Demo creates the initial prior/agent generative model by generating its vocabulary based on the input training data. The vocabulary dictates what tokens the model is capable of proposing which directly controls the possible atom types in the output SMILES. The Transfer_Learning notebook trains the generative model and the final output from this notebook is what you can use in the "prior" and "agent" fields in the reinforcement learning configuration JSON.
  3. If you want to train a QSAR model for your target, you should use the Model_Building_Demo notebook. The SMILES in the notebook are transformed into fingerprints, which act as the input. The output in that specific notebook is binary activity (0 if inactive, 1 if active). If your target doesn't have any known actives, you may want to evaluate whether training a QSAR model is a good way forward, as you need to trust that the model's output at the very least correlates with what you are trying to achieve (for example, activity). Alternatively, you could train a QSAR model to output "predicted activity" measured via docking scores, for example, where a "better" docking score is considered "more active". While possible, this assumes that "better" docking scores do indeed mean "more active", which may very well not be the case. I would advise against this, but wanted to show that training such a QSAR model is still possible if you wanted to.
  4. Create_Model creates the vocabulary of the base generative model (prior/agent). Model_Building builds QSAR models that can be used in the Scoring Function of REINVENT.
  5. In the case of no known actives:
    • you could generate molecule ideas completely de novo by running REINVENT with reinforcement learning (optimizing for certain molecular properties you would want to see, such as predicted solubility). The output results could then be analyzed and put through more expensive physics-based approximations of binding affinity/molecular dynamics simulations to gain a better prediction of whether those molecules do indeed bind.
    In the case of known actives:
    • you could run REINVENT with reinforcement learning using a QSAR model trained on the actives. Alternatively, you could validate a docking protocol (check whether the docking score can distinguish actives from inactives). If it can, you could use docking as an approximation to activity and have REINVENT optimize docking scores.
  6. For creating your own generative model from scratch, I would recommend a training set on the order of 100,000 SMILES or more. For a QSAR model, even a few hundred SMILES (with known activity) is possible, but generally, the more data, the better.
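To make point 2 above concrete, here is a minimal sketch of how a vocabulary can be derived from training SMILES: every distinct token seen in the training set becomes part of the vocabulary, and the model can only ever emit those tokens. The regex below is a simplified assumption modeled on common SMILES tokenizers, not REINVENT's actual implementation (which handles additional cases such as %-ring closures).

```python
import re

# Simplified SMILES tokenizer: bracket atoms as one token, two-letter
# halogens, then everything else character by character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)

def build_vocabulary(training_smiles):
    # The vocabulary is the set of all tokens in the training data; this
    # is why the training set directly controls which atom types can
    # appear in the generated SMILES.
    vocab = set()
    for smi in training_smiles:
        vocab.update(tokenize(smi))
    return sorted(vocab)

vocab = build_vocabulary(["CCO", "c1ccccc1", "CC(=O)N[C@@H](C)Cl"])
```

Because "Br" never occurs in this toy training set, a model built on this vocabulary could never propose a brominated molecule.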
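The "prior" and "agent" fields mentioned in point 2 live in the reinforcement learning configuration JSON. A minimal sketch follows; only "prior" and "agent" are field names confirmed by the discussion above, and the surrounding structure is illustrative, so take the exact schema from the Reinforcement_Learning demo notebook.

```python
import json

# Illustrative RL configuration skeleton. "prior" and "agent" point at
# the generative model files (e.g. the pre-trained ChEMBL model, or the
# output of the Transfer_Learning notebook). Other keys are assumptions.
configuration = {
    "run_type": "reinforcement_learning",
    "parameters": {
        "reinforcement_learning": {
            "prior": "models/random.prior.new",
            "agent": "models/random.prior.new",
            "n_steps": 1000,
        },
    },
}

config_json = json.dumps(configuration, indent=4)
```

At the start of a run, the agent is typically initialized from the same file as the prior and then drifts away from it as the scoring function rewards better molecules.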

I hope this is helpful and thank you for your interest in REINVENT.

WoutVanEynde commented 2 years ago

Hello,

First of all thanks a lot! I don't have any experience in the field of AI nor in programming, so this helped me to really solve the last pieces of the puzzle!

Just some last questions, if that is alright:

I hope I do not bother you too much with these questions!

Warm regards and thanks in advance, Wout Van Eynde, student KU Leuven

GuoJeff commented 2 years ago

Hi @WoutVanEynde,

  1. A pre-trained generative model is provided in this repository in models/random.prior.new. It can be used as is in the "prior" and "agent" fields in the REINVENT configuration JSON. It is trained on ChEMBL, which is an open-source database for biologically active molecules. More information can be found on their website: https://www.ebi.ac.uk/chembl/

  2. Docking scores can be directly optimized in REINVENT using DockStream, which is a wrapper around various docking algorithms. Using DockStream as a component of the scoring function in REINVENT allows you to directly optimize docking scores. There is a tutorial notebook in this repository, Reinforcement_Learning_Demo_DockStream, which goes over the bare minimum needed to set up the DockStream component in the REINVENT scoring function. The docking algorithm used there is Glide, which is licensed by Schrodinger, so you would need a license to use it. However, DockStream supports a total of 5 docking backends: AutoDock Vina, rDock, GOLD, Glide, and OpenEye Hybrid. AutoDock Vina and rDock are open-source docking software. For information on how to set up the configuration, see the DockStream and DockStreamCommunity repositories, which are part of the MolecularAI group. The latter has tutorial notebooks just like this repository.
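As a rough illustration of what such a scoring-function component looks like, here is a sketch of a DockStream entry. The field names are assumptions modeled on the Reinforcement_Learning_Demo_DockStream notebook, and the paths are placeholders; verify both against that notebook before use.

```python
# Illustrative DockStream component for the REINVENT scoring function.
# All keys and paths below are assumptions / placeholders -- check the
# Reinforcement_Learning_Demo_DockStream notebook for the real schema.
dockstream_component = {
    "component_type": "dockstream",
    "name": "docking_score",
    "weight": 1,
    "specific_parameters": {
        # DockStream's own configuration (receptor, backend, etc.)
        "configuration_path": "path/to/dockstream_config.json",
        "docker_script_path": "path/to/DockStream/docker.py",
        "environment_path": "path/to/conda/envs/DockStream/bin/python",
        # Raw docking scores must be mapped into [0, 1]; more negative
        # scores (better binding) should map toward 1.
        "transformation": {
            "transformation_type": "reverse_sigmoid",
            "low": -12,
            "high": -6,
            "k": 0.25,
        },
    },
}
```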

Once you have chosen and set up a docking protocol, you will need to choose a score transformation. Every component in REINVENT is transformed to a score in [0, 1], so raw docking scores will also need to be transformed: see the Score_Transformations notebook in this repository for details.
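For docking, where a more negative score means better predicted binding, a reverse sigmoid is a common choice of transformation. Below is a minimal sketch; the midpoint and steepness values are illustrative, not REINVENT's defaults, so tune them per the Score_Transformations notebook.

```python
import math

def reverse_sigmoid(score, midpoint=-9.0, k=1.0):
    """Map a raw docking score onto [0, 1], rewarding more negative scores.

    Scores far below `midpoint` approach 1, scores far above it approach 0,
    and `score == midpoint` gives exactly 0.5. `k` sets the steepness.
    The parameter values here are illustrative only.
    """
    return 1.0 / (1.0 + math.exp(k * (score - midpoint)))
```

With these illustrative parameters, a score of -12 maps to about 0.95 and a score of -6 to about 0.05, so the reinforcement learning agent is rewarded for proposing molecules with better (more negative) docking scores.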

Finally, running docking directly in REINVENT will increase computation time, as every single SMILES proposed at every single epoch needs to be run through the docking algorithm. How long it takes will depend on which docking algorithm you use.

Let me know if this helps you set-up your experiment.

WoutVanEynde commented 2 years ago

Dear

This has helped me a lot! I cannot express my gratitude enough! I will try to set up a workflow the coming days, but I think I should be fine now!

Warm regards, Wout Van Eynde

xuzhang5788 commented 2 years ago

@WoutVanEynde If possible, please list your workflow here for your new project. I think that it will help the other users a lot. Many thanks.

fangffRS commented 2 years ago

Hi,

For the pre-trained generative model provided in this repository in models/random.prior.new, could you please provide more details on the training process/protocol of this ChEMBL prior model? I only found the data preparation of the ChEMBL dataset in your related papers, but no information about the training process: was it something like first creating an empty model from the purged dataset and then running transfer learning on that empty model with the same purged dataset? Also, what parameter settings were used to obtain this random.prior.new model? I would very much appreciate it if you could provide more details!