Project workflows - Githubissues

WoutVanEynde commented 2 years ago

Hello,

I had some questions regarding setting up my own project using the Reinforcement_Learning notebook:

1. The scoring function makes use of the Aurora kinase model in the demo. If I would like to change the target, do I create my own model using the [Create_Model_Demo.ipynb] notebook?
2. Are the prior and agent always trained using the transfer learning notebook, starting from the model created by the [Create_Model_Demo.ipynb] notebook? If not, what do I use?
3. The [Create_Model_Demo.ipynb] notebook uses various SMILES to create the model. If I were to create a model for a specific target, do I use SMILES from known actives as input? And if the target doesn't have any known actives, how do I proceed?
4. What is the difference between the [Create_Model_Demo.ipynb] notebook and the [Model_Building_Demo.ipynb] notebook?
5. In general, which notebooks should be used, and in what order, to get the best out of REINVENT if I were to start a project for a target with no known actives, or for one with known actives?
6. Roughly how big should the SMILES dataset be?

I hope these questions are clear and not too much of a problem!

Thank you for your time and warm regards, Wout Van Eynde

GuoJeff commented 2 years ago

Hi @WoutVanEynde,

1. The Model_Building_Demo notebook goes through an example of constructing your own QSAR model. The Create_Model_Demo notebook creates a new prior/agent.
2. Yes. Create_Model_Demo creates the initial prior/agent generative model by generating its vocabulary from the input training data. The vocabulary dictates which tokens the model is capable of proposing, which directly controls the possible atom types in the output SMILES. The Transfer_Learning notebook then trains the generative model, and the final output from this notebook is what you can use in the "prior" and "agent" fields in the reinforcement learning configuration JSON.
3. If you want to train a QSAR model for your target, you should use the Model_Building_Demo notebook. The SMILES in the notebook are transformed into fingerprints, which act as the input. The output in that specific notebook is binary activity (0 if inactive and 1 if active). If your target doesn't have any known actives, you may want to evaluate whether training a QSAR model is a good way forward, as you need to trust that the model's output, at the very least, correlates with what you are trying to achieve (for example, activity). Alternatively, you could train a QSAR model to output "predicted activity" measured via docking scores, for example, where a "better" docking score is considered to be "more active". While possible, this assumes that "better" docking scores do indeed mean "more active", which may very well not be the case. I would advise against this, but wanted to show that it is still possible to train a QSAR model if you wanted to.
4. Create_Model creates the vocabulary of the base generative model (prior/agent). Model_Building builds QSAR models that can be used in the Scoring Function of REINVENT.
5. In the case of no known actives:
   - you could generate molecule ideas completely de novo by running REINVENT with reinforcement learning (and optimizing for certain molecular properties you would want to see, such as predicted solubility). The output results could then be analyzed and put through more expensive physics-based approximations of binding affinity or molecular dynamics simulations to gain a better prediction of whether those molecules do indeed bind.
   In the case of known actives:
   - you could run REINVENT with reinforcement learning with a trained QSAR model based on the actives. Alternatively, you could validate a docking protocol (see if the docking score can distinguish between actives and inactives). If it can, you could use docking as an approximation to activity and have REINVENT optimize docking scores.
6. For creating your own generative model from scratch, I would recommend a dataset on the order of 100,000 SMILES or larger. If training a QSAR model, even a few hundred SMILES (with known activity) is possible, but generally, the more data, the better.
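
On validating a docking protocol in the known-actives case (checking whether the docking score can distinguish between actives and inactives): one common way to quantify this is a ROC AUC over the two sets of scores. The snippet below is only a minimal sketch and is not part of REINVENT. The function name `docking_auc` and the scores are made up for illustration, and it assumes that a lower (more negative) docking score is better, which depends on the docking program you use.

```python
from typing import Sequence


def docking_auc(active_scores: Sequence[float],
                inactive_scores: Sequence[float]) -> float:
    """ROC AUC for separating actives from inactives by docking score.

    Assumption: lower (more negative) scores are better. The value is the
    fraction of (active, inactive) pairs in which the active has the better
    score, with ties counted as half.
    """
    if not active_scores or not inactive_scores:
        raise ValueError("need at least one active and one inactive score")

    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a < i:        # active docks better than inactive
                wins += 1.0
            elif a == i:     # tie: count as half a win
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Hypothetical docking scores, for illustration only.
actives = [-9.1, -8.7, -8.2, -7.9]
inactives = [-8.0, -7.1, -6.5, -6.0]

auc = docking_auc(actives, inactives)
print(f"ROC AUC = {auc:.3f}")
```

An AUC close to 0.5 means the docking score does no better than chance at ranking actives above inactives, while a value close to 1.0 means the actives consistently score better. What counts as "good enough" to use docking as a proxy for activity is a judgment call for your target.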

I hope this is helpful and thank you for your interest in REINVENT.

WoutVanEynde commented 2 years ago

The continuation of these comments can be found in the ReinventCommunity repository (https://github.com/MolecularAI/ReinventCommunity/issues/29).

MolecularAI / Reinvent

Project workflows #42