MolecularAI / REINVENT4

AI molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.
Apache License 2.0

Error Encountered While Running Reinvent_TLRL.ipynb in REINVENT4 #88

Closed kingljy0818 closed 3 months ago

kingljy0818 commented 4 months ago

Hi,

I have correctly installed REINVENT4 and generated the Reinvent_TLRL.ipynb file in the notebook directory using the jupytext command. When running the cell in Reinvent_TLRL.ipynb:

%%time
!reinvent -l stage1.log $stage1_config_filename

the following error message appears:


Traceback (most recent call last):
  File "/home/Anaconda3/envs/reinvent4/bin/reinvent", line 8, in <module>
    sys.exit(main())
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/Reinvent.py", line 302, in main
    runner(input_config, actual_device, tb_logdir, responder_config)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/run_staged_learning.py", line 248, in run_staged_learning
    adapter, _, model_type = create_adapter(prior_model_filename, "inference", device)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/create_adapter.py", line 49, in create_adapter
    compatibility_setup(model)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/create_adapter.py", line 120, in compatibility_setup
    from reinvent.models.mol2mol.models.vocabulary import Vocabulary
ImportError: cannot import name 'Vocabulary' from 'reinvent.models.mol2mol.models.vocabulary' (/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/models/mol2mol/models/vocabulary.py)

I need your help to resolve this issue. Thank you very much!

Best regards,

Jiyuan

halx commented 4 months ago

Most likely reinvent/models/mol2mol/models/vocabulary.py is not a symlink as it should be, and the content of the file is just the path to the actual file. Either copy that file over or replace the contents with

from reinvent.models.transformer.core.vocabulary import Vocabulary
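
A minimal sketch of that check-and-repair step, assuming the broken file is a regular text file whose only content is a path rather than Python code. The helper name `repair_vocabulary_stub` and the heuristic for spotting such a stub are illustrative and not part of REINVENT4; only the import line comes from the reply above.

```python
from pathlib import Path

# Import line suggested in the reply above.
IMPORT_LINE = "from reinvent.models.transformer.core.vocabulary import Vocabulary\n"


def repair_vocabulary_stub(path: Path) -> bool:
    """Replace a path-only stub file with the re-export line.

    Hypothetical helper. Returns True if the file was rewritten, False if it
    was left alone. A "stub" is assumed to be a single line that contains a
    slash and no import statement, i.e. a bare path.
    """
    if path.is_symlink():
        return False  # a working symlink needs no repair
    text = path.read_text().strip()
    looks_like_path = "\n" not in text and "import" not in text and "/" in text
    if not looks_like_path:
        return False  # looks like real Python code, leave it untouched
    path.write_text(IMPORT_LINE)
    return True
```

It would be pointed at the vocabulary.py file named in the traceback; inspecting the file by hand first is still advisable.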

kingljy0818 commented 4 months ago

Thank you very much for your response. The previous errors have been resolved, but when I run %%time !reinvent -l stage1.log $stage1_config_filename, a new message appears:

Failed to find the pandas get_adjustment() function to patch
Failed to patch pandas - PandasTools will have limited functionality

However, the above error message was resolved by running conda install -c conda-forge rdkit pandas.

halx commented 4 months ago

That is only a warning message due to RDKit not being able to cope with newer Pandas versions (I believe versions 2.0 and above). Unless you use PandasTools there should be no impact.

kingljy0818 commented 4 months ago

Hi,

While continuing to debug Reinvent_TLRL.ipynb, I encountered the following error when running the cell in the notebook:

%%time
!reinvent -l stage2.log $stage2_config_filename

The error message is:

Traceback (most recent call last):
  File "/home/Anaconda3/envs/reinvent4/bin/reinvent", line 8, in <module>
    sys.exit(main())
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/Reinvent.py", line 302, in main
    runner(input_config, actual_device, tb_logdir, responder_config)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/run_staged_learning.py", line 322, in run_staged_learning
    packages = create_packages(reward_strategy, stages)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/run_staged_learning.py", line 178, in create_packages
    scoring_function = Scorer(scoring_config)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/scoring/scorer.py", line 41, in __init__
    self.components = get_components(config["component"])
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/scoring/config.py", line 94, in get_components
    component = Component(component_params)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent_plugins/components/comp_chemprop.py", line 85, in __init__
    chemprop_args = chemprop.args.PredictArgs().parse_args(args)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/tap/tap.py", line 478, in parse_args
    self.process_args()
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/chemprop/args.py", line 796, in process_args
    super(PredictArgs, self).process_args()
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/chemprop/args.py", line 190, in process_args
    self.checkpoint_paths = get_checkpoint_paths(
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/chemprop/args.py", line 58, in get_checkpoint_paths
    raise ValueError(f'Failed to find any checkpoints with extension "{ext}" in directory "{checkpoint_dir}"')
ValueError: Failed to find any checkpoints with extension ".pt" in directory "/tmp/R4_notebooks_output/chemprop"
CPU times: user 60.2 ms, sys: 21.9 ms, total: 82.1 ms
Wall time: 5.48 s

I still need your guidance and help to resolve this error message. Thank you very much.

halx commented 4 months ago

You will need to copy the model file into that directory (see error message). You can find the download link for the file in the notebook.
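
Before re-running the cell, it may help to confirm that the file has landed where it is expected. The following is a rough sketch only: `find_checkpoints` is a hypothetical helper, not a ChemProp or REINVENT4 function, and searching subdirectories recursively is an assumption about how the check behaves. The directory and the ".pt" extension are taken from the error message.

```python
from pathlib import Path


def find_checkpoints(checkpoint_dir: str, ext: str = ".pt") -> list[Path]:
    """Return all files ending in ext below checkpoint_dir, sorted by path."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return []  # a missing directory simply yields no checkpoints
    return sorted(p for p in root.rglob(f"*{ext}") if p.is_file())


if __name__ == "__main__":
    # Directory as reported in the ValueError above.
    found = find_checkpoints("/tmp/R4_notebooks_output/chemprop")
    print(found if found else "no .pt checkpoint found; copy the model file here first")
```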

kingljy0818 commented 4 months ago

Hi,

Thank you very much for your guidance. I have successfully run through every cell of both Reinvent_demo.ipynb and Reinvent_TLRL.ipynb, but I still have some conceptual questions. If I want to generate a Prior model based on Chembl33, in the stage1.toml file of Reinvent_demo.ipynb, the code is as follows:

run_type = "staged_learning"
device = "cuda:0"
tb_logdir = "tb_stage1"
json_out_config = "_stage1.json"

[parameters]

prior_file = "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/../priors/reinvent.prior"
agent_file = "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/../priors/reinvent.prior"
summary_csv_prefix = "stage1"

batch_size = 100

use_checkpoint = false

[learning_strategy]

type = "dap"
sigma = 128
rate = 0.0001

[[stage]]

max_score = 1.0
max_steps = 300

chkpt_file = 'stage1.chkpt'

scoring_function.type = "custom_product"

[stage.scoring]
type = "geometric_mean"

[[stage.scoring.component]]
[stage.scoring.component.custom_alerts]

[[stage.scoring.component.custom_alerts.endpoint]]
name = "Alerts"

params.smarts = [
    "[*;r8]",
    "[*;r9]",
    "[*;r10]",
    "[*;r11]",
    "[*;r12]",
    "[*;r13]",
    "[*;r14]",
    "[*;r15]",
    "[*;r16]",
    "[*;r17]",
    "[#8][#8]",
    "[#6;+]",
    "[#16][#16]",
    "[#7;!n][S;!$(S(=O)=O)]",
    "[#7;!n][#7;!n]",
    "C#C",
    "C(=[O,S])[O,S]",
    "[#7;!n][C;!$(C(=[O,N])[N,O])][#16;!s]",
    "[#7;!n][C;!$(C(=[O,N])[N,O])][#7;!n]",
    "[#7;!n][C;!$(C(=[O,N])[N,O])][#8;!o]",
    "[#8;!o][C;!$(C(=[O,N])[N,O])][#16;!s]",
    "[#8;!o][C;!$(C(=[O,N])[N,O])][#8;!o]",
    "[#16;!s][C;!$(C(=[O,N])[N,O])][#16;!s]"
]

[[stage.scoring.component]]
[stage.scoring.component.QED]

[[stage.scoring.component.QED.endpoint]]
name = "QED"
weight = 0.6

[[stage.scoring.component]]
[stage.scoring.component.NumAtomStereoCenters]

[[stage.scoring.component.NumAtomStereoCenters.endpoint]]
name = "Stereo"
weight = 0.4

transform.type = "left_step"
transform.low = 0

How should I modify it? In other words, I don't quite understand how prior models such as reinvent.prior in Reinvent_demo.ipynb and Reinvent_TLRL.ipynb are obtained. I look forward to your reply. Thank you very much!

halx commented 4 months ago

To train a new prior you would need to look into reinvent/runmodes/create_model/create_reinvent.py. This creates an "empty" model with a pre-defined vocabulary. To actually train the model you would need to carry out transfer learning (TL) with your dataset, and I recommend creating a validation set. I would also suggest having a look at "Randomized SMILES strings improve the quality of molecular generative models" to understand how prior models can be improved with augmentation. Please note that data preparation is your responsibility, as there is currently not much in place for that.

You probably also want to carefully consider why you need a new prior, as it takes quite a bit of expertise to get this right. Chemical space coverage in ChEMBL has probably not evolved that much, but if you want to support additional chemistry (the vocabulary is fixed), for example, or want to support stereochemistry (but beware of imbalanced data), then the current priors are limited in this respect.
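
As a rough illustration of the validation-set suggestion, the sketch below splits a list of SMILES strings into training and validation subsets. It is not part of REINVENT4: the function name `split_smiles`, the 10% default fraction and the seed are all assumptions chosen for the example. Deduplication here is purely string-based, so real data preparation would still need canonicalization and filtering with a cheminformatics toolkit.

```python
import random


def split_smiles(smiles: list[str], valid_fraction: float = 0.1, seed: int = 42):
    """Deduplicate and shuffle SMILES, then return (train, validation) lists."""
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be between 0 and 1")
    # Drop blank entries and exact string duplicates; sort for reproducibility.
    unique = sorted({s.strip() for s in smiles if s.strip()})
    random.Random(seed).shuffle(unique)
    # Keep at least one entry for validation.
    n_valid = max(1, int(len(unique) * valid_fraction))
    return unique[n_valid:], unique[:n_valid]
```

The two lists could then be written to separate files, one SMILES per line, for use as the training and validation inputs of the TL run.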

kingljy0818 commented 4 months ago

I have found reinvent/runmodes/create_model/create_reinvent.py, but I still don't know how to create an empty model. In REINVENT 3.2, there was a Create_Model_Demo.ipynb notebook that could be used to create an empty model with Chembl33. Could you please guide me on how to create an empty model with Chembl33 in REINVENT 4?

I see that there are many pre-existing prior models in the Prior directory of REINVENT4, such as reinvent.prior. How are these models trained? Can these pre-existing prior models be used directly? How should each of these prior models in the Prior directory be used respectively? Is there a detailed usage guide? I would appreciate your continued guidance. Thank you!

halx commented 4 months ago

I suggest reading our paper "Reinvent 4: Modern AI–driven generative molecule design" and the papers cited therein.

create_reinvent.py reads in a TOML configuration file. An example is in the same directory.

kingljy0818 commented 4 months ago

It's quite a coincidence. Before receiving your reply, I carefully read your paper "Reinvent 4: Modern AI-driven generative molecule design" published in the Journal of Cheminformatics this afternoon. I have a basic understanding of the logic and operation mechanism of REINVENT4. However, after reading this paper, there are still a few questions that need your guidance:

  1. There are many pre-trained prior models in the Prior folder of REINVENT4. Can these models be directly applied to my own drug development scenarios?

  2. How was the model.pt in the chemprop directory of Reinvent_TLRL.ipynb trained?

  3. There are many pre-trained prior models in the Prior folder of REINVENT4. Can they handle most drug development scenarios? Is it necessary for me to train my own prior model, for example, training a prior model based on a database like Chembl34 which has 2 million compound structures?

  4. There's one part in the parameters section of the stage1.toml file in Reinvent_TLRL.ipynb that I don't quite understand. Why are both prior_file and agent_file using reinvent.prior? Why isn't agent_file using an agent model?

  5. In Reinvent_TLRL.ipynb, the input_model_file in the transfer_learning file uses the checkpoint file stage1.chkpt generated by stage1.toml. I want to ask if stage1.chkpt is equivalent to a prior model?

I look forward to your answers to these questions. Thank you very much!

halx commented 3 months ago

  2. The model in the notebook was trained with the software ChemProp.

  3. You only need to train a new prior if you have specific needs in terms of supported chemistry.

  4. The prior_file serves as a reference for regularization in the loss function. The agent needs to start from somewhere, so its starting point is the prior. The prior does not change during RL but the agent does.

  5. A checkpoint file is simply the current state of the agent network model. It still has the same network hyperparameters but different, fine-tuned weights, biases, etc.
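
To make the role of the prior as a fixed reference more concrete, here is a toy sketch of a "difference between augmented and posterior" style loss for a single molecule. This follows the formulation published for REINVENT as I understand it and is not the REINVENT4 source code; the function name `dap_loss` and the example numbers are assumptions, while sigma = 128 is the value from the configuration quoted earlier in the thread.

```python
def dap_loss(prior_loglik: float, agent_loglik: float, score: float,
             sigma: float = 128.0) -> float:
    """Squared difference between augmented and agent log-likelihood.

    prior_loglik: log-likelihood of the molecule under the frozen prior
    agent_loglik: log-likelihood under the agent being trained
    score:        scoring-function output, assumed to lie in [0, 1]
    """
    # The prior term never changes during RL; only the agent is updated.
    augmented = prior_loglik + sigma * score
    return (augmented - agent_loglik) ** 2


# At the first step the agent is a copy of the prior, so both likelihoods agree
# and the loss is driven entirely by the score term.
print(dap_loss(prior_loglik=-30.0, agent_loglik=-30.0, score=0.5))  # 4096.0
```

With a score of 0 and identical prior and agent likelihoods the loss is 0, which is one way to see why the agent stays close to the prior unless the scoring function pulls it away.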

As for the other questions, you will need to get that basic knowledge from the literature, e.g. our paper. These things are not suitable for discussion in this forum.

kingljy0818 commented 3 months ago

In the Reinvent_TLRL.ipynb notebook, the model.pt is annotated with: "This is a model that has been trained on free energy simulation data computed for the TNKS2 target." I browsed through the ChemProp GitHub site and it seems that ChemProp does not have the capability to compute binding free energy. I'm having trouble understanding this annotation, and would appreciate further clarification.

halx commented 3 months ago

ChemProp is software that allows the user to create deep learning models. The data to train on comes from the user. The model provided is just an example.