Closed kingljy0818 closed 3 months ago
Most likely reinvent/models/mol2mol/models/vocabulary.py is not a symlink, as it should be, and the file instead contains the path to the actual file. Either copy that file over or replace the contents with
from reinvent.models.transformer.core.vocabulary import Vocabulary
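To check which case applies before editing anything, a small diagnostic along the following lines may help. This is only a sketch: the function name `diagnose_link_stub` and the heuristic for recognising a "path stub" (a single line with no spaces, ending in `.py`) are assumptions made for illustration and are not part of REINVENT.

```python
from pathlib import Path


def diagnose_link_stub(path):
    """Classify a file that is expected to be a symlink.

    Returns one of "symlink", "missing", "path-stub" or "regular".
    The "path-stub" heuristic is an assumption: a one-line file with no
    spaces whose content ends in ".py" is treated as a stored link target.
    """
    p = Path(path)
    if p.is_symlink():
        return "symlink"
    if not p.is_file():
        return "missing"
    text = p.read_text(encoding="utf-8", errors="replace").strip()
    if text and "\n" not in text and " " not in text and text.endswith(".py"):
        return "path-stub"
    return "regular"
```

Calling it on the `vocabulary.py` path from the error message and getting `"path-stub"` would suggest that the symlink was not preserved during checkout or installation.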
However, the above error message was resolved by running conda install -c conda-forge rdkit pandas.
That is only a warning message, caused by RDKit not being able to cope with newer Pandas versions (I believe versions 2.0 and above). Unless you use PandasTools there should be no impact.
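To see whether the installed Pandas falls into that range, the major version can be read from the package metadata. This is a minimal sketch using only the standard library; the helper names `major_version` and `installed_major` are made up for illustration, and the 2.0 threshold is the one guessed at above, not a confirmed boundary.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as "2.1.4"."""
    return int(version_string.split(".")[0])


def installed_major(package):
    """Return the major version of an installed package, or None if absent."""
    try:
        return major_version(version(package))
    except PackageNotFoundError:
        return None
```

For example, `installed_major("pandas")` returning 2 or higher would be consistent with the warning appearing.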
Hi,
I still need your guidance and help to resolve this error message. Thank you very much.
You will need to copy the model file into that directory (see error message). You can find the download link for the file in the notebook.
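Before starting a run, it can be worth confirming that the model files named in the configuration actually exist. The following is a sketch under assumptions: it reads `prior_file` and `agent_file` from the `[parameters]` table, as in the staged learning configuration quoted in this thread, and the function name `missing_model_files` is hypothetical. On Python versions before 3.11 it falls back to the third-party `tomli` package, which is assumed to be installed.

```python
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli is installed on older Pythons


def missing_model_files(config_path, keys=("prior_file", "agent_file")):
    """Return the configured model paths that do not exist on disk."""
    with open(config_path, "rb") as fh:
        config = tomllib.load(fh)
    params = config.get("parameters", {})
    return [
        params[key]
        for key in keys
        if key in params and not Path(params[key]).is_file()
    ]
```

An empty list means every configured model file was found; any path returned is one that still needs to be copied into place.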
Hi,
run_type = "staged_learning"
device = "cuda:0"
tb_logdir = "tb_stage1"
json_out_config = "_stage1.json"
[parameters]
prior_file = "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/../priors/reinvent.prior"
agent_file = "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/../priors/reinvent.prior"
summary_csv_prefix = "stage1"
batch_size = 100
use_checkpoint = false
[learning_strategy]
type = "dap"
sigma = 128
rate = 0.0001
[[stage]]
max_score = 1.0
max_steps = 300
chkpt_file = 'stage1.chkpt'
scoring_function.type = "custom_product"
[stage.scoring]
type = "geometric_mean"
[[stage.scoring.component]]
[stage.scoring.component.custom_alerts]
[[stage.scoring.component.custom_alerts.endpoint]]
name = "Alerts"
params.smarts = [ "[*;r8]", "[*;r9]", "[*;r10]", "[*;r11]", "[*;r12]", "[*;r13]", "[*;r14]", "[*;r15]", "[*;r16]", "[*;r17]", "[#8][#8]", "[#6;+]", "[#16][#16]", "[#7;!n][S;!$(S(=O)=O)]", "[#7;!n][#7;!n]", "C#C", "C(=[O,S])[O,S]", "[#7;!n][C;!$(C(=[O,N])[N,O])][#16;!s]", "[#7;!n][C;!$(C(=[O,N])[N,O])][#7;!n]", "[#7;!n][C;!$(C(=[O,N])[N,O])][#8;!o]", "[#8;!o][C;!$(C(=[O,N])[N,O])][#16;!s]", "[#8;!o][C;!$(C(=[O,N])[N,O])][#8;!o]", "[#16;!s][C;!$(C(=[O,N])[N,O])][#16;!s]" ]
[[stage.scoring.component]]
[stage.scoring.component.QED]
[[stage.scoring.component.QED.endpoint]]
name = "QED"
weight = 0.6
[[stage.scoring.component]]
[stage.scoring.component.NumAtomStereoCenters]
[[stage.scoring.component.NumAtomStereoCenters.endpoint]]
name = "Stereo"
weight = 0.4
How should I modify it? In other words, I don't quite understand how prior models such as reinvent.prior in Reinvent_demo.ipynb and Reinvent_TLRL.ipynb are obtained. I look forward to your reply. Thank you very much!
To create a new prior you would need to look into reinvent/runmodes/create_model/create_reinvent.py. This creates an "empty" model with a pre-defined vocabulary. To actually train the model you would need to carry out TL with your dataset, and I recommend creating a validation set. I would also suggest having a look at "Randomized SMILES strings improve the quality of molecular generative models" to understand how prior models can be improved upon with augmentation. Please note that data preparation is your responsibility, as there is currently not much in place for that.
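As an illustration of the validation-set advice, a dataset of SMILES strings could be split along these lines. This is a minimal sketch, not REINVENT's own tooling: the function name `split_smiles`, the 10% default fraction and the fixed seed are arbitrary choices, and duplicates are removed by exact string match only, so different SMILES for the same molecule are not merged (canonicalisation would be a separate data preparation step).

```python
import random


def split_smiles(smiles, validation_fraction=0.1, seed=0):
    """Split SMILES strings into (training, validation) lists.

    Blank entries and exact duplicate strings are dropped first, so the
    same string can never appear in both sets.
    """
    unique = sorted({s.strip() for s in smiles if s.strip()})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    if len(unique) > 1:
        n_valid = max(1, int(len(unique) * validation_fraction))
    else:
        n_valid = 0
    return unique[n_valid:], unique[:n_valid]
```

The two lists can then be written to separate files, one SMILES per line, for use as the training and validation inputs.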
You probably also want to consider carefully why you need a new prior, as it takes quite a bit of expertise to get this right. Chemical space coverage in ChEMBL has probably not evolved that much. However, the current priors are limited if, for example, you want to support additional chemistry (the vocabulary is fixed) or stereochemistry (but beware of imbalanced data).
I have found reinvent/runmodes/create_model/create_reinvent.py, but I still don't know how to create an empty model. In REINVENT 3.2, there was a Create_Model_Demo.ipynb notebook that could be used to create an empty model with Chembl33. Could you please guide me on how to create an empty model with Chembl33 in REINVENT 4?
I see that there are many pre-existing prior models in the Prior directory of REINVENT4, such as reinvent.prior. How are these models trained? Can these pre-existing prior models be used directly? How should each of these prior models in the Prior directory be used respectively? Is there a detailed usage guide? I would appreciate your continued guidance. Thank you!
I suggest reading our paper "Reinvent 4: Modern AI–driven generative molecule design" and the papers cited therein.
create_reinvent.py reads in a TOML configuration file. An example is in the same directory.
It's quite a coincidence. Before receiving your reply, I carefully read your paper "Reinvent 4: Modern AI-driven generative molecule design" published in the Journal of Cheminformatics this afternoon. I have a basic understanding of the logic and operation mechanism of REINVENT4. However, after reading this paper, there are still a few questions that need your guidance:
There are many pre-trained prior models in the Prior folder of REINVENT4. Can these models be directly applied to my own drug development scenarios?
How was the model.pt in the chemprop directory of Reinvent_TLRL.ipynb trained?
There are many pre-trained prior models in the Prior folder of REINVENT4. Can they handle most drug development scenarios? Is it necessary for me to train my own prior model, for example, training a prior model based on a database like Chembl34 which has 2 million compound structures?
There's one part in the parameters section of the stage1.toml file in Reinvent_TLRL.ipynb that I don't quite understand. Why are both prior_file and agent_file using reinvent.prior? Why isn't agent_file using an agent model?
In Reinvent_TLRL.ipynb, the input_model_file in the transfer_learning file uses the checkpoint file stage1.chkpt generated by stage1.toml. I want to ask if stage1.chkpt is equivalent to a prior model?
I look forward to your answers to these questions. Thank you very much!
3. You only need to train a new prior if you have specific needs in terms of supported chemistry.
As for the other questions, you will need to get that basic knowledge from the literature, e.g. our paper. These things are not suitable for discussion in this forum.
In the Reinvent_TLRL.ipynb notebook, the model.pt is annotated with: "This is a model that has been trained on free energy simulation data computed for the TNKS2 target." I browsed through the ChemProp GitHub site and it seems that ChemProp does not have the capability to compute binding free energy. I'm having trouble understanding this annotation, and would appreciate further clarification.
ChemProp is software that allows the user to create deep learning models. The data to train on comes from the user. The model provided is just an example.
Hi,
I have correctly installed REINVENT4 and generated the Reinvent_TLRL.ipynb file in the notebook directory using the jupytext command. When running the cell in Reinvent_TLRL.ipynb:
%%time
!reinvent -l stage1.log $stage1_config_filename
the following error message appears:
Traceback (most recent call last):
  File "/home/Anaconda3/envs/reinvent4/bin/reinvent", line 8, in <module>
    sys.exit(main())
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/Reinvent.py", line 302, in main
    runner(input_config, actual_device, tb_logdir, responder_config)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/run_staged_learning.py", line 248, in run_staged_learning
    adapter, _, model_type = create_adapter(prior_model_filename, "inference", device)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/create_adapter.py", line 49, in create_adapter
    compatibility_setup(model)
  File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/create_adapter.py", line 120, in compatibility_setup
    from reinvent.models.mol2mol.models.vocabulary import Vocabulary
ImportError: cannot import name 'Vocabulary' from 'reinvent.models.mol2mol.models.vocabulary' (/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/models/mol2mol/models/vocabulary.py)
I need your help to resolve this issue. Thank you very much!
Best regards,
Jiyuan