smiles could not be converted in libinvent

ichxw commented 3 days ago

Hello, I was trying to run LibInvent and it failed because of a SMILES conversion problem. Here is the relevant part of the TOML file; everything else is the same as in staged_learning.toml.

# REINVENT4 TOML input example for reinforcement/curriculum learning
#
#
# Curriculum learning in REINVENT4 is a multi-stage reinforcement learning
# run.  One or more stages (auto CL) can be defined.  But it is also
# possible to continue a run from any checkpoint file that is generated
# during the run (manual CL).  Currently checkpoints are written at the end
# of a run also when the run is forcefully terminated with Ctrl-C.

run_type = "staged_learning"
device = "cuda:0"  # set torch device e.g. "cpu"
tb_logdir = "tb_logs"  # name of the TensorBoard logging directory
json_out_config = "_staged_learning.json"  # write this TOML to JSON

[parameters]
# Uncomment one of the comment blocks below.  Each generator needs a model
# file and possibly a SMILES file with seed structures.  If the run is to
# be continued after termination, the agent_file would have to be replaced
# with the checkpoint file.

summary_csv_prefix = "staged_learning"  # prefix for the CSV file
use_checkpoint = false  # if true read diversity filter from agent_file
purge_memories = false  # if true purge all diversity filter memories after each stage

## Reinvent
#prior_file = "priors/reinvent.prior"
#agent_file = "priors/reinvent.prior"

## LibInvent
prior_file = "priors/libinvent.prior"
agent_file = "priors/libinvent.prior"
smiles_file = "scaffolds.smi"  # 1 scaffold per line with attachment points

After running it, I got the following error:

$ reinvent -l lib.log staged_learning_test_lib.toml
Traceback (most recent call last):
  File "/home/***/anaconda3/envs/reinvent4/bin/reinvent", line 8, in <module>
    sys.exit(main())
  File "/home/***/anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/Reinvent.py", line 334, in main
    runner(
  File "/home/***/anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/run_staged_learning.py", line 367, in run_staged_learning
    terminate = optimize(package.terminator)
  File "/home/***/anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/learning.py", line 125, in optimize
    scaffolds = self._state.diversity_filter.update_score(
  File "/home/***/anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/memories/identical_murcko_scaffold.py", line 17, in update_score
    return self.score_scaffolds(scores, smilies, mask, topological=False)
  File "/home/***/anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/memories/diversity_filter.py", line 84, in score_scaffolds
    smiles = smilies[i]
IndexError: list index out of range

Looking at the log file, it looks like there was a problem recognizing the scaffold SMILES structures:

16:19:42 <INFO> Creating scoring component QED
16:19:42 <INFO> Writing tabular data for stage to staged_learning_1.csv
16:19:42 <INFO> Starting stage 1 <<<
16:19:42 <INFO> Current GPU memory usage: 884 MiB used, 39562 MiB free
16:19:43 <WARN> reinvent_plugins.normalizers.rdkit_smiles: [*]Cc1c2cc(C[*])cnc2cnc1C1CN2CC(*)C1CC(OCc1c(-c3ccc(F)cc3)ccc1)C2O|*N(C)C1CCCNC1 could not be converted
16:19:43 <WARN> reinvent_plugins.normalizers.rdkit_smiles: n1cc(C[*])c2cc(C[*])ccc2c1c1c(*)c2c(c(-c3cc(C)c(OCc4nn(C)c(=O)c4CO)c(OC)cc3)cc2)cc1|*C could not be converted

The rdkit version I'm using is 2024.03.6. Any responses are appreciated. Thanks.
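
For triage, the failed strings can be pulled out of the log with a few lines of plain Python. This is only a sketch: the logger name and message text are copied from the warnings above, while the helper names `failed_smiles` and `describe` are made up for illustration and are not part of REINVENT4. It uses string operations only, so it says nothing about chemical validity.

```python
import re

# Matches the warning lines shown in the log above. The logger name and the
# message text come from that log; everything else here is hypothetical.
WARN_RE = re.compile(
    r"<WARN> reinvent_plugins\.normalizers\.rdkit_smiles: (\S+) could not be converted"
)


def failed_smiles(log_text):
    """Return every string the normalizer reported as not convertible."""
    return WARN_RE.findall(log_text)


def describe(smiles):
    """Summarise one failed string without any chemistry toolkit."""
    return {
        # number of pieces separated by "|"
        "n_parts": len(smiles.split("|")),
        # counts both "*" and "[*]" style attachment points
        "n_attachment_points": smiles.count("*"),
    }


def report(log_text):
    """Print a one-line summary per failed string."""
    for s in failed_smiles(log_text):
        print(describe(s), s)
```

Calling `report(open("lib.log").read())` would print one summary per warning, which makes it easier to see at a glance whether a failed string is made of several "|"-separated pieces.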

halx commented 2 days ago

Hi,

many thanks for your interest in REINVENT and welcome to the community!

I see two problems there:

1. You have a "|" in the SMILES but Libinvent requires a single scaffold as input.
2. Your scaffolds will not kekulize and you will have to fix the chemistry first.
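
Both points can be checked on a scaffold file before starting a run. The following is a minimal sketch, not REINVENT4 code: `check_scaffold` is a hypothetical helper, and the RDKit step is optional and only runs when RDKit is importable. It relies on `Chem.MolFromSmiles` returning `None` when a string cannot be parsed or sanitized, which includes kekulization failures.

```python
try:
    from rdkit import Chem  # optional; used only if RDKit is installed
except ImportError:
    Chem = None


def check_scaffold(line):
    """Return a list of problems found in one line of a scaffold file.

    Hypothetical helper for illustration; an empty list means no problem
    was detected by these simple checks.
    """
    smiles = line.strip()
    if not smiles:
        return ["empty line"]

    problems = []
    if "|" in smiles:
        problems.append("contains '|': expected a single scaffold per line")
    if "*" not in smiles:
        problems.append("no attachment point ('*') found")

    # Only ask RDKit when the string looks like a single scaffold.
    if Chem is not None and "|" not in smiles:
        if Chem.MolFromSmiles(smiles) is None:
            problems.append("RDKit could not parse or sanitize the SMILES")
    return problems
```

Running this over every line of the input file and printing the non-empty results should show quickly whether the file itself is at fault.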

Many thanks, Hannes.

ichxw commented 2 days ago

Hi Hannes, thank you for your quick response. I used the scaffolds.smi input file provided with the program. I ran the staged learning several times, and each run ended with the same error message. The output files varied, however: most of the time staged_learning_1.csv was empty, but occasionally it contained a few hundred SMILES lines. Below are the last few lines of the log file from a run where staged_learning_1.csv was not empty.

09:36:53 <INFO> Creating scoring component QED
09:36:53 <INFO> Writing tabular data for stage to staged_learning_1.csv
09:36:53 <INFO> Starting stage 1 <<<
09:36:53 <INFO> Current GPU memory usage: 884 MiB used, 39562 MiB free
09:36:54 <INFO> Score: 0.74 Agent NLL: 20.46 Valid: 100% Step: 1
 | Agent Prior Target Score SMILES SMILES_state Input_Scaffold R-groups Scaffold Molecular weight Molecular weight (raw) Unwanted SMARTS Unwanted SMARTS (raw)
 | 16.6641 14.4043 113.5311 0.9994959 CC(C)Cc1cncc2ccc(CN3CCCCC3)cc12 1 c12c(C[*])cncc1ccc(C[*])c2 *C(C)C|C1CCCN(*)C1 c1cc2cc(CN3CCCCC3)ccc2cn1 0.9994959 282.4310 1.0000000 1.0000
 | 31.1224 32.3684 95.4699 0.9987366 CCc1ccc2cncc(CCCON=C(N)c3ccc(-n4cccn4)cc3CC)c2c1 1 c12c(C[*])cncc1ccc(C[*])c2 *CCON=C(c1ccc(-n2cccn2)cc1CC)N|*C C(=NOCCCc1cncc2ccccc12)c1ccc(-n2cccn2)cc1 0.9987366 427.5520 1.0000000 1.0000
 | 33.0129 30.6990 -30.6990 0.0000000 O=C(O)CC(CCO)c1ccc(CCc2ccc3cncc(CCCO)c3c2)c(Cl)c1 1 c12c(C[*])cncc1ccc(C[*])c2 *CCO|c1c(Cl)c(C*)ccc1C(CC(O)=O)CCO c1ccc(CCc2ccc3cnccc3c2)cc1 0.0000000 0.0000 0.0000000 0.0000
 | 11.6166 12.1508 107.0825 0.9315100 CCCc1ccc2cncc(CN(C)C)c2c1 1 c12c(C[*])cncc1ccc(C[*])c2 *N(C)C|*CC c1ccc2cnccc2c1 0.9315100 228.3390 1.0000000 1.0000
 | 10.3642 8.7407 56.0993 0.5065621 CCc1ccc2cncc(CNC)c2c1 1 c12c(C[*])cncc1ccc(C[*])c2 *NC|*C c1ccc2cnccc2c1 0.5065621 200.2850 1.0000000 1.0000
 | 6.6187 6.0158 22.1438 0.2199967 CCc1ccc2cncc(CN)c2c1 1 c12c(C[*])cncc1ccc(C[*])c2 *N|*C c1ccc2cnccc2c1 0.2199967 186.2580 1.0000000 1.0000
 | 21.0663 20.5225 107.4715 0.9999537 COc1ccc(CN)cc1Cc1cncc2ccc(CO)cc12 1 c12c(C[*])cncc1ccc(C[*])c2 *c1c(OC)ccc(CN)c1|O* c1ccc(Cc2cncc3ccccc23)cc1 0.9999537 308.3810 1.0000000 1.0000
 | 18.7851 18.1979 109.8013 0.9999936 Clc1ccc(Cc2cncc3ccc(Cc4nnn[nH]4)cc23)cc1Cl 1 c12c(C[*])cncc1ccc(C[*])c2 *c1ccc(Cl)c(Cl)c1|n1nn[nH]c1* c1ccc(Cc2cncc3ccc(Cc4nnn[nH]4)cc23)cc1 0.9999936 370.2430 1.0000000 1.0000
 | 13.8535 13.2819 114.6822 0.9997200 ClCCc1cncc2ccc(CN3CCCCC3)cc12 1 c12c(C[*])cncc1ccc(C[*])c2 C(*)Cl|C1N(*)CCCC1 c1cc2cc(CN3CCCCC3)ccc2cn1 0.9997200 288.8220 1.0000000 1.0000
 | 44.2730 44.1825 -43.7477 0.0033969 CCCc1cncc2ccc(CNc3nc4c(n3C)-c3cc(-c5cc(OC)ccc5C)c(CO)cc3C(=O)NC4)cc12 1 c12c(C[*])cncc1ccc(C[*])c2 C(*)C|c12nc(N*)n(C)c1-c1cc(-c3c(C)ccc(OC)c3)c(CO)cc1C(=O)NC2 O=C1NCc2nc(NCc3ccc4cnccc4c3)[nH]c2-c2cc(-c3ccccc3)ccc21 0.0033969 561.6860 1.0000000 1.0000
09:36:55 <WARN> reinvent_plugins.normalizers.rdkit_smiles: c1(C[*])cncc2c1cc(C[*])cc2*NC(=O)N|c1(*)c(C)nc(NC(c2c(Cl)ccc(N3CCC(F)(F)CCC(N)=NO)c2F)=O)nc1C could not be converted

Here is another log, from a run that produced no output SMILES.

09:50:27 <INFO> Creating scoring component QED
09:50:27 <INFO> Writing tabular data for stage to staged_learning_1.csv
09:50:27 <INFO> Starting stage 1 <<<
09:50:27 <INFO> Current GPU memory usage: 884 MiB used, 39562 MiB free
09:50:28 <WARN> reinvent_plugins.normalizers.rdkit_smiles: [*]Cc1cc2c(C[*])cncc2cc1*N1CCN(c2cc3c(c(=O)n3c(-c4ccc(NC(C)=O)cc4)nc(OC)n3)cn2)CC1C|*C could not be converted
09:50:28 <WARN> reinvent_plugins.normalizers.rdkit_smiles: c1(C[*])cc2c(cncc2C[*])nc1*NCCSc1c2sc(=O)cc-2c(O)cc1O|C* could not be converted

As you can see, the problematic SMILES were generated by the program itself. Please let me know if something is wrong in my TOML configuration. The changes I made to REINVENT4/configs/toml/staged_learning.toml for LibInvent are listed earlier in this thread; the rest of the TOML file is exactly the same as the original. Thanks for your time.
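
One rough way to back this up is to compare the strings from the warnings with the lines of scaffolds.smi. The sketch below is an illustration with a made-up helper name (`split_by_origin`). It compares strings verbatim, so a generated string that is merely a different way of writing an input scaffold would still be counted as "generated".

```python
import re

# Captures the string between the logger prefix and the fixed message text
# seen in the warnings above.
WARN_RE = re.compile(r": (\S+) could not be converted")


def split_by_origin(log_text, scaffold_text):
    """Split failed strings into (found verbatim in the input, not found).

    Hypothetical helper; the first whitespace-separated token of each
    non-empty input line is taken as the scaffold SMILES.
    """
    inputs = {ln.split()[0] for ln in scaffold_text.splitlines() if ln.strip()}
    from_input, generated = [], []
    for s in WARN_RE.findall(log_text):
        (from_input if s in inputs else generated).append(s)
    return from_input, generated
```

If the first list comes back empty for a given log, none of the failed strings were taken literally from the input file.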

MolecularAI / REINVENT4

smiles could not be converted in libinvent #160