MolecularAI / REINVENT4

AI molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.
Apache License 2.0

Evaluation and Optimization Inquiry for Chembl34_filtered.prior Model Training Results in Reinvent4.3 #100

Open kingljy0818 opened 1 week ago

kingljy0818 commented 1 week ago

Hello,

After running the configuration below in REINVENT 4.3 and completing the training, we used TensorBoard to inspect the training curves:

- The validation loss stabilized from step 85 and remained stable until training ended at step 120.
- The training loss decreased continuously from start to finish and was consistently lower than the validation loss throughout.
- The sample loss decreased until step 40, increased from step 45, then decreased again. It was lower than the training loss from step 70 to step 100; after rising at step 95 it started decreasing again from step 105, and from step 115 to the end of training it remained below the training loss.

Based on these observations, has this training converged? Is the generated Chembl34_filtered.prior model reasonable, and is there still room for optimization? Can it be used for subsequent transfer learning? Should any parameters in the configuration below be adjusted?


run_type = "transfer_learning" device = "cuda:0" tb_logdir = "tb_TL" json_out_config = "json_transfer_learning.json"

[parameters] num_epochs = 120 save_every_n_epochs = 5 batch_size = 256 num_refs = 0 sample_batch_size = 512 learning_rate = 0.0005 seed = 42 gradient_clipping = 1.0

weight_decay = 0.01
dropout = 0.25

input_model_file = "Chembl34_filtered.model" smiles_file = "Chembl34_filtered_train.smi" output_model_file = "Chembl34_filtered.prior" validation_smiles_file = "Chembl34_filtered_validation_compounds.smi"

[parameters.lr_scheduler] type = "ReduceLROnPlateau" factor = 0.5 patience = 5
min_lr = 1e-6

log_interval = 10

[validation] smiles_file = "Chembl34_filtered_validation_compounds.smi"

start_time = "2024-06-05 00:00:00"


halx commented 1 week ago

Your config suggests to me that you have modified the code, and I can't comment on that. It would be better to show graphs. What we typically see is that REINVENT prior training reaches a minimum after 5-10 steps, with or without initial randomization. If you randomize SMILES every step, convergence will be much slower, which seems reasonable given that the input data changes in every step. You seem to use a dropout of 25% (?), which I would expect to have a similar effect.

Eventually you will need to assess the quality of the newly generated prior with key characteristics like validity, duplicates, novelty, chemical space coverage, etc. There are several benchmarks out there that should help. In practice, though, you would not use the prior directly but apply TL or RL for further refinement.
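For example, a first-pass check could look like this. This is a minimal sketch, not REINVENT tooling; 'sampled.smi' stands in for a file of SMILES sampled from your prior:

from rdkit import Chem

def canonicalize(smiles):
    """Return the canonical SMILES, or None if RDKit cannot parse the input."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

# 'sampled.smi' is a placeholder: SMILES sampled from the prior, one per line
with open('sampled.smi') as fh:
    sampled = [line.strip() for line in fh if line.strip()]

valid = [c for c in (canonicalize(s) for s in sampled) if c is not None]
unique = set(valid)

# Novelty: unique valid molecules not present in the training set
with open('Chembl34_filtered_train.smi') as fh:
    train = {canonicalize(line.strip()) for line in fh if line.strip()}
novel = unique - train

print(f"validity:   {len(valid) / len(sampled):.3f}")
print(f"uniqueness: {len(unique) / len(valid):.3f}")
print(f"novelty:    {len(novel) / len(unique):.3f}")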

kingljy0818 commented 1 week ago

[Attached graphs: A — Mean NLL loss; B — Fraction valid SMILES]

Thank you for your response. I have uploaded the Mean NLL loss (A) and Fraction valid SMILES (B) graphs. Based on the attached curves, do you think this training has converged? Is the generated Chembl34_filtered.prior model reasonable and suitable for subsequent transfer learning? Are there any parameters in the configuration (same as posted above) that could still be optimized?


run_type = "transfer_learning" device = "cuda:0" tb_logdir = "tb_TL" json_out_config = "json_transfer_learning.json"

[parameters] num_epochs = 120
save_every_n_epochs = 5 batch_size = 256 num_refs = 0 sample_batch_size = 512 learning_rate = 0.0005 seed = 42 gradient_clipping = 1.0

weight_decay = 0.01
dropout = 0.25

input_model_file = "Chembl34_filtered.model" smiles_file = "Chembl34_filtered_train.smi" output_model_file = "Chembl34_filtered.prior" validation_smiles_file = "Chembl34_filtered_validation_compounds.smi"

[parameters.lr_scheduler] type = "ReduceLROnPlateau" factor = 0.5 patience = 5
min_lr = 1e-6

log_interval = 10

[validation] smiles_file = "Chembl34_filtered_validation_compounds.smi"

start_time = "2024-06-05 00:00:00"


halx commented 1 week ago

Well, I can only restate what I have said already. You appear to have made changes to the code. It is unclear what you have done, what the motivation for it is, and what you expect to get from it.

The loss plots are merely a guideline, giving a rough idea of whether you are in an overfitting or underfitting regime. The practical problem is to find a reasonable balance between a model that is too input-like and one that does not capture the input chemical space well. So, as said before, you will need to define what you expect from the model and benchmark for this across various/all checkpoints of your TL run. I refer you to "Randomized SMILES strings improve the quality of molecular generative models" for more insight into prior training.

kingljy0818 commented 2 days ago

Hello, I have carefully read the article you suggested, "Randomized SMILES strings improve the quality of molecular generative models". If I want to train ChEMBL34 into a good prior model, is it better to convert its SMILES file with RDKit into restricted random atom order rather than unrestricted random atom order? The conclusion of the article states: "The randomized SMILES variant that gave the best results is the one that has restrictions, compared to the one that is able to generate all possible randomized SMILES for each molecule." Accordingly, when performing transfer learning based on the ChEMBL34 prior model, should the training set's SMILES file also be converted into restricted random atom order using RDKit?

BTW, I use the following code to convert the SMILES file of the ChEMBL34 database into restricted random atom order. Is the code written reasonably?


from rdkit import Chem
from concurrent.futures import ProcessPoolExecutor, as_completed

def randomize_smiles(smiles, num_variants=10):
    """Generate multiple randomized SMILES strings for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []

    randomized_smiles_list = []
    for _ in range(num_variants):
        randomized_smiles = Chem.MolToSmiles(mol, doRandom=True, canonical=False)
        randomized_smiles_list.append(randomized_smiles)

    return randomized_smiles_list

def main():
    # Read the SMILES strings from the ChEMBL34 database
    # (assuming a .smi file with one SMILES string per line)
    input_file = 'chembl34.smi'
    with open(input_file, 'r') as file:
        smiles_list = [line.strip() for line in file if line.strip()]

    # Generate multiple randomized SMILES using parallel processing
    randomized_smiles_list = []
    with ProcessPoolExecutor(max_workers=8) as executor:  # adjust to your CPU core count
        futures = {executor.submit(randomize_smiles, smiles): smiles for smiles in smiles_list}
        for future in as_completed(futures):
            randomized = future.result()
            if randomized:
                randomized_smiles_list.extend(randomized)

    # Output the randomized SMILES
    output_file = 'chembl34_randomized.smi'
    with open(output_file, 'w') as file:
        for randomized_smiles in randomized_smiles_list:
            file.write(f"{randomized_smiles}\n")

if __name__ == '__main__':  # guard required for ProcessPoolExecutor on spawn-based platforms
    main()


After running the above code, consider the first 10 lines of the generated SMILES file as an example. Do you think the randomized SMILES generated by this code follow the restricted random atom order?


c1c(CNCCNCc2ccccc2)cccc1
C(c1ccccc1)NCCNCc1ccccc1
c1ccccc1CNCCNCc1ccccc1
c1cc(CNCCNCc2ccccc2)ccc1
C(NCc1ccccc1)CNCc1ccccc1
c1cccc(CNCCNCc2ccccc2)c1
c1c(CNCCNCc2ccccc2)cccc1
c1cccc(c1)CNCCNCc1ccccc1
c1c(cccc1)CNCCNCc1ccccc1
c1ccc(CNCCNCc2ccccc2)cc1


Also, I have another question: ChEMBL34 has 1,953,573 SMILES. Is it necessary to generate 10 randomized versions of each?

I look forward to your answer. Thank you!

halx commented 1 day ago

I believe unrestricted SMILES would require direct code changes in RDKit. I am not aware that there is an option for that.

The paper is quite clear that restricted randomization should be used. This seems rather obvious, given that unrestricted SMILES can lead to deeply nested SMILES and would probably also allow a much larger number of possible SMILES per molecule. There is a reason why the RDKit developers implemented this optimization.
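For reference, my understanding is that the paper's restricted variant boils down to shuffling the atom numbering and then letting RDKit's standard (non-canonical) traversal serialize the molecule; a sketch:

import random
from rdkit import Chem

def restricted_randomized_smiles(smiles):
    """One restricted randomized SMILES: random atom numbering, default traversal."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    atom_order = list(range(mol.GetNumAtoms()))
    random.shuffle(atom_order)
    shuffled = Chem.RenumberAtoms(mol, atom_order)
    return Chem.MolToSmiles(shuffled, canonical=False)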

I believe our script has options for randomization, both for the initial epoch only and for every epoch. The latter is, naturally, much more expensive, but the paper shows that the former is sufficient to improve prior performance over non-randomized SMILES.

ChEMBL 34 has about 2.4 million SMILES, so I suppose your number is after filtering. How many randomized SMILES to generate is rather hard to answer. One could argue: all possible variants. But the number of variants will depend on the length of the SMILES.
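If you want a feel for how that count grows with molecule size, you can sample and deduplicate (a rough illustration only; the two input SMILES are arbitrary examples):

from rdkit import Chem

def count_variants(smiles, n_samples=1000):
    """Estimate the number of distinct randomized SMILES by sampling."""
    mol = Chem.MolFromSmiles(smiles)
    return len({Chem.MolToSmiles(mol, doRandom=True, canonical=False)
                for _ in range(n_samples)})

print(count_variants('CCO'))                     # small molecule: few variants
print(count_variants('c1ccccc1CNCCNCc1ccccc1'))  # larger molecule: many more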

kingljy0818 commented 21 hours ago

Thank you for your response. I have a few more questions to ask you. How do I use the randomization options in the REINVENT 4.3 script? When should I use initial epoch randomization, and when should I use every epoch randomization?

After filtering, my ChEMBL34 dataset has 1,953,573 SMILES. Using RDKit, I generated 10 restricted-random-atom-order SMILES for each entry, resulting in a new dataset with 19,535,730 SMILES. When using this new dataset to generate a prior model in REINVENT, do I need to enable the randomization options in the REINVENT script? I am eagerly awaiting your further guidance.

halx commented 12 hours ago

The paper discusses the impact of the two randomization strategies, so you will need to decide yourself which one you want to use. Our script either randomizes once at the start or re-randomizes in every epoch; only one randomized SMILES is generated per input SMILES.
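Schematically, the two strategies differ only in where the randomization step sits relative to the epoch loop. This is a toy sketch, not our actual training code; train_one_epoch and the input list are placeholders:

import random
from rdkit import Chem

def randomize(smiles):
    """Restricted randomized SMILES via random atom renumbering."""
    mol = Chem.MolFromSmiles(smiles)
    order = list(range(mol.GetNumAtoms()))
    random.shuffle(order)
    return Chem.MolToSmiles(Chem.RenumberAtoms(mol, order), canonical=False)

def train_one_epoch(data):
    pass  # placeholder: one training pass of the model over 'data'

smiles_train = ['CCO', 'c1ccccc1CNCCNCc1ccccc1']  # toy data
num_epochs = 3

# Strategy 1: randomize once at the start; every epoch sees the same strings.
data = [randomize(s) for s in smiles_train]
for epoch in range(num_epochs):
    train_one_epoch(data)

# Strategy 2: re-randomize every epoch; more expensive, fresh strings each time.
for epoch in range(num_epochs):
    data = [randomize(s) for s in smiles_train]
    train_one_epoch(data)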

If you have precomputed randomized SMILES, there is no real reason to use the script's randomization.