aspuru-guzik-group / Tartarus

A Benchmarking Platform for Realistic And Practical Inverse Molecular Design
https://arxiv.org/abs/2209.12487

Reproducibility #11

Closed hyeonahkimm closed 2 months ago

hyeonahkimm commented 2 months ago

Hi,

I have tried to reproduce the experiments in the paper using REINVENT, based on the Jupyter notebook and code in the provided reinvent-benchmarking GitHub repo. Since minor errors occurred, I slightly modified the code (mostly related to the custom scoring function).

Nevertheless, I still have issues running the code (I only changed the batch size from 500 to 100).

P.S. I tested on an AMD EPYC 7542 32-Core Processor (128 CPUs).

It would be greatly helpful if you could share the code to reproduce the results in the paper.

Thanks,

gkwt commented 2 months ago

Hello @hyeonahkimm ,

  1. The evaluation should not be taking that long. Are you able to run the fitness function on a single SMILES string and get results? If you encounter any errors, please report them here.
  2. The surrogate model has been removed because it was not accurate compared to the evaluation itself, and not a useful metric to optimize (we will update this in the repo). We recommend troubleshooting the scoring function. If you want to debug the REINVENT model, you can set up a dummy scoring function that returns a random number, or something similar (see the sketch after this list).
  3. Which protein are you evaluating? Are you on a Linux machine? You may have to give executable permission for the binaries of qvina and smina. Instructions are given in the README.md of the repo.
  4. The hyperparameters set in the reinvent-benchmarking repo are the ones used in the manuscript.
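For point 2, a minimal sketch of such a dummy scoring function, assuming the custom scoring hook in reinvent-benchmarking accepts a batch of SMILES and returns one score per molecule (the exact interface expected by the repo may differ):

```python
import random
from typing import List

class DummyScoringFunction:
    """Placeholder oracle that returns a random score per SMILES.

    Only useful for checking that the REINVENT loop runs end to end;
    the scores carry no chemical meaning.
    """

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def __call__(self, smiles_list: List[str]) -> List[float]:
        # One pseudo-random score in [0, 1] per proposed molecule.
        return [self.rng.random() for _ in smiles_list]

# Example usage: plug this in wherever the real Tartarus objective would be called.
print(DummyScoringFunction()(["CCO", "c1ccccc1"]))
```

If the loop finishes quickly with this stand-in, the slowdown is in the scoring function rather than in REINVENT itself.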

Gary

hyeonahkimm commented 2 months ago

Thanks for the quick response.

  1. When I run example.py, there is no error, but it does not finish. I found that the following call has an issue (in pce.get_properties): `command = 'CHARGE={};xtb {} --opt normal -c $CHARGE --iterations 4000 > out_dump'.format(charge, 'crest_best.xyz')` followed by `system(command)`. I have a similar issue in tadf (in tadf.xtb()): the process never finishes. I might have missed some settings related to xtb (I set the environment variable XTBHOME following the local installation guideline in docs/getting_started.rst; a quick environment check is sketched after this list).

For the docking and reactivity tasks, I've encountered the errors shown in the attached screenshot. FYI, during docking evaluation, lig files are properly generated and removed, while pose files are not generated.

  2. I believe the REINVENT code works well with the log_p scoring function, but I'll also try it with a random custom scoring function as you recommended.
  3. I've tested 1syh and 6y2f with qvina on a Linux machine (Ubuntu 22.04), and I ran chmod 777 on tartarus/data/qvina and smina.
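As an aside, a quick environment check like the one below can help confirm whether the external xtb/crest binaries that the pce and tadf tasks shell out to are actually visible and responsive; the only assumption is that XTBHOME is the variable referenced in docs/getting_started.rst:

```python
import os
import shutil
import subprocess

# Check that the external binaries used by the pce/tadf tasks are on PATH.
for exe in ("xtb", "crest"):
    path = shutil.which(exe)
    print(f"{exe}: {path if path else 'NOT FOUND on PATH'}")

# Variable mentioned in docs/getting_started.rst; print it for inspection.
print("XTBHOME =", os.environ.get("XTBHOME", "<unset>"))

# Run xtb with a timeout so a hang is detected instead of blocking forever.
try:
    out = subprocess.run(["xtb", "--version"], capture_output=True, text=True, timeout=30)
    print(out.stdout or out.stderr)
except (FileNotFoundError, subprocess.TimeoutExpired) as err:
    print("xtb did not run cleanly:", err)
```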
hyeonahkimm commented 2 months ago

I figured out the issue in pce and tadf: it was due to an incorrect crest installation. crest_best.xyz was empty before, but now it is properly generated. They now take 212 s and 22 s, respectively, for a single evaluation (example.py).

Results

PCE1: -3.908
PCE2: -7.605
Singlet-triplet: -0.840
Oscillator strength: 0.019
Combined obj: -2.161
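For reference, results like these come from single-molecule evaluations roughly along the following lines; the exact return signatures of pce.get_properties and tadf.get_properties should be taken from example.py rather than from this sketch:

```python
from tartarus import pce, tadf

smi = "c1ccc2ccccc2c1"  # illustrative test molecule, not one from this thread

# Assumed interface: each task module exposes get_properties(smiles) returning
# a tuple of floats; consult example.py / the README for the exact ordering.
pce_results = pce.get_properties(smi)
tadf_results = tadf.get_properties(smi)  # e.g. singlet-triplet gap, oscillator strength, combined objective

print("PCE task:", pce_results)
print("TADF task:", tadf_results)
```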

akshat998 commented 2 months ago

Hi @hyeonahkimm,

I highly recommend reviewing the Methods Overview section of the manuscript. Specifically, this paragraph is important; however, please read the entire section for further clarification:

When using TARTARUS, the following procedures should be adopted to obtain benchmark results that are consistent with the ones provided herein. The first step for running one of the benchmarks, if necessary, is to train the generative model on the provided dataset. For all the ML models, we used the first 80% of the reference molecules for training and the remaining 20% for hyperparameter optimization. Then, the (trained) model is tasked with proposing structures to be evaluated by the objective function of the corresponding benchmark task. Notably, structure optimization was always initiated using the best reference molecule from the corresponding dataset. For the benchmarks concerned with designing photovoltaics, organic emitters, and protein ligands, structure optimization was carried out with a population size of 500 and a limit of 10 iterations, leading to a maximum number of 5,000 proposed compounds overall. For the design of chemical reaction substrates, we used the same maximum number of proposed compounds but used a population size of 100 and limited the number of iterations to 50 instead. Additionally, the associated run time was limited to 24 hours, which resulted in termination for several molecular design runs before reaching 5,000 molecule evaluations. Furthermore, to increase robustness and reproducibility of our results, we repeated each optimization run five times, allowing us to report the corresponding outcomes with both an average and a standard deviation. We believe that this resource-constrained comparison approach is necessary for fairly comparing methods and should be used as a standard by the community. A detailed account of the parameters and settings used for running each of the models is provided in the Computational Details section of the Supporting Information.

Additionally, it seems that the calculations are being performed during training. Please correct me if I am mistaken. Note that this approach is incorrect and can significantly increase the training time for any of the models. The evaluation (with calls to the specific tasks) is only meant to happen after completion of the training procedure, as highlighted in the manuscript.
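To make the intended separation concrete, here is a rough sketch of the protocol described in the quoted paragraph, with the objective called only after (pre)training; model.pretrain, model.propose, model.update, and objective are placeholders, not Tartarus or REINVENT API:

```python
import time

def run_benchmark(model, dataset, objective, population_size=500,
                  max_iterations=10, wall_clock_limit_s=24 * 3600):
    """Sketch of the Methods Overview protocol (placeholder interfaces).

    Training never calls the expensive objective; oracle calls happen only
    in the optimization phase and are capped by population size, iteration
    count, and a 24 h wall-clock limit.
    """
    # 1) Train on the first 80% of the dataset; hold out 20% for hyperparameters.
    split = int(0.8 * len(dataset))
    model.pretrain(train=dataset[:split], valid=dataset[split:])  # no objective calls here

    # 2) Optimization: propose structures, evaluate, update the model.
    start, n_evaluated = time.time(), 0
    for _ in range(max_iterations):
        if time.time() - start > wall_clock_limit_s:
            break  # early termination, as described in the manuscript
        candidates = model.propose(population_size)  # 500 x 10 = 5,000 proposals max (100 x 50 for reactivity)
        scores = [objective(smi) for smi in candidates]
        model.update(candidates, scores)
        n_evaluated += len(candidates)
    return n_evaluated
```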

akshat998 commented 2 months ago

Regarding the docking objective, I strongly suspect that the QuickVina2 executable is not working or that the molecule generated is extremely unstable/infeasible, preventing successful calculations. Could you please try the following:

  1. Run ./qvina in the directory containing the executable and let us know what output you receive. If you encounter any errors, please report them here.
  2. For debugging purposes, we provide an example molecule in docking.py. Please run the file as is and inform us of the output (see the sketch below). The molecule that should run perfectly is C1=NC2=C(N=C1)C(=CC=N2)C1=CC=NC2=C1N=CN=C2.
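The same check can be run from Python; the sketch below assumes a per-receptor helper such as docking.get_1syh_score, so please take the actual function name from docking.py / example.py:

```python
from tartarus import docking

# Known-good test molecule from the comment above.
smi = "C1=NC2=C(N=C1)C(=CC=N2)C1=CC=NC2=C1N=CN=C2"

# Assumed entry point; the thread shows separate qvina and smina scores for 1syh,
# so the actual helper(s) in docking.py may differ from this name.
score = docking.get_1syh_score(smi)
print("1syh docking score:", score)
```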
hyeonahkimm commented 2 months ago

Thanks for all the help.

  1. Training process: Thanks for sharing the guidelines. To run REINVENT, I'm using the pretraining code from reinvent-benchmarking with the provided dataset, and it seems to follow the same procedure (80% for training and 20% for validation). The results in my previous comment were obtained by running the provided example file, not from training.

  2. Docking objective: I tried to run ./qvina directly and found a file path issue (this error was not printed when I ran the provided example.py because of exception handling).

I addressed the error by changing the receptor file path in docking.py from ./docking_structures/1syh/prot.pdbqt to ./tartarus/docking_structures/1syh/prot.pdbqt.
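For anyone hitting the same problem, one way to make the receptor path independent of the working directory is to resolve it relative to the module file inside docking.py; this is only a sketch of the idea, not the repo's actual code:

```python
import os

# Resolve the receptor relative to this module rather than the current working
# directory, so the path works no matter where example.py is launched from.
MODULE_DIR = os.path.dirname(os.path.abspath(__file__))
receptor_path = os.path.join(MODULE_DIR, "docking_structures", "1syh", "prot.pdbqt")
```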

Now I can see that pose files are properly generated (and removed) and scores are returned for C1=NC2=C(N=C1)C(=CC=N2)C1=CC=NC2=C1N=CN=C2:

qvina 1syh docking score: -5.9
smina 1syh docking score: -5.5

hyeonahkimm commented 2 months ago

I appreciate your quick and kind responses.