Rose-STL-Lab / LIMO

generative model for drug discovery

Similarity-constrained Penalized logP Maximization #8

Closed VincentH23 closed 10 months ago

VincentH23 commented 1 year ago

Hi, I would like to reproduce your experiment on constrained optimization of plogp (Table 3), but I could not find the corresponding code in your repo. Is it possible to get it? Thanks in advance for your answer.

PeterEckmann1 commented 1 year ago

Hi,

We don't have code for exactly that task in this repository, but it should be relatively easy to replicate. You can use generate_molecules.py with --prop penalized_logp, but change the initial z, which is usually randomly generated like this: https://github.com/Rose-STL-Lab/LIMO/blob/df6232a0ae67e8490d19a25a5acaf87e546543d5/generate_molecules.py#L25

Instead, you can set it to the result of the smiles_to_z function in utils.py, and then perform optimization starting from that z:

https://github.com/Rose-STL-Lab/LIMO/blob/df6232a0ae67e8490d19a25a5acaf87e546543d5/utils.py#L85

This function takes a SMILES string. To replicate the similarity-constrained plogp task, the inputs should be the 800 molecules with the lowest plogp in ZINC250k. After you optimize these molecules, recreating the results in the table is just a matter of keeping only the molecules above the chosen similarity threshold and calculating the metrics on those. Hopefully that helps, but let me know if you want more code examples or need any other help!
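The selection and filtering steps described above can be sketched roughly as follows. This is only an illustration, not code from the repository: `lowest_plogp` and `constrained_metrics` are hypothetical helper names, the plogp values and similarities are assumed to be computed elsewhere, and counting a run as a success only when plogp actually improved is an assumption about the metric definition.

```python
import heapq


def lowest_plogp(smiles_to_plogp, k=800):
    """Return the k SMILES with the lowest penalized logP, lowest first.

    smiles_to_plogp: dict mapping SMILES string -> precomputed plogp.
    """
    return heapq.nsmallest(k, smiles_to_plogp, key=smiles_to_plogp.get)


def constrained_metrics(records, threshold):
    """Success rate and mean improvement at one similarity threshold.

    records: list of (start_plogp, final_plogp, similarity) tuples,
             one per starting molecule.
    Assumption: a run counts as a success when the final molecule meets
    the similarity threshold AND its plogp is higher than the start.
    """
    improvements = [
        final - start
        for start, final, sim in records
        if sim >= threshold and final > start
    ]
    success_rate = len(improvements) / len(records) if records else 0.0
    mean_improvement = (
        sum(improvements) / len(improvements) if improvements else 0.0
    )
    return success_rate, mean_improvement
```

Calling `constrained_metrics` once per threshold (for example 0, 0.2, 0.4, 0.6) on the same records would give one success rate and one mean improvement per threshold.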

VincentH23 commented 1 year ago

Thanks for your answer. Do you have the code for mask creation? I used the code that you shared in https://github.com/Rose-STL-Lab/LIMO/issues/6, but I don't know which substructure I should keep for each molecule in this task.

PeterEckmann1 commented 1 year ago

The task you asked about in this issue, similarity-constrained plogp maximization, is slightly different from the task involving masking, which is substructure-constrained optimization. In the similarity-constrained setting, we just want to find molecules whose Tanimoto similarity to the starting molecule stays above a given threshold. In the substructure-constrained setting, we don't care about the Tanimoto similarity; we just want to keep a certain substructure the same. Does that help clarify a bit?
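For reference, Tanimoto similarity between two fingerprints is the number of shared on-bits divided by the number of on-bits in either fingerprint. A minimal sketch on plain sets of bit indices is below; which fingerprint type feeds into it (typically a Morgan fingerprint computed with a cheminformatics toolkit) is not specified in this thread, and the function name and the handling of two empty fingerprints are choices made only for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: treated as identical in this sketch.
        return 1.0
    return len(a & b) / union
```

A molecule then satisfies the similarity constraint at threshold `t` when `tanimoto(start_bits, new_bits) >= t`.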

VincentH23 commented 1 year ago

So for this task, you didn't use the L2 loss in the optimization process?

PeterEckmann1 commented 1 year ago

Correct, no L2 loss was used for the similarity-constrained optimization task.

VincentH23 commented 1 year ago

Hi, I tried to reproduce your results from Table 3, but my results are very different. For similarity thresholds [0, 0.2, 0.4, 0.6], I get success rates of [100%, 88.875%, 73.625%, 50.375%] and improvements of [6.12297867, 3.5330558, 2.77459332, 2.3263943].

I have a coefficient of R = 0.59 for the model, and I used lr = 0.1. Did you use a different lr value for this task?

PeterEckmann1 commented 1 year ago

It's hard to say, but did you compute these metrics on the final set of generated molecules? That's the first reason I can think of why your success rates are lower.

For the paper, we optimized the entire set of molecules, but took the final molecule from each training "trajectory" before the molecule crossed the similarity threshold. So, for example, if we did 200 optimization steps, but the molecule became too dissimilar at the 100th iteration, we took the molecule from the 99th iteration and computed metrics on that one. If you already did that, maybe try playing around with the lr... I think we used lr=0.1 for the paper, but there could be a better value.
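The "last molecule before crossing the threshold" rule could be written along these lines. It is a sketch with a hypothetical function name, assuming that the similarity to the starting molecule has been recorded at every optimization step.

```python
def last_before_crossing(similarities, threshold):
    """Index of the last step before similarity first drops below threshold.

    similarities: per-step Tanimoto similarity to the starting molecule,
                  in optimization order (step 0 first).
    Returns None if there are no steps or if step 0 is already below the
    threshold; returns the final index if the threshold is never crossed.
    """
    for step, sim in enumerate(similarities):
        if sim < threshold:
            return step - 1 if step > 0 else None
    return len(similarities) - 1 if similarities else None
```

With this rule, a later step that happens to drift back above the threshold is ignored, because only the part of the trajectory before the first crossing is considered.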

Finally, your R value seems okay, but it might improve if you increased the number of molecules you trained on. Maybe try running train_property_predictor.py with a larger set of molecules?

VincentH23 commented 1 year ago

I calculated the metrics over all the elements of each trajectory. For each starting molecule and each threshold, I select the best molecule in the trajectory (similar to JT-VAE).
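One way to express this best-of-trajectory selection is sketched below. The function name and data layout are hypothetical; each trajectory is assumed to be a list of (plogp, similarity) pairs, one per optimization step.

```python
def best_per_threshold(trajectory, thresholds):
    """Best plogp in a trajectory among steps meeting each threshold.

    trajectory: list of (plogp, similarity_to_start) pairs, one per step.
    thresholds: iterable of similarity thresholds.
    Returns a dict mapping threshold -> best plogp, or None when no step
    in the trajectory satisfies that threshold.
    """
    best = {}
    for t in thresholds:
        eligible = [plogp for plogp, sim in trajectory if sim >= t]
        best[t] = max(eligible) if eligible else None
    return best
```

Unlike taking the last step before the threshold is crossed, this picks the highest-scoring step anywhere in the trajectory, so the two selection rules can report different molecules for the same run.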

VincentH23 commented 1 year ago

I think the problem comes from my z vectors. I use smiles_to_z to get them, but I noticed that the similarity between the decoded molecules and the original molecules is sometimes very low. Do you have any other tips for getting the z vectors?
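A quick round-trip check can show how many starting molecules are affected. The sketch below is generic: `encode`, `decode`, and `similarity` are placeholder callables supplied by the caller (for example a wrapper around smiles_to_z, the decoder, and a Tanimoto function), and the 0.6 cutoff is an arbitrary value chosen for illustration.

```python
def flag_poor_reconstructions(smiles_list, encode, decode, similarity,
                              min_similarity=0.6):
    """List molecules whose encode -> decode round trip drifts too far.

    encode:     callable, SMILES -> latent z (placeholder)
    decode:     callable, latent z -> SMILES (placeholder)
    similarity: callable, (SMILES, SMILES) -> float in [0, 1] (placeholder)
    Returns a list of (original, reconstruction, similarity) tuples for
    every molecule whose round-trip similarity is below min_similarity.
    """
    flagged = []
    for smi in smiles_list:
        recon = decode(encode(smi))
        sim = similarity(smi, recon)
        if sim < min_similarity:
            flagged.append((smi, recon, sim))
    return flagged
```

If many of the starting molecules are flagged, their trajectories begin below the similarity threshold before any optimization step is taken, which would lower the success rate on its own.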

PeterEckmann1 commented 1 year ago

Sorry for the late response on this... could you share the code you're using? I don't have any great ideas on how to produce better z vectors, but maybe if I can see what you have so far I can offer some pointers.