The charset

hsiaoyi0504 commented 7 years ago

As I proposed in maxhodak/keras-molecules#54, I am interested in why the charset is designed this way. It is not straightforward. From a chemistry viewpoint, chlorine "Cl" should not be split into "C" and "l". Redesigning the charset might bring some improvement. I used the implementation from keras-molecules, and when I tried to interpolate between two chemical structures (CC=C(C(=CC)c1ccc(O)cc1)c1ccc(O)cc1 and CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1), I got invalid structures like the ones below, so I guess the charset is the reason:

CC(C)(O)CCC1CCC(Cr)So2c1ccc(C)cc1
CCNC(=O)CN(CC1((l)CN1c1ccc(OC)cc1
CN1C(=O)CN(CC1((#)CN1c1ccc(OC)cc1
CN1C(=O)CC(CC()(=O)C1c1ccc(Cl)cc1
CN1C(=O)CC(NC()(=O)C1c1ccc(Cl)cc1
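One way to picture the proposed redesign is a tokenizer that keeps two-letter atoms whole before the charset is built. The sketch below is only an illustration, not the code used in keras-molecules or in this repository. The names `TOKEN_PATTERN`, `tokenize`, and `build_charset` are hypothetical. It relies on the fact that, outside square brackets, Cl and Br are the only two-letter atoms in the SMILES organic subset.

```python
import re

# Alternatives are tried left to right, so "Cl" and "Br" win over "C" and "B".
# Bracket atoms such as [Na+] and two-digit ring closures such as %12 are also
# kept as single tokens. This pattern is an illustrative assumption.
TOKEN_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into tokens, keeping 'Cl' and 'Br' whole."""
    return TOKEN_PATTERN.findall(smiles)

def build_charset(smiles_list):
    """Collect the sorted set of tokens seen in a list of SMILES strings."""
    return sorted({tok for s in smiles_list for tok in tokenize(s)})

tokens = tokenize("CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1")
charset = build_charset(["CC=C(C(=CC)c1ccc(O)cc1)c1ccc(O)cc1",
                         "CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1"])
```

With such a charset a decoder could no longer emit a stray "l" or "r", since neither exists as a token. It would not, by itself, prevent the unbalanced parentheses or unmatched ring closures seen in the other examples.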

duvenaud commented 7 years ago

Great suggestion. Yes, SMILES is clearly suboptimal for this reason. The molecular autoencoder would almost certainly work better if we used a modified language that had fewer opportunities to produce invalid strings.

jmhernandezlobato commented 7 years ago

Dear Hsiao Yi,

You may find the following paper relevant; we submitted it to arXiv very recently:

https://arxiv.org/abs/1703.01925

By using a grammar and building the variational autoencoder on the production rules of that grammar we avoid some of the problems that you mention.

Miguel.
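To illustrate the general idea of generating strings from production rules rather than from characters, here is a toy sketch. The grammar is a hypothetical, drastically simplified one, not the grammar from the paper, and `derive` is an illustrative helper. A string is described by a sequence of rule choices, each applied to the leftmost nonterminal.

```python
# Toy context-free grammar for a tiny SMILES-like language.
# Hypothetical and far smaller than any real SMILES grammar.
GRAMMAR = {
    "chain": [["atom"], ["atom", "chain"], ["atom", "branch", "chain"]],
    "branch": [["(", "chain", ")"]],
    "atom": [["C"], ["N"], ["O"], ["Cl"]],
}

def derive(rule_choices, start="chain"):
    """Expand the leftmost nonterminal with each chosen rule, in order."""
    symbols = [start]
    for choice in rule_choices:
        # Locate the leftmost symbol that is still a nonterminal.
        idx = next((i for i, s in enumerate(symbols) if s in GRAMMAR), None)
        if idx is None:
            break  # derivation already complete; extra choices are ignored
        options = GRAMMAR[symbols[idx]]
        symbols[idx:idx + 1] = options[choice % len(options)]
    if any(s in GRAMMAR for s in symbols):
        raise ValueError("derivation incomplete: nonterminals remain")
    return "".join(symbols)

example = derive([2, 0, 0, 0, 3, 0, 2])  # yields "C(Cl)O"
```

In this toy setting, "(" and ")" are only ever produced together by the `branch` rule, and "Cl" is produced by a single `atom` rule, so a completed derivation cannot contain an unbalanced parenthesis or a lone "l".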

(Reply sent by email, quoting hsiao yi's original comment of Tue, Feb 28, 2017 at 8:14 PM.)

yangxiufengsia commented 7 years ago

Hi, I tried to find the code for the Bayesian optimization used in this paper, but it does not seem to be included. Do you plan to share the BO code?

yangxiufengsia commented 7 years ago

I tried using Bayesian optimization to find better molecules. But when I run the BO search in the 292-dimensional space, I always get invalid SMILES, like the ones Hsiao Yi got, so I guess this might be caused by the way the inducing points are chosen, right?

duvenaud commented 7 years ago

You were doing BayesOpt in a 292-dimensional space? We were already having a hard time with a 56D space. One thing you might want to look at is the lengthscales of each dimension; we found that they were often very long, and that the GP was basically just doing linear regression.
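For readers unfamiliar with per-dimension lengthscales, the following is a minimal sketch of a squared-exponential kernel with one lengthscale per input dimension (often called ARD). It is written in plain Python for illustration and is not the GP code used in the experiments. The function name `ard_rbf` and the example values are assumptions.

```python
import math

def ard_rbf(x, y, lengthscales, variance=1.0):
    """Squared-exponential kernel with a separate lengthscale per dimension."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    return variance * math.exp(-0.5 * sq)

x = [0.0, 0.0]
y = [1.0, 1.0]

# Short lengthscales: the two points are treated as nearly unrelated.
k_short = ard_rbf(x, y, [0.5, 0.5])

# Very long lengthscales: the kernel is almost flat over the same distance,
# so the GP can express very little local, nonlinear structure.
k_long = ard_rbf(x, y, [100.0, 100.0])
```

When the fitted lengthscales are much larger than the spread of the data, as in the `k_long` case, the kernel barely distinguishes between points, which is consistent with the observation that the GP ends up behaving like a very smooth, close-to-linear model.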

jmhernandezlobato commented 7 years ago

I will try to upload the code for Bayesian optimization by next week. In our experiments we obtained a large number of invalid SMILES. At each point, we decoded a large number of SMILES (500) and, from those, kept only the valid ones.
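The decode-and-filter step described above could look roughly like the sketch below. It is not the released code: `decode` stands for any stochastic decoder passed in by the caller, and `looks_balanced` is a crude, hypothetical stand-in for a real validity check, which in practice would come from a chemistry toolkit.

```python
from collections import Counter

def looks_balanced(smiles):
    """Crude syntactic stand-in for a real validity check (an assumption):
    parentheses must balance and every ring-closure digit must occur an
    even number of times. A chemistry toolkit would be much stricter."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    if depth != 0:
        return False
    digits = Counter(ch for ch in smiles if ch.isdigit())
    return bool(smiles) and all(n % 2 == 0 for n in digits.values())

def decode_valid(decode, z, n_samples=500, is_valid=looks_balanced):
    """Call a stochastic decoder n_samples times at latent point z and
    count the distinct strings that pass the validity check."""
    counts = Counter()
    for _ in range(n_samples):
        s = decode(z)
        if is_valid(s):
            counts[s] += 1
    return counts
```

Returning counts rather than a plain list makes it easy to pick, for example, the most frequently decoded valid string at a given latent point, though how the valid samples were actually used in the experiments is not stated in this thread.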

yangxiufengsia commented 7 years ago

Thank you very much for answering my questions. Yes, I tried 292 dimensions using GPyOpt. For the lengthscale of each dimension, I use [-1,1]; I guess this lengthscale might not be correct. I look forward to your BO code.

abhik1368 commented 7 years ago

Can you explain why we are using a 292-dimensional space? What's the logic behind it?

HIPS / molecule-autoencoder

The charset #1