aspuru-guzik-group / chemical_vae

Code for 10.1021/acscentsci.7b00572, now running on Keras 2.0 and Tensorflow
Apache License 2.0

possible issue with random sampling #15

Open ga01 opened 5 years ago

ga01 commented 5 years ago

Hi guys,

I have encountered the following "weird" behaviour when I sample the latent space near a SMILES: the output molecules change very little with the specified noise level. My installation seems to be okay: it reproduces the examples (I'm using a CPU-based installation), so I wonder whether I am missing something. I provide some examples below, but the same happens for many other molecules. (For the cases here I take only 100 samples, but for "production" work I take tens of thousands, and the pattern remains.)

Noise 200:

    $ python get_vae_smiles.py "CSCC(=O)NNC(=O)c1c(C)oc(C)c1C" 2>/dev/null
    Using standarized functions? True
    Standarization: estimating mu and std values ...done!
    Input : CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
    Reconstruction : CSCC(=O)N(C(=O)c1c(C)oc(C)c1C
    Z representation : (1, 196) with norm 10.705
    Searching molecules randomly sampled from 200.00 std (z-distance) from the point
    Found 10 unique mols, out of 30
    SMILES
    0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
    1 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
    2 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
    3 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
    4 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
    5 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
    6 COCC(=O)NCC(=O)c1c(C)oc(C)c1Cl
    7 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
    8 COCC(=O)NC(=O)c1cc(O)nc(C)c1C
    9 C#COC(=N)NC(=O)c1ccccc(Cl)cc1Cl
    Name: smiles, dtype: object

Noise 2:

    Searching molecules randomly sampled from 2.00 std (z-distance) from the point
    Found 13 unique mols, out of 75
    SMILES
    0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
    1 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
    2 CSCC(=O)NC(C=O)c1c(C)oc(C)c1C
    3 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
    4 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
    5 COC(C=O)NNC(=O)c1c(C)oc(C)c1C
    6 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
    7 CSCC(=O)NCC(=O)c1c(O)oc(C)c1C
    8 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
    9 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
    10 COC(C=O)NCC(=O)c1c(C)oc(C)c1C
    11 CSCC(=O)N/C(=O)c1c(C)oc(C)c1C
    12 ClCC(=O)NCC(=O)c1c(C)oc(C)c1C
    Name: smiles, dtype: object

Noise 50:

    Searching molecules randomly sampled from 50.00 std (z-distance) from the point
    Found 14 unique mols, out of 65
    SMILES
    0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
    1 COCC(=O)NNC(=O)c1c(C)oc(C)c1C
    2 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
    3 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
    4 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
    5 CSC(C=O)NC(C=O)c1c(C)oc(C)c1C
    6 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
    7 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
    8 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
    9 COC(C=O)NCC(=O)c1c(C)oc(C)c1C
    10 CSCC(=O)N/C(=O)c1c(C)oc(C)c1C
    11 ClC(C=O)NCC(=O)c1c(C)oc(C)c1C
    12 ClCC(=O)NCC(=O)c1c(C)oc(C)c1C
    13 ClCC(=O)NC(C=O)c1c(C)oc(C)c1C
    Name: smiles, dtype: object

So it seems that for large Z distances the SMILES are not much different from those for small distances. What is the distribution of the random sampling? I would expect this behaviour if the random sampling is not uniform but heavily biased towards the coordinates of the input SMILES, so that the specified noise level affects only the periphery and most output molecules still originate from the close neighbourhood of the SMILES.

I would greatly appreciate any help with this issue.

Best wishes, Gyorgy Abrusan

jnwei-zz commented 5 years ago

The average distance between molecules is ~20. The molecules are distributed as you said, close to the SMILES of the training set molecules.
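As a rough sanity check on the ~20 figure, here is a minimal sketch, assuming the latent codes behave like independent standard normal vectors in 196 dimensions (the dimension is taken from the "Z representation : (1, 196)" line in the log above; the function name is hypothetical and nothing here touches the trained model). Under that assumption the typical distance between two codes is close to sqrt(2 * 196), about 19.8.

```python
import numpy as np

LATENT_DIM = 196  # from the "Z representation : (1, 196)" log line


def mean_pairwise_distance(n_pairs=5000, dim=LATENT_DIM, seed=0):
    """Average Euclidean distance between independent N(0, I) samples.

    Assumption: latent codes follow the standard normal prior.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    return float(np.linalg.norm(a - b, axis=1).mean())


estimate = mean_pairwise_distance()
theory = float(np.sqrt(2 * LATENT_DIM))  # large-dimension approximation
print(f"simulated: {estimate:.2f}, sqrt(2 * dim): {theory:.2f}")
```

If the encoded molecules deviate strongly from the prior, the real average distance could differ from this estimate.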

My guess is that setting the noise to a very large value makes it difficult to find valid SMILES. Bumping @beangoben to see if he has any ideas.

beangoben commented 5 years ago

hi Abrusan,

I think the problem might be that the noise level is too high. Searching for molecules from random vectors that are 50-200 STD away (z-distance wise) is a huge step. Each dimension is assumed to be Gaussian distributed, so the actual probability mass outside of 2-4 STD should be quite small (https://arxiv.org/abs/1609.04468). If I had to guess, I would think the RNN is just decoding whatever it can make sense of from the first molecule in the batch.
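To put numbers on the probability-mass argument, here is a small sketch for a single latent dimension, assuming it follows a standard normal distribution (the helper name is hypothetical). It computes the two-sided tail mass P(|x| > k) using the complementary error function.

```python
import math


def two_sided_tail(k):
    """P(|x| > k) for a standard normal variable x."""
    return math.erfc(k / math.sqrt(2.0))


# Tail mass at a few distances, measured in standard deviations.
tails = {k: two_sided_tail(k) for k in (2, 3, 4, 50)}
for k, p in tails.items():
    print(f"P(|x| > {k} std) = {p:.3e}")
```

About 4.6% of the mass lies beyond 2 STD and well under 0.01% beyond 4 STD; at 50 STD the value underflows to zero in double precision.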

I think you will find more subtle differences and a larger variety if you sample with 0.5 (local neighborhood) or 1.0-2.0 (random molecules). I am also not sure whether this is some effect of the decoder as coded. Another repo for generative molecular models that you can test out, and which has a PyTorch implementation, is https://github.com/molecularsets/moses

ga01 commented 5 years ago

Hi guys,

Thanks for the comments.

I have to admit I still think something critical is missing. I used several noise levels: 3, 6, 12, 25, 50, 100 (but also tried 0.1, and even 200; [noise=N, df = vae.z_to_smiles(z_1,decode_attempts=100,noise_norm=noise)]). My aim was to have a gradient from noise levels that sample the close neighborhood of a SMILES to noise levels that effectively pick SMILES randomly from the entire latent space. To my surprise, dramatic increases in the specified noise level lead to rather modest increases in the diversity of the returned molecules - I have not reached a noise level that effectively returns SMILES that are structurally unrelated to the input. (In other words - what noise level should I use to sample the entire latent space, essentially randomly?)

So it seems that in practice the relationship between the noise level and the structural diversity of the returned SMILES is rather nontrivial, which is surprising given the perturb_z (vae_utils.py) function (but I am not a Python programmer). It would be great if you could clarify this.

Best wishes, Gyorgy

ga01 commented 5 years ago

Hi guys, just one more comment. I wonder whether the explanation is an error in my assumption that by increasing the noise level I can reach a level that results in SMILES picked randomly from the latent space. The perturb_z function basically adds and amplifies Gaussian noise to the Z vector. However, if this distorts Z in qualitatively different ways from valid SMILES (i.e. the differences between valid Z vectors are not normally distributed), then only a small fraction of perturbed vectors, the ones closest to the input vector, will produce valid SMILES. In other words, by adding and amplifying Gaussian noise it is not possible to reach a noise level that produces valid SMILES that sample the entire latent space. This is good, as it makes the VAE robust, but it makes my original goal impossible.
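The reasoning above can be illustrated with a minimal sketch. The `perturb` function below is a hypothetical stand-in, not the repository's perturb_z: it assumes the noise is a Gaussian direction rescaled to a fixed norm, and it assumes a standard normal prior in 196 dimensions. The starting vector is random and only rescaled to the norm 10.705 reported in the log. The sketch compares the norm of perturbed vectors with the norms of samples drawn from the prior.

```python
import numpy as np


def perturb(z, noise_norm, rng):
    """Hypothetical stand-in for perturb_z (an assumption, not the repo code):
    add a random Gaussian direction rescaled to length noise_norm."""
    noise = rng.standard_normal(z.shape)
    noise *= noise_norm / np.linalg.norm(noise, axis=-1, keepdims=True)
    return z + noise


rng = np.random.default_rng(0)
dim = 196

# Random starting point rescaled to the norm reported in the log (10.705).
z = rng.standard_normal((1, dim))
z *= 10.705 / np.linalg.norm(z)

# Norms of samples from the assumed N(0, I) prior, for comparison.
prior_norms = np.linalg.norm(rng.standard_normal((10000, dim)), axis=1)
print(f"prior norms: mean {prior_norms.mean():.1f}, max {prior_norms.max():.1f}")

norms = {}
for noise_norm in (0.5, 2.0, 50.0, 200.0):
    norms[noise_norm] = float(np.linalg.norm(perturb(z, noise_norm, rng)))
    print(f"noise_norm {noise_norm:6.1f} -> |z'| = {norms[noise_norm]:.1f}")
```

Under these assumptions, a noise norm of 50 or 200 places the perturbed vector far outside the shell where prior samples concentrate (norm around 14), which is consistent with the idea that large noise leaves the region where the decoder was trained rather than sampling the latent space uniformly. Whether the units match the standardized "std (z-distance)" reported by the script is not something this sketch can confirm.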

Best wishes, Gyorgy