aspuru-guzik-group / selfies

Robust representation of semantically constrained graphs, in particular for molecules in chemistry
Apache License 2.0

question about output molecules #36

Closed fylinhub closed 3 years ago

fylinhub commented 3 years ago

Hi,

I have a general question about your VAE and GAN code. I've run it, and the output molecules are very diverse. Is there a way to make the output molecules similar to a query molecule, perhaps by changing some decoding or sampling parameters (rather than using a similarity criterion as a filter)?
Thank you!

MarioKrenn6240 commented 3 years ago

Hi fylinhub, it is actually a feature of SELFIES that generative models can learn such a large and diverse set of molecules.

If you are interested in structures similar to a given molecule, you can encode the molecule and get a location z0 in the latent space. Molecules around z0, i.e. at z0+epsilon, should then be rather similar to the initial molecule.
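
A minimal sketch of the z0+epsilon idea, assuming the latent vector is a plain list of floats. The names `sample_around` and `sigma` are hypothetical and not part of the selfies repository; encoding the query molecule to z0 and decoding the perturbed points back to molecules would be done with your own trained VAE.

```python
import random

def sample_around(z0, num_samples, sigma=0.1, seed=None):
    """Return num_samples points z0 + epsilon, epsilon ~ N(0, sigma^2) per dimension."""
    rng = random.Random(seed)
    return [[z + rng.gauss(0.0, sigma) for z in z0] for _ in range(num_samples)]

# Hypothetical 4-dimensional latent code of the query molecule.
z0 = [0.5, -1.2, 0.0, 2.0]
neighbors = sample_around(z0, num_samples=5, sigma=0.1, seed=0)
```

A smaller `sigma` keeps the samples closer to z0 and should give more similar molecules, while a larger one trades similarity for diversity. For a PyTorch tensor, the same perturbation could be written as `z0 + sigma * torch.randn_like(z0)`.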

If this doesn't work, you can try other techniques, for instance adding specific objectives to your loss function during training. Or you can switch to different models where the similarity is easier to control, such as genetic algorithms (https://github.com/aspuru-guzik-group/GA).

I hope this helps.

fylinhub commented 3 years ago

Hi Mario,

Thank you for your advice! I'll look at your GA codes.

I also noticed something strange when running the /examples/vae_example/chemistry_vae.py script with the QM9 dataset.

It is still running, but some of the output messages (shown below) report a validity of -0.10000 %. Is this normal?

    Epoch: 1409, Batch: 570 / 660, (loss: 0.0003 | quality: 100.0000 | quality_valid: 99.8819) ELAPSED TIME: 1.06914
    Epoch: 1409, Batch: 600 / 660, (loss: 0.0002 | quality: 100.0000 | quality_valid: 99.8667) ELAPSED TIME: 1.06615
    Epoch: 1409, Batch: 630 / 660, (loss: 0.0003 | quality: 100.0000 | quality_valid: 99.8571) ELAPSED TIME: 1.09208
    Validity: -0.10000 % | Diversity: -0.10000 % | Reconstruction: 99.87428 %
    Epoch: 1410, Batch: 0 / 660, (loss: 0.0004 | quality: 100.0000 | quality_valid: 99.8590) ELAPSED TIME: 0.03391
    Epoch: 1410, Batch: 30 / 660, (loss: 0.0012 | quality: 100.0000 | quality_valid: 99.8438) ELAPSED TIME: 1.07213

Also, to see how the molecules are generated, I simply modified the "latent_space_quality" function to also return "all_correct_molecules", which is already coded in the original script.

    def latent_space_quality(vae_encoder, vae_decoder, type_of_encoding,
                             alphabet, sample_num, sample_len):
        total_correct = 0
        all_correct_molecules = set()
        print(f"latent_space_quality:"
              f" Take {sample_num} samples from the latent space")

        for _ in range(1, sample_num + 1):

            molecule_pre = ''
            for i in sample_latent_space(vae_encoder, vae_decoder, sample_len):
                molecule_pre += alphabet[i]
            molecule = molecule_pre.replace(' ', '')

            if type_of_encoding == 1:  # if SELFIES, decode to SMILES
                molecule = sf.decoder(molecule)

            if is_correct_smiles(molecule):
                total_correct += 1
                all_correct_molecules.add(molecule)

        # return a set of correct molecules
        return total_correct, len(all_correct_molecules), all_correct_molecules

Thus, the following call returns "molecule_unique_set":

        corr, unique, molecule_unique_set = latent_space_quality(vae_encoder, vae_decoder,
                                            type_of_encoding, alphabet,
                                            sample_num, sample_len)

However, the molecules in "molecule_unique_set" look like this:

"' N=O', 'CC=NN=NN=NN=NN=NN=NN=NN=NN=NN=N', 'C#CO', 'N#N', 'O=NNOOOO', 'C=C(C=C=C=C=C=C=CC)', 'OCCCC', 'C(NNN=N)', 'N=NO', 'N=NN=O', 'C(C)CCCCC', 'N=NN=NN=NN=NN=NN=NN=NN', 'N(N=NN=NN=NN=N)', 'C#1CCCCCCCC#1', 'CCNNNNN=O', 'NC#N', 'NNN=C(NNNNNNNNNNNNNN)', 'CCCCCCCC', 'C=COOO', 'N#CN=CN=CN=CN=CN=CN=CN=CN=CN=C', 'N=NN(N=NN=NN=NN=NN=NN=NN=N)', 'CCOO', 'CCCC(C)', 'O=COC#CC#CC#CC#CC#CC#C', 'N(F)', 'C#CC', 'C=NN', 'N=NN=NN=NN=NN=NN=N', 'N=C=NN=NN=NN=NN=NN=NN=NN=NC=C=C=C', 'C=C=C=C=C=C=C=C=C=C=C=C=C=C=C=C=C=C=C=C=C', 'CCCCCCCCCCCC=O', 'C=C=C=C=C=C=C=C', 'NNN', 'CCCC=NN=NOOCC(CCCCCCC)', 'NCC=NN', 'O', 'OOOOOOOO', 'C=NCCCCC', 'FCCF', 'NCCC', 'N=NN=NN=NN=NN=NN=NC=C=C=C', 'C#N', 'O=NN', 'OOOOOOOOOOOO', 'OOOOOOOOOOOONC=NN=N', 'COC', 'N=NN=NN=NN=NN=N', 'FC=O', 'C=NN=NN=N', 'CCCC=NN=NN=NN(CCC)', 'CC=C', 'N(CCCC)', 'C#CC#CC=O', 'C#CC=O', 'N=NN=NN=NN=NN=NNC(N)', 'N#CC#CC#C', 'CCC=NN=N', 'NNNNF', 'OC=CC', 'OOOOOOOOOOOOO', 'C=1CC=1', 'O=NN=N', 'OCCC1C=C1', 'N(O)OOOOOOOOOOOOO', 'CC=N', 'C(C)CCCCCC', 'N=CN=O', 'FC=NO', 'FN=NN=NN(F)', 'N=C=C=C=C(NNNN=NN=NN=NN=NN=N)', 'NNNF', 'N=N', 'CCCCCCCC=NN=NN=NN=NN=N', 'FN=C', 'FON', 'FNN=O', 'OCC=O', 'FC=C']"

I actually don't think molecules such as "CC=NN=NN=NN=NN=NN=NN=NN=NN=NN=N" or "N(O)OOOOOOOOOOOOO", etc. are realistic. Are these results expected when running the "chemistry_vae.py"?

Thank you in advance for reading this issue.

MarioKrenn6240 commented 3 years ago

About your first question, "Validity: -0.10000 % | Diversity: -0.10000 %": please see chemistry_vae.py, lines 337-351. For reasons of speed, the validity is only calculated if the reconstruction quality improved ("# only measure validity of reconstruction improved"); otherwise it returns -1/100% (line 345).
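
The pattern described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the script's actual code: `report_validity`, `quality_improved`, and `measure` are names introduced here, and the sample size of 1000 is an assumption chosen because it reproduces the logged value.

```python
def report_validity(quality_improved, measure, sample_num):
    """Return (validity %, diversity %), measuring only when quality improved."""
    if quality_improved:
        corr, unique = measure()      # expensive: samples the latent space
    else:
        corr, unique = -1.0, -1.0     # placeholder meaning "not measured"
    return corr * 100.0 / sample_num, unique * 100.0 / sample_num

# Assumed sample size of 1000; the placeholder then shows up as -0.10000 %.
validity, diversity = report_validity(False, lambda: (0, 0), sample_num=1000)
print(f"Validity: {validity:.5f} % | Diversity: {diversity:.5f} %")
```

Under these assumptions, a negative validity simply means the measurement was skipped in that epoch, not that the model produced invalid molecules.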

About the second question, why so many of the structures are chains: there could be many reasons. For example, maybe you are sampling in a region that is not ideal. I would suggest that you find the locations of the training molecules in the latent space and then sample in this region. (They will likely be located in a sphere around zero; maybe you are sampling way too far from the center.)
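
One way to follow this suggestion, as a sketch under stated assumptions: encode the training molecules with your own VAE, estimate the per-dimension mean and spread of their latent vectors, and draw new samples from that region. The functions `fit_latent_region` and `sample_in_region` and the toy latent vectors are hypothetical, and the independent per-dimension Gaussian is a simplifying choice, not the method used in chemistry_vae.py.

```python
import random
import statistics

def fit_latent_region(latent_points):
    """Per-dimension mean and standard deviation of the training latent vectors."""
    dims = list(zip(*latent_points))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds

def sample_in_region(means, stds, num_samples, seed=None):
    """Draw samples from an independent Gaussian fitted to the training region."""
    rng = random.Random(seed)
    return [[rng.gauss(m, s) for m, s in zip(means, stds)]
            for _ in range(num_samples)]

# Hypothetical 2-dimensional latent vectors of four training molecules.
train_latents = [[0.1, -0.3], [0.4, 0.2], [-0.2, 0.0], [0.0, 0.5]]
means, stds = fit_latent_region(train_latents)
samples = sample_in_region(means, stds, num_samples=3, seed=0)
```

Comparing the fitted `means` and `stds` with the range you currently sample from is a quick check of whether your samples fall far outside the region occupied by the training molecules.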

Hope this helps.

fylinhub commented 3 years ago

I see; it is like my previous question. Thank you for your advice!