aspuru-guzik-group / ORGANIC

Code repo for optimizing distributions of molecules.
GNU General Public License v3.0

Reuse the saved model #1

Closed: sweetkurapika closed this issue 6 years ago

sweetkurapika commented 7 years ago

I was reading the ORGANIC research paper and trying to use the available source code. After installing the requirements, I ran the program successfully. The model was saved in a folder called "tutorial1", and now I would like to reuse it as a standard user: I would like the model to accept an input (e.g. Cc1ncnno1) and produce similar outputs. How can I do that? If there is a simple example program showing how to reuse it, I hope you can share it with me. Thank you in advance; I appreciate your help.

couteiral commented 7 years ago

Hi @sweetkurapika

That's an easy one: you have to use the 'load_prev_training' function. This is a self-explanatory example:

from organic import ORGANIC

model = ORGANIC('your_name')
model.load_training_set('../data/trainingsets/toy.csv')
model.load_prev_training('tutorial1/name_of_your_trained_ckpt')

After that, you just continue with the usual commands (set_training_program and so on).

Hope this is useful, Carlos

sweetkurapika commented 6 years ago

Thank you for your help. I followed your instructions, and I wrote this example:

from organic import ORGANIC
import mol_methods as mm
from collections import OrderedDict

params = {
    'MAX_LENGTH':     16,
    'GEN_ITERATIONS':  1,
    'DIS_EPOCHS':      1,
    'DIS_BATCH_SIZE': 30,           # DISCRIMINATOR Batch Size
    'GEN_BATCH_SIZE': 30,           # Generator Batch Size
    'GEN_EMB_DIM':    32,
    'DIS_EMB_DIM':    32,
    'DIS_FILTER_SIZES':[  5,  10,  15],
    'DIS_NUM_FILTERS': [100, 100, 100]
}

model = ORGANIC('Tutorial 1', params = params)

data = '../data/trainingsets/toy.csv'
ckpt = 'checkpoints/tutorial1/checkpoints/tutorial1/tutorial1_0.ckpt'
model.load_training_set(data)
model.load_prev_training(ckpt)
model.set_training_program(['logP'], [5])
model.load_metrics()

results = OrderedDict({'exp_name': 'Tutorial 1'})
results['Batch'] = 30
train_samples = mm.load_train_data(data)
char_dict, ord_dict = mm.build_vocab(train_samples)
gen_samples = model.generate_samples(30)
mm.exa_compute_results(gen_samples, train_samples, ord_dict, results)

Is this code correct? As you may notice, I am using the same training data in 'load_training_set' and 'load_train_data', which is not actually my objective. My objective is to introduce new molecules and see whether the program can generate similar ones. From the above code, I got the following result:

Total samples   :     30
Unique          :     30 (100.00%)
Unverified      :      8 (26.67%)
Verified        :     22 (73.33%)

Good samples:
~~~~~~~~~~~~
C#CC(C#N)C=O
N#Cc1cnc(F)cn1
C#Cc1[nH]nnc1O
N#CC#CC(=O)C#N
O=C1COC1C=O
CC1CN1
[NH]C1=CNOC1C=O
C#CC(=O)n1cnnn1
Fc1cccnc1F
c1c[nH]c(=O)o1
O=c1onco1
Nc1nc(=O)nco1
C#CC#CC(C)=O
N#Cc1ccc[nH]1
O=COc1nnoc(=O)n1
O=c1[nH]nnc(O)n1
C#Cc1ocnn1
Oc1ncccn1
Fc1ccncn1
Nc1n[nH]nc1O
N#CC(=O)c1ccon1
N#CC(O)C=O

Bad samples:
~~~~~~~~~~~
O=Cn1nonc1=O
N#Cc1nc(C=O)[nH]
C#CC#Cc1nnocn1
Nc1onc2nnon12
N#Cc1cnc(C#N)no1
c1nc2[nH]ncnn1
C#CC1CN1C1
O=c1onc(O)no1

As I said, my objective is to provide only a single SMILES string, "Oc1cc(CNN)ccc1", and get similar molecules generated by the ORGANIC model. Also, if I would like the similar molecules to share some common features, how can we do that with the ORGANIC model? I appreciate your help.

couteiral commented 6 years ago

Hi @sweetkurapika.

The ORGANIC model does not work that way. You cannot "feed" a molecule to a trained model and expect it to generate similar molecules. The only time the generator interacts with the training set is at the beginning, in the pre-training step, when it learns the initial distribution from this data; from then on, no other training set information is given.

Maybe you can try the experimental substructure_match_all and substructure_match_any metrics included in the custom_metrics.py file. That way, you can train with a set of molecules whose characteristics are similar to yours, and then impose on ORGANIC a metric that rewards the presence of certain features.
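
To give a rough idea of what such a substructure reward could look like, here is a minimal sketch written directly against RDKit (which ORGANIC already requires). It is not the code from custom_metrics.py: the function name substructure_reward, its signature, and the two SMARTS patterns (chosen by hand to describe the phenol and hydrazine groups of "Oc1cc(CNN)ccc1") are assumptions made for illustration, and wiring the function into ORGANIC's metric loading is not shown.

```python
from rdkit import Chem


def substructure_reward(smiles, smarts_patterns, mode="any"):
    """Hypothetical reward: 1.0 if the molecule contains the requested
    substructures, 0.0 otherwise (including unparsable SMILES)."""
    if mode not in ("any", "all"):
        raise ValueError("mode must be 'any' or 'all'")

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None for invalid SMILES; give those no reward.
        return 0.0

    patterns = [Chem.MolFromSmarts(s) for s in smarts_patterns]
    if any(p is None for p in patterns):
        raise ValueError("invalid SMARTS pattern")

    hits = [mol.HasSubstructMatch(p) for p in patterns]
    matched = all(hits) if mode == "all" else any(hits)
    return 1.0 if matched else 0.0


# Illustrative patterns (assumed, not taken from ORGANIC):
# aromatic carbon bearing an OH group, and an N-N single bond.
FEATURES = ["c[OH]", "NN"]

if __name__ == "__main__":
    for smi in ["Oc1cc(CNN)ccc1", "Oc1ncccn1", "CC1CN1"]:
        print(smi,
              substructure_reward(smi, FEATURES, mode="any"),
              substructure_reward(smi, FEATURES, mode="all"))
```

The "any" mode rewards a molecule that shows at least one of the features, while "all" only rewards molecules that show every one of them, which mirrors the distinction suggested by the two metric names.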

If this is what you are looking for, please write to me and I will give you more information.

Regards, Carlos

xuzhang5788 commented 6 years ago

@couteiral I have the same need as @sweetkurapika. Could you provide an example for us to follow? Regarding the set of molecules, how many do we need to prepare for training? If the set is too small, say two or five molecules, will there be overfitting problems when training the model? Thanks a lot.