connorcoley / ochem_predict_nn


How to reproduce hybrid model with 94.6% top10 #1

Open JasonYCHuang opened 7 years ago

JasonYCHuang commented 7 years ago

Hi @connorcoley

Great work.

I can run predictions with main/output/10rxn_demo1, and I would like to reproduce the result described in the paper: the hybrid model with 94.6% top-10 accuracy. I tried to train the model with the following commands:

$ python ochem_predict_nn/data/generate_candidates_edits_fullgrants.py
$ python ochem_predict_nn/data/preprocess_candidate_edits_compact.py
$ python ochem_predict_nn/main/score_candidates_from_edits_compact.py --hybrid=1 --baseline=1

but I get the following error:

Traceback (most recent call last):
  File "ochem_predict_nn/main/score_candidates_from_edits_compact.py", line 912, in <module>
    train(model, data)
  File "ochem_predict_nn/main/score_candidates_from_edits_compact.py", line 491, in train
    verbose = 1,
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 4 arrays but instead got the following list of 1 arrays: [array([[[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  1.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ...,
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],

Could you point out what I am missing?

connorcoley commented 7 years ago

Hi Jason,

It looks like during the final migration to the repo I forgot to pass the baseline and hybrid args to the build() function.

Also, when running the hybrid model, there's no need to include --baseline=1; that flag takes precedence over the --hybrid=1 flag, so you'd end up with a baseline fingerprint-based model instead.
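
For anyone hitting this class of Keras error elsewhere, here's a minimal standalone illustration (not this repo's code): a model declared with N inputs must receive exactly N arrays at fit/predict time.

import numpy as np
from keras.layers import Input, Dense, concatenate
from keras.models import Model

# A model with two declared inputs...
a = Input(shape=(4,))
b = Input(shape=(4,))
out = Dense(1)(concatenate([a, b]))
model = Model(inputs=[a, b], outputs=out)

x = np.zeros((3, 4), dtype='float32')
model.predict([x, x])   # OK: two arrays for two declared inputs
# model.predict([x])    # ValueError: expected to see 2 arrays, got 1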

Hopefully this resolves the problem you're seeing. I was able to reproduce your error message, and this patch fixed it for me. Let me know the outcome!

JasonYCHuang commented 7 years ago

@connorcoley Thanks for the fast response.

To reproduce the high accuracy described in the paper, should I pass additional arguments on the command line?

For example, should I increase the maximum number of records, the number of candidate sets to read, and the fold setting, or is it enough to reproduce the result with the default values?

https://github.com/connorcoley/ochem_predict_nn/blob/master/data/generate_candidates_edits_fullgrants.py#L212
https://github.com/connorcoley/ochem_predict_nn/blob/master/data/preprocess_candidate_edits_compact.py#L135
https://github.com/connorcoley/ochem_predict_nn/blob/master/main/score_candidates_from_edits_compact.py#L810

connorcoley commented 7 years ago

@JasonYCHuang , the results in the paper used a data set of 15,000 reactions (split into a 5-fold CV). Because generating candidates takes quite a long time (as you might have discovered), the mongodump already contains candidate edits. You shouldn't have to run generate_candidates_edits_fullgrants.py yourself if you're looking to reproduce results.

That said, you do need to increase the number of records in preprocess_candidate_edits_compact.py to 15,000 as you describe. The default is set to 100 so that it is easy to test the workflow on a small dataset. The exact performance reported in the paper was averaged over all five folds (setting --fold to 1, 2, 3, 4, and 5 in score_candidates_from_edits_compact.py). You should get very similar performance for each of the folds, so it should suffice to leave the --fold flag at its default value of 1. The default layer sizes match what was used in the paper.
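
For concreteness, the whole run then looks something like this (the --hybrid and --fold flags are as discussed above; the record cap lives in the argparse options you linked, so I show it as a comment rather than guessing its exact name):

$ python ochem_predict_nn/data/preprocess_candidate_edits_compact.py   # with the record cap raised to 15000
$ python ochem_predict_nn/main/score_candidates_from_edits_compact.py --hybrid=1 --fold=1
$ # (the paper's numbers additionally average the results over --fold=2..5)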

JasonYCHuang commented 7 years ago

@connorcoley I can train the model with the following commands

$ python ochem_predict_nn/data/preprocess_candidate_edits_compact.py
$ python ochem_predict_nn/main/score_candidates_from_edits_compact.py  --hybrid=1

without running data/generate_reaction_templates.py or data/generate_candidates_edits_fullgrants.py.

[1] However, when running predictions, I get another error no matter which reactant SMILES I input.

$ python ochem_predict_nn/scripts/lowe_interactive_predict.py --tag 14982201301
...
...
Enter SMILES of reactants: [CH3:14][CH:15]([CH3:16])[CH2:17][CH2:18][O:19][N:20]=[O:21].[NH2:1][c:2]1[n:3][cH:4][c:5]([CH2:6][N:7]2[CH2:8][CH2:9][CH2:10][C:11]2=[O:12])[s:13]1
...
...
Generated 136 unique edit sets
Traceback (most recent call last):
  File "ochem_predict_nn/scripts/lowe_interactive_predict.py", line 238, in <module>
    score_candidates(reactants, candidate_list, xs)
  File "ochem_predict_nn/scripts/lowe_interactive_predict.py", line 142, in score_candidates
    pred = model.predict(xs, batch_size = 20)[0]
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1554, in predict
    check_batch_axis=False)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 74, in _standardize_input_data
    '...')
ValueError: Error when checking : the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 5 arrays but instead got the following list of 4 arrays: [array([[[[ -1.027 ,   2.827 ,  26.02  , ...,   0.    ,   0.    ,   0.    ],
         [ -1.027 ,   2.827 ,  26.02  , ...,   0.    ,   0.    ,   0.    ],
         [  0.    ,   0.    ,   0.    , ...,   ...

Could you point out what I am missing?

[2] For the --mincount flag at the following location, should I set it to 1689, or is the default good enough to get high accuracy?

https://github.com/connorcoley/ochem_predict_nn/blob/master/scripts/lowe_interactive_predict.py#L161

connorcoley commented 7 years ago

Hi @JasonYCHuang ,

The script lowe_interactive_predict.py was written for the edit-based model. Apologies for that not being clear. This means that the function lowe_interactive_predict.preprocess_candidate_edits always returns the four input arrays used in the edit-based model. I've modified the code in this script so that you should be able to use the baseline or hybrid models (i.e., that preprocess function can return all five input arrays necessary for the hybrid model). I haven't had a chance to test this patch yet, so please let me know if this does not fix your problem.
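
In pseudocode, the patch amounts to something like this (a hypothetical sketch; featurize_edits and fingerprint_features are illustrative names, not functions in this repo):

def preprocess_candidate_edits(candidates, hybrid=False):
    # Edit-based model: four input arrays (H lost, H gain, bond lost, bond gain).
    xs = list(featurize_edits(candidates))  # illustrative helper
    if hybrid:
        # Hybrid model: the reaction fingerprint becomes a fifth input array.
        xs.append(fingerprint_features(candidates))  # illustrative helper
    return xs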

The flag --mincount refers to the number of reaction precedents required to include a reaction template in the final template library, not the number of templates itself. The default (50) matches what was reported in the paper. If you lower this (e.g., to 25), rarer reactions will be included in the list of candidates, but the ranking will be more uncertain.
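
Conceptually, the filter is just this (a hypothetical sketch; the field name 'count' stands for the number of recorded precedents per template):

# Keep only templates backed by at least `mincount` reaction precedents.
templates = [t for t in all_templates if t['count'] >= args.mincount]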

JasonYCHuang commented 7 years ago

@connorcoley

Thanks.

With the modified code, I can run predictions. I am trying to reproduce the high-accuracy results with the 15k data set. It really takes a long, long time.

Anyway, I will let you know whether I can reproduce it in the end.

JasonYCHuang commented 7 years ago

Hi @connorcoley

After the prediction, I get results similar to Fig. 4a in your publication (histogram attached), and I have the following questions.

[1] main/output/a_timestamp_number/probs.dat keeps true_smiles and predicted_smiles. However, there are some differences between these two SMILES even for the training set (acc = 0.997). Does this mean the model can predict the correct class of reaction, but there is no guarantee that the predicted SMILES will be the same as the true SMILES?

// Here is the 1st example in probs.dat
rxn = [CH3:7][NH:8][c:9]1[cH:10][cH:11][cH:12][cH:13][cH:14]1.[O:15]=[S:16](=[O:17])([Cl:18])[Cl:19].[O:20]=[C:21]1[CH2:22][c:23]2[cH:24][cH:25][cH:26][cH:27][c:28]2[NH:29]1.[cH:1]1[cH:2][cH:3][n:4][cH:5][cH:6]1>>CN(c1ccccc1)S(=O)(=O)c1ccc2c(c1)CC(=O)N2[201]

true_smiles = CN(c1ccccc1)S(=O)(=O)c1ccc2c(c1)CC(=O)N2[201]
predicted_smiles = CN(c1ccccc1)S(=O)(=O)Cl

[2] When entering the SMILES of reactants for prediction, is it necessary to input atom-mapped reactants?

[3] I input reactants to predict, but the result is not the same as in probs.dat. I use the reaction from [1].

$ python ochem_predict_nn/scripts/lowe_interactive_predict.py --tag 14982201301
>> Enter SMILES of reactants: [CH3:7][NH:8][c:9]1[cH:10][cH:11][cH:12][cH:13][cH:14]1.[O:15]=[S:16](=[O:17])([Cl:18])[Cl:19].[O:20]=[C:21]1[CH2:22][c:23]2[cH:24][cH:25][cH:26][cH:27][c:28]2[NH:29]1.[cH:1]1[cH:2][cH:3][n:4][cH:5][cH:6]1

>> Enter file name to save to: train1

Wrote to file ~/ochem_predict_nn/main/output/14988418531/train1.dat

The 1st rank in train1.dat is: Candidate product = C=C1C(=O)Nc2ccccc21 Probability = 0.0267125796527, but based on probs.dat, it should be CN(c1ccccc1)S(=O)(=O)Cl

[4] Meanwhile, the probability of the 1st-ranked candidate is only around 2.6%. Do you also get similar results?

I tested several reactions, and they show similar problems. Could you point out what I am missing? If you need further info, please let me know.

connorcoley commented 7 years ago

@JasonYCHuang ,

Sorry for the delay, I was out this past weekend. To your questions,

[1] In probs.dat, the column of predicted_smiles is populated with the rank-two candidate for cases when the model correctly predicts the outcome: see the column header here. Is this the discrepancy you're referring to?
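
In other words, the logging convention is roughly (a hypothetical sketch, not the repo's exact code):

# When the top-ranked candidate matches the recorded product, log the
# runner-up in predicted_smiles so the column stays informative.
ranked = sorted(candidates, key=lambda c: c['prob'], reverse=True)
logged = ranked[1] if ranked[0]['smiles'] == true_smiles else ranked[0]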

[2] It is not necessary to input mapped reactants in the interactive script.

[3,4] I'm not sure how this line got lost when migrating to this repo, but apparently I dropped the load_weights part of the interactive script (obviously very important...). This should be fixed, but do let me know if that does not resolve your problem. You could add the argument by_name=True if you'd like, but that shouldn't be necessary when restoring from the same version of Keras with which the model was saved.
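
For reference, here's a minimal standalone illustration of the restore step (not this repo's code; the file name is illustrative):

from keras.models import Sequential
from keras.layers import Dense

# Without load_weights, predict() would run on randomly initialized weights.
model = Sequential([Dense(10, input_shape=(5,))])
model.save_weights('demo_weights.h5')

restored = Sequential([Dense(10, input_shape=(5,))])
restored.load_weights('demo_weights.h5')
# by_name=True matches layers by name rather than by order; it is unnecessary
# when saving and restoring with the same Keras version and architecture.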

JasonYCHuang commented 7 years ago

Hello @connorcoley

Thanks for the feedback.

[1] I get the following error with the load_weights correction:

100%|##########################################################################################################################################| 1689/1689 [00:02<00:00, 619.62it/s]
Generated 95 unique edit sets
Traceback (most recent call last):
  File "ochem_predict_nn/scripts/lowe_interactive_predict.py", line 257, in <module>
    score_candidates(reactants, candidate_list, xs)
  File "ochem_predict_nn/scripts/lowe_interactive_predict.py", line 157, in score_candidates
    pred = model.predict(xs, batch_size = 20)[0]
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1572, in predict
    batch_size=batch_size, verbose=verbose)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1202, in _predict_loop
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.py", line 1094, in __call__
    return self.function(*inputs)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
  File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 872, in rval
    r = p(n, [x[0] for x in i], o)
  File "/usr/local/lib/python2.7/dist-packages/theano/tensor/blas.py", line 1544, in perform
    z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: ('shapes (285,43) and (32,200) not aligned: 43 (dim 1) != 32 (dim 0)', (285, 43), (32, 200))
Apply node that caused the error: Dot22(Reshape{2}.0, flatten_H_lost/embed H_lost 1/kernel)
Toposort index: 71
Inputs types: [TensorType(float32, matrix), TensorType(float32, matrix)]
Inputs shapes: [(285, 43), (32, 200)]
Inputs strides: [(172, 4), (800, 4)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Elemwise{Composite{tanh((i0 + i1))}}[(0, 0)](Dot22.0, InplaceDimShuffle{x,0}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

[2] Regarding probs.dat, true_smiles is the predicted SMILES with the highest probability, not a copy of the main product from the database. Am I right?

connorcoley commented 7 years ago

[1] Based on the two numbers (43 vs. 32), it looks to be a difference between versions of the atom-level descriptors. The paper uses basic structural descriptors (32 of them), although I later expanded the descriptors to a richer featurization (increasing to 43). I've changed the three lines where edits_to_vectors appears in the interactive script to use ORIGINAL_VERSION = True. This explains all of the incompatibilities with lowe_interactive_predict.py; it appears this script corresponds to a more recent version of the code than what was published. Very sorry for the confusion.
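
The underlying failure is a plain shape mismatch, which you can reproduce standalone (not this repo's code):

import numpy as np

# Features built with the newer 43-dim atom descriptors meet an embedding
# kernel trained on the original 32-dim descriptors: the same
# 'shapes not aligned' error Theano raised above.
feats = np.zeros((285, 43))    # 285 atoms x 43 newer descriptors
kernel = np.zeros((32, 200))   # embedding weights for 32 original descriptors
np.dot(feats, kernel)          # ValueError: shapes (285,43) and (32,200) not aligned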

[2] true_smiles is copied directly from the database; you can see it as the second argument on this line.

JasonYCHuang commented 7 years ago

Hi @connorcoley

I used 3 test-set reactions from probs.dat, and the results look amazing!!

Chemists on our team are looking forward to using your work.

Thank you for your patience in giving me feedback.

connorcoley commented 7 years ago

Great! I'm glad we were able to get everything working for you. While there are plenty of known limitations discussed in the paper (template coverage being one), I'd be interested in hearing about any systematic shortcomings you come across or any ideas you have for improving the model. Internally, we've shifted over to using context-dependent Reaxys data (unpublished so far) and have made some improvements over this context-free, USPTO-trained version.