connorcoley / scscore

MIT License

How to retrain the reaction NN model #5

Closed chengthefang closed 5 years ago

chengthefang commented 6 years ago

Hi Connor,

Thanks to your help, I have no problem running standalone_model_numpy.py. However, I am quite interested in how to retrain the reaction NN model, which you mentioned in the README file. For example, starting from a reaction database, how does one learn the SCScore with the NN model? Also, I found that you put nntrain_fingerprint.py and standalone_model_tf.py in the scscore folder as well. I wonder how to run those two, and when to use them and for what kinds of purposes.

Thanks, Cheng

connorcoley commented 6 years ago

Cheng,

You should be able to try training the example model using the (very, very small) provided reaction file data/reaxys_limit10.txt and its corresponding pre-calculated fingerprint file data/reaxys_limit10.txt.h5 by running scscore/nntrain_fingerprint.py.
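
From the repository root, that should be as simple as running the script with no arguments, since the defaults point at these example files:

```
python scscore/nntrain_fingerprint.py
```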

To train on your own reaction corpus, there are a few steps.

  1. Prepare a text file where each line consists of a reaction SMILES string, an integer, and an ID, separated by spaces (a sketch of the format appears after this list). The integer is intended to be the total number of heavy atoms in the reactants or products (whichever is larger), but it is not actually used in this model. Because it is expected by a number of scripts, I would recommend putting a "0" there to be safe.

  2. Run scripts/make_h5_file.py <path_to_your_file>. You might want to modify this script to use a different fingerprinting strategy. Right now, the different options are crudely commented out instead of being handled nicely by a command line argument. This script will create a .pkl of the (shuffled) original data and a .h5 containing the fingerprints. This is so that during training, we don't have to regenerate fingerprints every epoch, which can be a rate limiting step.

  3. Run scscore/nntrain_fingerprint.py with the command-line arguments --train <path_to_your_file> and --save_dir <name_for_your_new_model>. Based on a hardcoded value, the model will train for 30 epochs and save checkpoints after every one. The first 80% of the shuffled data file will be used for training.

  4. Evaluate model performance on the validation subset (the next 10% of the shuffled data) by running with --test valid, restoring different checkpoints via --checkpoint ckpt-######. Select the "best" model based on validation performance.

  5. Test the selected model (restored via --checkpoint ckpt-######) using --test test, and optionally use the --verbose flag to get a detailed output of predicted scores. Alternatively, you can use the interactive mode with --interactive and be prompted for SMILES strings to evaluate.
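
To make those steps concrete, here is a rough sketch. The file name, model directory, and checkpoint number are placeholders rather than values from the repository, and the exact flag combinations are worth double-checking against the option parser in nntrain_fingerprint.py:

```
# Step 1: my_reactions.txt -- one "<reaction SMILES> <integer> <ID>" per line
CC(=O)O.OCC>>CC(=O)OCC 0 rxn-000001
O=Cc1ccccc1.CN>>CN=Cc1ccccc1 0 rxn-000002

# Steps 2-5, run from the repository root:
python scripts/make_h5_file.py my_reactions.txt
python scscore/nntrain_fingerprint.py --train my_reactions.txt --save_dir my_model
python scscore/nntrain_fingerprint.py --train my_reactions.txt --save_dir my_model --test valid --checkpoint ckpt-000010
python scscore/nntrain_fingerprint.py --train my_reactions.txt --save_dir my_model --test test --checkpoint ckpt-000010 --verbose
```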

I may have overlooked some details, but hopefully this gets you started re-training!

Connor

chengthefang commented 6 years ago

@connorcoley Hi Connor, thank you very much for your detailed instructions. I will give it a try. By the way, I wonder what the difference is between standalone_model_numpy.py and standalone_model_tf.py. Do they generate the same SCScore for a given molecule?

Thanks, Cheng

connorcoley commented 6 years ago

Provided that the same model weights are loaded, they should generate the exact same SCScore. The only difference is in the package used for numerical calculations: tensorflow (tf) or numpy (np). The model was originally developed using tensorflow+gpu; once trained, however, the actual calculation is fast enough that it's not a problem to use numpy+cpu.
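
For reference, deployment-time scoring with the numpy version looks roughly like the sketch below. Treat the class/method names and the weights path as assumptions to verify against standalone_model_numpy.py and the README rather than a guaranteed API:

```python
import sys
sys.path.append('scscore')  # make standalone_model_numpy importable from the repo root

from standalone_model_numpy import SCScorer

model = SCScorer()
# Example weights path; substitute whichever trained model you want to load
model.restore('models/full_reaxys_model_1024bool/model.ckpt-10654.as_numpy.json.gz')
(smi, score) = model.get_score_from_smi('CCCOCCC')
print(smi, score)  # SCScore falls on a 1-5 scale
```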

chengthefang commented 6 years ago

@connorcoley I see. Thank you very much!

Jbru18 commented 6 years ago

Hi Connor,

I tried to train the example model using the reaxys_limit10.txt file. Unfortunately, it did not work under various configurations (Ubuntu with Python 2.7, TF 1.1.0, RDKit 2018.09.1, h5py, numpy 1.11.3; OS X with Python 2.7, TF 0.12.1; Windows with Python 3.5, TF 1.1.12). In each case the code progressed to the following point and then the terminal became unresponsive, possibly due to an infinite loop:

Model size: 669K
Added read_data_master
Data length: 3
Letting queue fill up (10 s)

To me it seems that I am either using the wrong command line input (python scscore-master/scscore_future/nntrain_fingerprint.py -t -m ) or have an incorrect version of one of the prerequisites. Can you please let me know the correct command line input, the versions of the packages, and the OS that allowed you to train the model (is a GPU required for the small training set?).

Thanks,

J

Jbru18 commented 6 years ago

Hi Connor,

After eliminating some errors introduced during the Python 2 -> 3 conversion and looking at the option parser, I was able to construct working command line inputs.

python --save_dir --test valid --checkpoint ckpt-6

python --save_dir --test test --checkpoint ckpt-10 --verbose=true

The program runs and saves the desired files. The only remaining issue is that at the end of the training loop the Python program does not return to the command prompt; instead, a blinking cursor results. Any thoughts on this?

Regards,

J

connorcoley commented 6 years ago

Using the --test command line argument will have the model restore from the specified checkpoint and test on the data subset (valid or test) with that saved version of the model. Did you get the training to work properly?

The reaxys_limit10.txt data file is meant more to show the data format expected by the model; the "trained model" you get out will only have seen 10 reactions. I have noticed that occasionally the data preprocessing thread will not join (line 517 of nntrain_fingerprint.py), so the training script will hang at that line even though training is complete; you can kill the process with control+c or control+z.
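
If the hang is a nuisance, one general-purpose workaround (a sketch of the standard Python pattern, not a patch taken from nntrain_fingerprint.py) is to make the preprocessing thread a daemon thread, so the interpreter can exit even if the thread never joins:

```python
import threading

def read_data_master():
    # stand-in for the data preprocessing loop in nntrain_fingerprint.py
    pass

feeder = threading.Thread(target=read_data_master)
feeder.daemon = True  # daemon threads do not block interpreter exit (Python 2-compatible spelling)
feeder.start()
# ... training proceeds; no feeder.join() is needed at shutdown ...
```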

Jbru18 commented 6 years ago

I used the following command line: python scscore-master/scscore/nntrain_fingerprint.py -t scscore-master/data/reaxys_limit10_p3.txt -m scscore-master/models/20181127_model -b 2. The code executes, saves a file, and then hangs (control+c or control+z do not work).

However, I am not sure it actually trains. The model history file shows the same values for all the variables across iterations, e.g.:

000001 of 000060 [000000016 pairs seen], AvgDiff: -0.00, FracDiffPos: -0.000, FracDiff0.25: -0.000, PNorm: 0.00, GNorm: -0.00, Loss: 2.0000
Ex: 3.00>>3.00 -> diff = 0.00
Ex: ID742 === Cc1ccc(S(=O)(=O)OC[C@H]2OCO[C@@H]3C@@HOCO[C@H]23)cc1>>OCCNC[C@H]1OCO[C@@H]2C@@HOCO[C@H]12

I am not sure if this is the expected result for the small training set or an indication that it is not working. Can you please let me know which command you used to generate the example model.

Also, how would I restart or continue training from your model using in-house reaction data?

connorcoley commented 6 years ago

You should be able to just run python scscore-master/scscore/nntrain_fingerprint.py without any additional arguments.

The issue with your training may be your batch size of 2. The default batch size is 16384; with it, you should see the model dramatically overfit to the training data of the example model. If you decrease the batch size significantly, you will likely need to change the learning rate to compensate. The learning rate is set quite low so that the model doesn't overfit to the initial batches of data it sees; you can raise the learning rate (the line is: lr = 0.001) if you want to keep the batch size at 2, though I don't recommend it.

Jbru18 commented 5 years ago

Thanks for the guidance. I found that the Python 2 -> Python 3 conversion had introduced an error in some of the divisions. After correcting those, the code now runs and the loss decreases progressively.
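
For anyone else porting the code: the culprit is Python 3 changing / on integers to true division. A minimal illustration (not the repo's actual lines):

```python
# Python 2: 7 / 2 == 3   (ints use floor division)
# Python 3: 7 / 2 == 3.5 (true division), which silently breaks count/index math
n_examples, batch_size = 7, 2
num_batches = n_examples // batch_size  # use // wherever an integer is required
print(num_batches)  # 3 on both Python 2 and Python 3
```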

connorcoley commented 5 years ago

Ah glad to hear it!