connorcoley / scscore

MIT License

How to train a new model from scratch / continue training #8

Open Jbru18 opened 5 years ago

Jbru18 commented 5 years ago

Hi Connor,

From your previous post I learned that a model can be retrained. I am unclear, however, on how to train a model from scratch or how to extend (augment) the existing training. Can you please explain a little further?

I am also interested in using different fingerprints and/or similarity criteria. It was not clear to me which similarity criteria you used for comparing the fingerprints in the current version. Can you please shed some light on this?

Looking at the code, I could not decipher whether multiple fingerprints (as in the ECFPs) were generated and compared for each molecule, or just one, as is typical for the Morgan algorithm.

Thanks, J

connorcoley commented 5 years ago

The model itself is a simple feedforward neural network, which can in principle take any numerical descriptor vector as input. The scripts/make_h5_file.py file contains a few variations of Morgan fingerprint calculations, of different lengths and with boolean or integer values. We trained a few different models that each used a single one of these input representations.

The code doesn't use any similarity calculations. However, to test different fingerprints or representations, all that needs to be done is to prepare a different .h5 file (matrix) by changing the mol_to_fp function in scripts/make_h5_file.py and setting FP_len in the training script. For testing the retrained model, you would need to define the mol_to_fp function in nntrain_fingerprint.py to match the right input representation.
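To make the "different lengths and boolean/integer values" point concrete, here is a minimal, toolkit-free sketch of how a set of integer substructure identifiers can be folded into a fixed-length vector, either as bits or as counts. It is not the repository's mol_to_fp: the function name `fold_features` and the example identifiers are hypothetical, and in practice the identifiers would come from a cheminformatics toolkit. The only thing taken from the discussion above is that the vector length has to agree with FP_len in the training script.

```python
from collections import Counter
from typing import Iterable, List

FP_len = 1024  # assumption: must equal the FP_len set in the training script


def fold_features(feature_ids: Iterable[int], fp_len: int = FP_len,
                  use_counts: bool = False) -> List[int]:
    """Fold integer feature identifiers into a vector of length fp_len.

    use_counts=False gives a boolean (0/1) vector;
    use_counts=True gives an integer count vector.
    """
    # Identifiers that collide modulo fp_len share one position.
    counts = Counter(fid % fp_len for fid in feature_ids)
    if use_counts:
        return [counts.get(i, 0) for i in range(fp_len)]
    return [1 if i in counts else 0 for i in range(fp_len)]


# Hypothetical substructure identifiers for a single molecule.
example_ids = [3, 1027, 98765, 3]
bits = fold_features(example_ids, fp_len=1024)
counts = fold_features(example_ids, fp_len=1024, use_counts=True)
```

Whatever representation is chosen, the same function has to be used when building the .h5 file and when scoring molecules with the trained model, otherwise the input positions no longer mean the same thing.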

Jbru18 commented 5 years ago

Thanks for the explanation. I have very limited experience with neural networks. Does the model tolerate variable-length fingerprints, or do they all need to have the same length? Since the code does not use similarity, how does the scscore handle different types of chemistry (e.g. carbohydrate chemistry vs. peptide or aromatic chemistry) whose reaction networks have no crossover at lower complexity? Your article also mentioned that molecules frequently encountered as starting materials would have a lower scscore. How about reaction chains that frequently end with less structurally diverse molecules? Would a bias towards a higher scscore be observed?

connorcoley commented 5 years ago

This model architecture (standard feed forward neural network) does not tolerate variable length fingerprints. As a type of nonlinear regression, the model capacity is large enough that those different types of molecules can "exist" in different areas of the network. So certain parts will be more relevant for carbohydrate structures, others will be more relevant for aromatic compounds, etc.
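The fixed-length requirement follows from how a dense layer works: its weight matrix has one column per input position, so the input length is baked into the model. A minimal pure-Python sketch of that constraint is below; the layer sizes, the random initialization, and the ReLU activation are illustrative assumptions, not the SCScore model's actual configuration.

```python
import random
from typing import List, Sequence


def make_dense_layer(n_in: int, n_out: int, seed: int = 0) -> List[List[float]]:
    """Create an n_out x n_in weight matrix with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def dense_forward(weights: List[List[float]], x: Sequence[float]) -> List[float]:
    """Apply one dense layer with ReLU; the input length must match the layer."""
    n_in = len(weights[0])
    if len(x) != n_in:
        raise ValueError(f"expected input of length {n_in}, got {len(x)}")
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


layer = make_dense_layer(n_in=8, n_out=3)
hidden = dense_forward(layer, [1, 0, 0, 1, 0, 1, 0, 0])  # length 8: accepted

try:
    dense_forward(layer, [1, 0, 1])  # length 3: rejected
except ValueError as err:
    message = str(err)
```

This is why variable-length descriptors have to be folded or padded to a common length before they reach the network.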

Motifs that appear more often in products (or at the end of a longer chain of reactions) will be biased toward a higher scscore, yes.

Jbru18 commented 5 years ago

Thanks for the insight! How do I continue training the existing Reaxys model? Do I need to start from a checkpoint?

connorcoley commented 5 years ago

Yep! You'd want to restore from the checkpoint after initializing the saver, then continue on with training.

Jbru18 commented 5 years ago

Thanks Connor.

Is the latest checkpoint that includes the Reaxys data set 'model.ckpt-10654'? Can you please let me know what the command line is for initializing the saver and restarting the training?

Thanks,

J

connorcoley commented 5 years ago

Yes, that is the Reaxys-trained model. Restoring weights for testing begins on this line. To restore weights and continue training, you can set the checkpoint flag to ckpt-10654. You'll also want to delete line 319 of that file and unindent lines 320-327 so that the script restores weights even when it isn't testing or being interactive.
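The effect of that edit can be sketched as follows: restoring stops being conditional on test or interactive mode and depends only on whether a checkpoint was named. This is an illustration of the control flow, not the script's actual code. The helper `maybe_restore`, the `RecordingSaver` stand-in, the directory name, and the way the path is assembled from the flag are all assumptions; the only real API relied on is that a TensorFlow 1.x `tf.train.Saver` exposes `restore(sess, save_path)`.

```python
import os
from typing import Optional


def maybe_restore(saver, session, checkpoint: Optional[str], model_dir: str) -> bool:
    """Restore weights whenever a checkpoint is named, in any run mode.

    `saver` is any object with a restore(session, path) method,
    for example a TensorFlow 1.x tf.train.Saver.
    """
    if not checkpoint:
        return False  # no checkpoint given: training starts from fresh weights
    # Assumed layout: flag "ckpt-10654" maps to the file prefix "model.ckpt-10654".
    restore_path = os.path.join(model_dir, f"model.{checkpoint}")
    saver.restore(session, restore_path)
    return True


class RecordingSaver:
    """Stand-in for a real saver so the sketch runs without TensorFlow."""

    def __init__(self) -> None:
        self.restored = []

    def restore(self, session, path: str) -> None:
        self.restored.append(path)


saver = RecordingSaver()
# "models/my_run" is a hypothetical directory name.
resumed = maybe_restore(saver, session=None, checkpoint="ckpt-10654",
                        model_dir="models/my_run")
fresh = maybe_restore(saver, session=None, checkpoint=None,
                      model_dir="models/my_run")
```

With the restore step run unconditionally like this, the training loop that follows picks up from the restored weights instead of a fresh initialization.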