Jbru18 opened this issue 6 years ago
The model itself is a simple feedforward neural network, which can in principle take any numerical descriptor vector as input. The `scripts/make_h5_file.py` file contains a few different variations of the Morgan fingerprint calculation, with different lengths and boolean/integer values. We trained a few different models, each using a single one of these input representations.
The code doesn't use any similarity calculations. However, to test different fingerprints or representations, all that needs to be done is preparing a different `.h5` file (matrix) by changing the `mol_to_fp` function in `scripts/make_h5_file.py` and setting the `FP_len` in the training script. For testing the retrained model, you would need to define the `mol_to_fp` function in `nntrain_fingerprint.py` to match the new input representation.
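For concreteness, here is a minimal sketch of what a replacement `mol_to_fp` could look like, assuming RDKit is available; the radius, length, and bit-vs-count choices here are illustrative, not the exact settings used for the released models, and `FP_len` must match whatever you set in the training script:

```python
# Illustrative replacement for mol_to_fp (not the repo's exact implementation).
import numpy as np
from rdkit import DataStructs
from rdkit.Chem import AllChem

FP_len = 1024  # must match FP_len in the training script

def mol_to_fp(mol, radius=2):
    """Convert an RDKit Mol to a fixed-length boolean Morgan fingerprint."""
    if mol is None:
        return np.zeros((FP_len,), dtype=np.float32)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=FP_len)
    arr = np.zeros((FP_len,))                 # buffer for the bits
    DataStructs.ConvertToNumpyArray(bv, arr)  # fill arr with 0/1 values
    return arr.astype(np.float32)
```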
Thanks for the explanation. I have very limited experience with neural networks. Does the model tolerate variable-length fingerprints, or do they all need to have the same length? Since the code does not use similarity, how does the model handle different types of chemistry whose reaction networks have no crossover at lower complexity (e.g. carbohydrate chemistry vs. peptide or aromatic chemistry)? Your article also mentioned that molecules frequently encountered as starting materials would have a lower SCScore. How about reaction chains that frequently end with less structurally diverse molecules? Would a bias towards a higher SCScore be observed?
This model architecture (a standard feed-forward neural network) does not tolerate variable-length fingerprints. The model is a type of nonlinear regression, and its capacity is large enough that those different types of molecules can "exist" in different areas of the network. So certain parts will be more relevant for carbohydrate structures, others will be more relevant for aromatic compounds, etc.
Motifs that appear more often in products (or at the end of a longer chain of reactions) will be biased toward a higher SCScore, yes.
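To see why the fingerprint length is fixed, here is a minimal sketch in TensorFlow 1.x style (which this repo uses); the layer sizes are illustrative, not the released model's settings. The first weight matrix is created with shape `[FP_len, hidden_size]`, so every input vector must have exactly `FP_len` entries:

```python
# Minimal sketch of a fixed-input feed-forward layer (illustrative sizes).
import tensorflow as tf

FP_len, hidden_size = 1024, 300

x = tf.placeholder(tf.float32, shape=[None, FP_len])      # fingerprint input, length fixed
W1 = tf.get_variable('W1', shape=[FP_len, hidden_size])   # input dimension is baked in here
b1 = tf.get_variable('b1', shape=[hidden_size], initializer=tf.zeros_initializer())
h1 = tf.nn.relu(tf.matmul(x, W1) + b1)                    # first hidden layer
```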
Thanks for the insight! How do I continue training the existing Reaxys model? Do I need to start from a checkpoint?
Yep! You'd want to restore from the checkpoint after initializing the saver, then continue on with training.
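Something along these lines, in TensorFlow 1.x; a minimal sketch with a placeholder graph and checkpoint path, not the repo's actual code:

```python
# Sketch of restore-then-continue-training (placeholder graph and path).
import tensorflow as tf

W = tf.get_variable('W', shape=[1024, 1])                # stand-in for the model's weights
loss = tf.reduce_sum(tf.square(W))                       # stand-in for the real loss
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

saver = tf.train.Saver(max_to_keep=None)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'path/to/model.ckpt')            # load the previously trained weights
    for step in range(1000):                             # then keep training as usual
        sess.run(train_op)
```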
Thanks Connor.
Is the latest checkpoint that includes the Reaxys data set `model.ckpt-10654`? Can you please let me know what the command line is for initializing the saver and restarting the training?
Thanks,
J
Yes, that is the Reaxys-trained model. Restoring weights for testing begins on this line. To restore weights and continue training, you can set the `checkpoint` flag to `ckpt-10654`. You'll also want to delete line 319 of that file and unindent lines 320-327 so that the script restores weights even when it isn't testing or being interactive.
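In code terms, the change amounts to something like the following; this is a hypothetical, self-contained skeleton with assumed flag and variable names (`--checkpoint`, `args.test`, `args.interactive`), not the actual contents of `nntrain_fingerprint.py`:

```python
# Hypothetical skeleton: restore whenever a checkpoint is given, then train.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--checkpoint', default='')      # e.g. --checkpoint ckpt-10654
parser.add_argument('--test', action='store_true')
parser.add_argument('--interactive', action='store_true')
args = parser.parse_args()

W = tf.get_variable('W', shape=[1024, 1])             # stand-in for the model graph
train_op = tf.train.AdamOptimizer(1e-4).minimize(tf.reduce_sum(tf.square(W)))

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    if args.checkpoint:
        saver.restore(sess, args.checkpoint)          # restore regardless of test/interactive mode
    if args.test or args.interactive:
        pass                                          # evaluation / interactive branch would go here
    else:
        for step in range(1000):                      # otherwise continue training from the restored weights
            sess.run(train_op)
```

You would then launch training with whatever command line you already use, plus the checkpoint flag, e.g. `python nntrain_fingerprint.py --checkpoint ckpt-10654` (the exact flag spelling is assumed to match the script's existing argument).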
Hi Connor,
From your previous post I learned that a model can be retrained. I am unclear, however, on how to train a model from scratch or expand (augment) the existing training. Can you please explain a little further?
I am also interested in using different fingerprints and/or similarity criteria. I was not clear on what similarity criteria you used for comparing the fingerprints in the current version. Can you please shed some light on this?
Looking at the code, I could not decipher whether multiple fingerprints (as in ECFPs) were generated and compared for each molecule, or just one, as is typical for the Morgan algorithm.
Thanks, J