binghong-ml / retro_star

Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search
MIT License

how to build the route training input from the routes #10

Open eacwecwecd opened 3 years ago

eacwecwecd commented 3 years ago

Hi binghong

I am confused about how to build the model's training inputs from routes such as:

['Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(NC(C)C)nc1>>Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1.CC(C)N', 'Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1>>O=C(Cl)c1ccc(Cl)nc1.Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N', 'Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N>>Cn1cc(-c2ccc(C3CCNCC3)cc2)cn1.Cc1ccc(C(=O)O)cc1N'].

I looked at trainer.py and the .pt data file. The training inputs are fps, values, r_costs, t_values, r_fps, and r_masks, all derived from the training routes. Could you share how (ideally with code) to build these inputs from the route data?

binghong-ml commented 3 years ago

The routes are stored in the following format: [reaction 1, reaction 2, ... reaction N]

Each reaction is a string in the following format: 'product_smiles>>reactants_smiles', where multiple reactants are separated by '.'
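As a minimal sketch of reading this format, the snippet below splits each reaction string into its product and reactants. The helper name `parse_route` is hypothetical and not part of the repository; the two reactions are taken from the route quoted in the question. It also lists the molecules that no step in the route produces, on the assumption that these are the route's starting materials.

```python
def parse_route(route):
    """Split each 'product>>reactant1.reactant2' string into (product, [reactants])."""
    steps = []
    for reaction in route:
        product, reactants = reaction.split(">>")
        steps.append((product, reactants.split(".")))
    return steps


# Two consecutive reactions from the route quoted in the question.
route = [
    "Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1"
    ">>O=C(Cl)c1ccc(Cl)nc1.Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N",
    "Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N"
    ">>Cn1cc(-c2ccc(C3CCNCC3)cc2)cn1.Cc1ccc(C(=O)O)cc1N",
]

steps = parse_route(route)

# Molecules that appear only as reactants (assumed here to be starting materials).
products = {product for product, _ in steps}
starting_materials = [
    r for _, reactants in steps for r in reactants if r not in products
]
```

Matching molecules by exact string, as done here, only works if the SMILES in the route are written consistently; canonicalizing them first would be a safer choice in practice.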

The fields in the training set are:

Positive training example:
fps: fingerprint of the target molecule
values: ground-truth cost of the target molecule

Negative training example:
r_values: sum of the reactant values in a negative reaction sample
r_costs: cost of the negative reaction
t_values: ground-truth cost of the target molecule

r_values is computed from r_fps and r_masks:
r_fps: fingerprints of the reactants
r_masks: 0/1 masks. The lists of reactant fingerprints have different lengths, so each list of r_fps is padded into a fixed-size matrix, and r_masks is used to filter out the padded zeros (see https://github.com/binghong-ml/retro_star/blob/master/retro_star/trainer/trainer.py#L43 for usage)
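The sketch below illustrates two of the ideas above in plain Python; it is not the repository's preprocessing code. All function names are hypothetical. It assumes that a molecule's ground-truth cost is the summed cost of the reactions beneath it in the route, that molecules no reaction produces cost 0, and that each reaction costs a placeholder 1.0 (the real reaction cost has to come from wherever the training data defines it). Toy bit vectors stand in for real fingerprints, and `toy_value_fn` stands in for the learned value network.

```python
def route_values(steps, reaction_cost=lambda product, reactants: 1.0):
    """Cost of each product = cost of its reaction + costs of its reactants.

    Assumption: molecules that no step produces are leaves with cost 0.
    """
    by_product = {product: reactants for product, reactants in steps}
    values = {}

    def value(mol):
        if mol not in by_product:
            return 0.0  # leaf of the route
        if mol not in values:
            reactants = by_product[mol]
            values[mol] = reaction_cost(mol, reactants) + sum(
                value(r) for r in reactants
            )
        return values[mol]

    for product in by_product:
        value(product)
    return values


def pad_reactant_fps(r_fps, max_reactants, fp_dim):
    """Pad each list of reactant fingerprints to max_reactants rows.

    Returns the padded lists and 0/1 masks marking the real (unpadded) rows.
    """
    padded, masks = [], []
    for fps in r_fps:
        n = len(fps)
        if n > max_reactants:
            raise ValueError("reaction has more reactants than max_reactants")
        padded.append(fps + [[0] * fp_dim for _ in range(max_reactants - n)])
        masks.append([1] * n + [0] * (max_reactants - n))
    return padded, masks


def masked_r_values(padded, masks, value_fn):
    """Sum value_fn over reactant rows, zeroing out the padded rows."""
    return [
        sum(m * value_fn(fp) for fp, m in zip(row, mask_row))
        for row, mask_row in zip(padded, masks)
    ]


# Placeholder molecule names (not SMILES): A is made from B and C, C from D and E.
values = route_values([("A", ["B", "C"]), ("C", ["D", "E"])])

# Two negative reactions with one and two reactants, 4-bit toy fingerprints.
r_fps = [[[1, 0, 0, 1]], [[0, 1, 0, 0], [1, 1, 0, 0]]]
padded, masks = pad_reactant_fps(r_fps, max_reactants=3, fp_dim=4)


def toy_value_fn(fp):
    # Nonzero even for an all-zero row, so the mask visibly matters.
    return 1.0 + sum(fp)


r_values = masked_r_values(padded, masks, toy_value_fn)
```

Because `toy_value_fn` returns a nonzero value for an all-zero fingerprint, dropping the mask would inflate `r_values` for reactions with fewer reactants, which is the problem the padding masks are there to avoid.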