frnsys / retrosynthesis_planner

Retrosynthesis planner
GNU General Public License v3.0

Extracted transformation rules are duplicated which results in RolloutPolicyNet and ExpansionPolicyNet not having enough training data #1

Open zas97 opened 5 years ago

zas97 commented 5 years ago

In extract_templates.py, at line 280, if we add

print(len(set(transforms)))
print(len(transforms))

we find only 1,254,409 unique transformations out of the 3,405,187 transformations extracted, which means that nearly two thirds (about 63%) of the transformation rules are duplicates.
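As a quick check of the arithmetic above, here is a minimal sketch. The helper name `duplicate_stats` is hypothetical and not part of the repository; it only assumes that the extracted transforms are hashable, which `set(transforms)` already requires.

```python
def duplicate_stats(transforms):
    """Return (total, unique, duplicate_fraction) for a list of hashable items.

    Hypothetical helper for illustration; not part of extract_templates.py.
    """
    total = len(transforms)
    unique = len(set(transforms))
    fraction = (total - unique) / total if total else 0.0
    return total, unique, fraction


# Counts reported in this issue.
TOTAL_REPORTED = 3_405_187
UNIQUE_REPORTED = 1_254_409
DUPLICATES_REPORTED = TOTAL_REPORTED - UNIQUE_REPORTED
FRACTION_REPORTED = DUPLICATES_REPORTED / TOTAL_REPORTED

# Toy input: 5 entries, 2 distinct values, so 3 of 5 are duplicates.
toy_total, toy_unique, toy_fraction = duplicate_stats(["a", "a", "a", "b", "b"])

print(f"{DUPLICATES_REPORTED} duplicates, {FRACTION_REPORTED:.1%} of all extracted transforms")
```

With the reported numbers the duplicate share is a little over 63%, so "2/3rds" is a slight overestimate but the right order of magnitude.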

This causes a problem later when training the RolloutPolicyNet and ExpansionPolicyNet. In theory, each rule in the RolloutPolicyNet should have at least 15 samples. However, because most of those samples are duplicates, in reality we get only 1 or 2 samples per rule, which is not enough for training.

Here is a histogram of the number of samples per rule; for most rules we have only one sample:

[Image: plot_hist_training]
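The per-rule counts behind such a histogram can be reproduced without plotting. The sketch below is only an illustration under stated assumptions: the `(product, rule)` pair layout, the function names, and the toy data are all hypothetical and do not reflect the repository's actual data structures.

```python
from collections import Counter


def samples_per_rule(samples):
    """Count samples per rule, with and without duplicate samples.

    `samples` is assumed to be a list of hashable (product, rule) pairs;
    this layout is an assumption made for illustration.
    """
    raw = Counter(rule for _, rule in samples)
    distinct = Counter(rule for _, rule in set(samples))
    return raw, distinct


def histogram(counts):
    """Map 'number of samples' to 'number of rules with that many samples'."""
    return Counter(counts.values())


# Toy data: each rule appears 15 times, but almost all pairs are repeats.
toy_samples = (
    [("CCO", "rule_a")] * 15
    + [("CCN", "rule_b")] * 8
    + [("CCC", "rule_b")] * 7
)

raw, distinct = samples_per_rule(toy_samples)
MIN_SAMPLES = 15  # threshold mentioned in the issue
pass_raw = sorted(r for r, n in raw.items() if n >= MIN_SAMPLES)
pass_distinct = sorted(r for r, n in distinct.items() if n >= MIN_SAMPLES)

print("rules passing on raw counts:", pass_raw)
print("rules passing on distinct counts:", pass_distinct)
print("histogram of distinct samples per rule:", dict(histogram(distinct)))
```

In the toy data both rules pass the 15-sample threshold on raw counts, yet neither has more than two distinct samples, which is the situation described above.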

Any thoughts on this issue?

yangxiufengsia commented 4 years ago

@zas97 did you plot the wrong data? The counts in the figure seem to sum to less than 20,000.