Extracted transformation rules are duplicated, which leaves RolloutPolicyNet and ExpansionPolicyNet without enough training data

If we add the following at line 280 of extract_templates.py:

print(len(set(transforms)))
print(len(transforms))

we find only 1,254,409 unique transformations among the 3,405,187 extracted, which means that about 63% (roughly two thirds) of the extracted transformation rules are duplicates.
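
For reference, the duplicate fraction follows directly from the two printed counts. This is only a minimal sketch: it assumes the entries of `transforms` are hashable (as the `set(transforms)` call implies), and `duplicate_fraction` is a hypothetical helper, not something in the repository.

```python
def duplicate_fraction(transforms):
    """Return the fraction of entries that repeat an earlier entry."""
    total = len(transforms)
    if total == 0:
        return 0.0
    unique = len(set(transforms))
    return (total - unique) / total


# Same arithmetic applied to the counts reported in this issue.
total, unique = 3_405_187, 1_254_409
print(f"{(total - unique) / total:.1%} of the extracted transformations are duplicates")
```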

This causes a problem later when training the RolloutPolicyNet and ExpansionPolicyNet. In theory, each rule in the RolloutPolicyNet should have at least 15 samples. However, because most of those samples are duplicates, in practice we get only 1 or 2 distinct samples per rule, which is not enough for training.

Here I plotted the histogram of the number of samples per rule and we can see that for most rules we only have one sample:

[Image: plot_hist_training]
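
A rough sketch of how the per-rule counts behind such a histogram could be computed once duplicates are dropped. It assumes the training samples can be represented as hashable `(product, rule)` pairs; that representation and every function name below are hypothetical and not taken from the repository. The default threshold of 15 is the figure mentioned above.

```python
from collections import Counter


def samples_per_rule(samples):
    """Count distinct (product, rule) pairs for each rule."""
    distinct = set(samples)  # drop exact duplicate samples
    return Counter(rule for _product, rule in distinct)


def count_histogram(per_rule):
    """Map 'number of samples' to 'number of rules with that many samples'."""
    return Counter(per_rule.values())


def rules_with_enough_samples(per_rule, min_samples=15):
    """Return the rules that have at least min_samples distinct samples."""
    return {rule for rule, n in per_rule.items() if n >= min_samples}


# Toy data: rule "R1" appears three times, but only two samples are distinct.
samples = [("P1", "R1"), ("P1", "R1"), ("P2", "R1"), ("P3", "R2")]
per_rule = samples_per_rule(samples)
histogram = count_histogram(per_rule)
kept = rules_with_enough_samples(per_rule, min_samples=2)
```

Running the same counts before and after deduplication would show how many rules still pass the threshold once repeated samples are removed.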

Any thoughts on this issue?

