Closed: hankook closed this issue 3 years ago
Hi Hankook,
We followed the procedure described in the paper to construct the dataset. Your approach is correct, but I am not sure whether we started from the same USPTO training split.
You can download our training routes here.
This is a list of 299202 routes. Each route is a list of reactions as in the following example: ['Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(NC(C)C)nc1>>Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1.CC(C)N', 'Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1>>O=C(Cl)c1ccc(Cl)nc1.Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N', 'Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N>>Cn1cc(-c2ccc(C3CCNCC3)cc2)cn1.Cc1ccc(C(=O)O)cc1N']
Each reaction is a string of the form "product>>reactantA.reactantB.reactantC".
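For anyone reading the routes programmatically, here is a minimal sketch of splitting reaction strings in the format described above. The helper names `parse_reaction` and `route_starting_materials` and the two-step toy route are hypothetical and not part of the released code. The sketch assumes that "." only separates reactants and that the same molecule is always written as the same SMILES string within a route, as in the example above.

```python
from typing import List, Tuple


def parse_reaction(reaction: str) -> Tuple[str, List[str]]:
    """Split 'product>>reactantA.reactantB' into (product, [reactants])."""
    product, sep, reactants = reaction.partition(">>")
    if not sep:
        raise ValueError(f"missing '>>' in reaction string: {reaction!r}")
    # Assumption: '.' only separates reactants in this dataset.
    return product.strip(), [r.strip() for r in reactants.split(".") if r.strip()]


def route_starting_materials(route: List[str]) -> List[str]:
    """Return reactants that are never the product of any step in the route."""
    parsed = [parse_reaction(r) for r in route]
    products = {product for product, _ in parsed}
    leaves: List[str] = []
    for _, reactants in parsed:
        for smiles in reactants:
            # Plain string comparison; assumes consistently written SMILES.
            if smiles not in products and smiles not in leaves:
                leaves.append(smiles)
    return leaves


# Hypothetical two-step toy route, only to illustrate the format.
toy_route = [
    "CC(=O)OCC>>CC(=O)O.CCO",
    "CC(=O)O>>CC(=O)Cl.O",
]
print(parse_reaction(toy_route[0]))
print(route_starting_materials(toy_route))
```

In a route such as the one quoted above, the intermediate produced by each later reaction appears as a reactant of the reaction before it, so the remaining reactants are the route's starting materials.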
The size of eMolecules is 23M. Thanks for spotting the typo!
Best, Binghong
Wow. Thank you for sharing the training routes!
Could you upload other splits (validation/test)? If possible, I want to see the original test routes (I think it may be 10k~20k routes) without cleaning. It will be very useful for demonstrating the generalizability of template-free retro models (e.g., Transformer) with your search algorithm (Retro*) toward unseen reaction templates.
Thanks in advance.
Best, Hankook Lee
Hi Hankook,
Please find the validation and test sets in the links. As for the test set, I am not entirely sure whether this is the version before cleaning, but you can use it to test your ideas anyway.
Binghong
Thanks for sharing, but I currently do not have access to the validation/test sets. Could you grant access so that I can download the files?
Best, Hankook Lee
Thanks for bringing it to our attention! I've fixed the links. Please see if you can access them.
Thanks for sharing the data!
I am sincerely thankful for this very interesting work and shared code.
I'm currently trying to construct the synthesis route dataset as you described in the paper, but I could not obtain the 299202 training routes. The following details are from my own construction code:
raw_train.csv (from GLN) for the USPTO training dataset, and origin_dict.csv (from the shared link) for eMolecules.
So I have some questions about the construction:
Thanks in advance.
Hankook Lee.