binghong-ml / retro_star

Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search
MIT License

Details of construction for synthesis routes dataset #2

Closed: hankook closed this issue 3 years ago

hankook commented 3 years ago

I am sincerely thankful for this very interesting work and shared code.

I'm currently trying to construct the synthesis route dataset as you described in the paper, but I failed to obtain 299202 training routes. The following details are obtained from my own construction code:

So I have some questions for construction:

Thanks in advance.

Hankook Lee.

binghong-ml commented 3 years ago

Hi Hankook,

We followed the procedures described in the paper to construct the dataset. Your approach is correct, but I am not sure whether we started from the same USPTO training split.

You can download our training routes here.

This is a list of 299202 routes. Each route is a list of reactions, as in the following example:

['Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(NC(C)C)nc1>>Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1.CC(C)N',
 'Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1>>O=C(Cl)c1ccc(Cl)nc1.Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N',
 'Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N>>Cn1cc(-c2ccc(C3CCNCC3)cc2)cn1.Cc1ccc(C(=O)O)cc1N']

Each reaction is a string of the form "product>>reactantA.reactantB.reactantC": the product SMILES, then ">>", then the reactant SMILES separated by ".".
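For readers who want to work with this format, here is a minimal sketch of splitting a route into its steps and listing the molecules that are never expanded further (the starting materials). The helper names `parse_reaction` and `route_leaves` are hypothetical and not part of the retro_star code. The sketch assumes that every "."-separated fragment is a separate reactant and that a molecule is written with the same SMILES string wherever it appears in a route, which holds for the example above but is not confirmed for the whole dataset.

```python
from typing import List, Tuple


def parse_reaction(rxn: str) -> Tuple[str, List[str]]:
    """Split 'product>>r1.r2' into (product, [r1, r2]).

    Assumption: each '.'-separated fragment is one reactant.
    """
    product, sep, reactants = rxn.partition(">>")
    if not sep:
        raise ValueError(f"not a reaction string: {rxn!r}")
    return product.strip(), [r.strip() for r in reactants.split(".") if r.strip()]


def route_leaves(route: List[str]) -> List[str]:
    """Return reactants that never appear as the product of a step in the route.

    Assumption: molecules are matched by exact SMILES string equality.
    """
    steps = [parse_reaction(rxn) for rxn in route]
    expanded = {product for product, _ in steps}
    leaves: List[str] = []
    for _, reactants in steps:
        for mol in reactants:
            if mol not in expanded and mol not in leaves:
                leaves.append(mol)
    return leaves


# The example route quoted in this thread.
route = [
    "Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(NC(C)C)nc1>>Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1.CC(C)N",
    "Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1NC(=O)c1ccc(Cl)nc1>>O=C(Cl)c1ccc(Cl)nc1.Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N",
    "Cc1ccc(C(=O)N2CCC(c3ccc(-c4cnn(C)c4)cc3)CC2)cc1N>>Cn1cc(-c2ccc(C3CCNCC3)cc2)cn1.Cc1ccc(C(=O)O)cc1N",
]

target = parse_reaction(route[0])[0]  # product of the first step
leaves = route_leaves(route)          # molecules that are never expanded

print("target:", target)
for mol in leaves:
    print("leaf:  ", mol)
```

In this example the first reaction's product is the route's target, and each later reaction expands one reactant of an earlier step. If SMILES strings are not canonicalized consistently across steps, the string comparison in `route_leaves` would need to be replaced by a canonicalization step first.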

The size of eMolecules is 23M. Thanks for spotting the typo!

Best, Binghong

hankook commented 3 years ago

Wow. Thank you for sharing the training routes!

Could you upload other splits (validation/test)? If possible, I want to see the original test routes (I think it may be 10k~20k routes) without cleaning. It will be very useful for demonstrating the generalizability of template-free retro models (e.g., Transformer) with your search algorithm (Retro*) toward unseen reaction templates.

Thanks in advance.

Best, Hankook Lee

binghong-ml commented 3 years ago

Hi Hankook,

Please find the validation set and test set in the links. As for the test set, I am not entirely sure whether this is the version before cleaning, but you can use it to test your ideas anyway.

Binghong

hankook commented 3 years ago

Thanks for sharing, but I currently do not have access to the validation/test sets. Could you grant access so that I can download the files?

Best, Hankook Lee

binghong-ml commented 3 years ago

Thanks for bringing it to our attention! I've fixed the links. Please see if you can access them.

hankook commented 3 years ago

Thanks for sharing the data!