binghong-ml / retro_star

Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search
MIT License

how to work with the output #12

Closed csulok closed 3 years ago

csulok commented 3 years ago

Hey! I found your paper very interesting. While comparing some of my test molecules for which the search succeeds against ones for which it fails, I got a bit lost on how the default serialized output is meant to be used and why it recommends certain steps. Do you have any tips or pointers on how the serialized output could be adapted to point back to the original template reaction?

Furthermore, I'm trying to determine how the output is affected by updates to the eMolecules dataset (e.g. the ~3 million new starting materials available now compared to 2019), but I can't find instructions on how to create a new origin_dict.csv from a "versions.smi" file provided by eMolecules. The idx column looks similar to the parent IDs provided by eMolecules, but when I spot-checked a few IDs they didn't match. What is the source of these IDs?

binghong-ml commented 3 years ago

Hi! Thanks for your interest in our work!

If you want to print the template information as well, take a look at this file, which handles the serialization of the search results: https://github.com/binghong-ml/retro_star/blob/master/retro_star/alg/syn_route.py. Add self.templates[idx] to the serialize_reaction method so that the template is included in the serialized string you see from the outside.
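As a rough illustration of that suggestion, here is a minimal sketch. `RouteSketch` is a hypothetical stand-in, not the real class in syn_route.py; only the names `self.templates[idx]` and `serialize_reaction` come from this thread, and the other attributes (`mols`, `costs`, `children`) and the output layout are assumptions made for the example.

```python
class RouteSketch:
    """Hypothetical stand-in for the route object in syn_route.py."""

    def __init__(self, mols, templates, costs, children):
        self.mols = mols            # SMILES string per node (assumed attribute)
        self.templates = templates  # template string per node, None for leaves
        self.costs = costs          # reaction cost per node, None for leaves (assumed)
        self.children = children    # list of child indices per node, None for leaves (assumed)

    def serialize_reaction(self, idx):
        s = self.mols[idx]
        if self.children[idx] is None:
            # Leaf node: no reaction was applied, so there is no template to print.
            return s
        reactants = ".".join(self.mols[c] for c in self.children[idx])
        # The only change relative to a template-free serialization is
        # the extra self.templates[idx] field.
        return "%s  using %s at cost of %.4f   %s" % (
            s, self.templates[idx], self.costs[idx], reactants)


route = RouteSketch(
    mols=["CCO", "CC(=O)O"],
    templates=["T0", None],  # "T0" is a placeholder, not a real template
    costs=[0.2289, None],
    children=[[1], None],
)
print(route.serialize_reaction(0))
```

The real method may build its string differently; the point is only that the template is available per node and can be appended wherever the reaction is serialized.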

For your second question, I think eMolecules updates its dataset every month (https://downloads.emolecules.com/free/). It's possible that we used an earlier version of the dataset, but I don't think that will affect the results much, as long as your training and test sets are consistent, i.e. come from the same distribution or dataset.
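One way to gauge the effect of a dataset update without rebuilding origin_dict.csv is to check whether the leaf molecules of a route are present in each version of the starting-material list. The sketch below assumes a whitespace-separated file with the SMILES in the first column and one header line; the header names and the helper functions `load_smiles` and `missing_leaves` are hypothetical, and plain string comparison only works if both sides use the same SMILES canonicalization.

```python
def load_smiles(lines, has_header=True):
    """Collect the first whitespace-separated field of each line as a SMILES string."""
    stock = set()
    for i, line in enumerate(lines):
        if has_header and i == 0:
            continue  # skip the assumed header line
        fields = line.split()
        if fields:
            stock.add(fields[0])
    return stock


def missing_leaves(leaves, stock):
    """Return the route leaves that are not available as starting materials."""
    return [smi for smi in leaves if smi not in stock]


# Toy data standing in for two snapshots of a starting-material file.
old_stock = load_smiles(["smiles id", "CCO 1"])
new_stock = load_smiles(["smiles id", "CCO 1", "CC(C)=CCCI 2"])

leaves = ["CC(C)=CCCI", "Cc1ccccc1CC(=O)O"]
print(missing_leaves(leaves, old_stock))
print(missing_leaves(leaves, new_stock))
```

With a real file, `load_smiles(open(path))` would work the same way, since the function only needs an iterable of lines.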

csulok commented 3 years ago

hey! thanks for the clarifications.

I've taken a look at syn_route.py and was able to add the template reaction strings, thanks for the suggestion! For the SMILES input CC(C)=CCC[C@@H](CO)C1=C(C)C=CC=C1 (with default parameters), I was able to extract the three reactions of the best route that retro_star suggests. What I'm still missing, however, is the source information for those reactions. Let me clarify:

Using a modified syn_route.py, this is the output:

CC(C)=CCC[C@@H](CO)C1=C(C)C=CC=C1  using ([C:2]-[CH2;D2;+0:1]-[OH;D1;+0:3])>>O-[C;H0;D3;+0:1](-[C:2])=[O;H0;D1;+0:3] at cost of 0.2289   CC(C)=CCC[C@@H](C(=O)O)c1ccccc1C

then

CC(C)=CCC[C@@H](C(=O)O)c1ccccc1C  using ([C:1]-[C@H;D3;+0:2](-[c:3])-[C:4](=[O;D1;H0:5])-[O;D1;H1:6])>>[C:1]-[CH;D3;+0:2](-[c:3])-[C:4](=[O;D1;H0:5])-[O;D1;H1:6] at cost of 0.0264   CC(C)=CCCC(C(=O)O)c1ccccc1C

then

CC(C)=CCCC(C(=O)O)c1ccccc1C  using ([C:2]-[CH2;D2;+0:1]-[CH;D3;+0:6](-[c:7])-[C:4](=[O;D1;H0:3])-[O;D1;H1:5])>>I-[CH2;D2;+0:1]-[C:2].[O;D1;H0:3]=[C:4](-[O;D1;H1:5])-[CH2;D2;+0:6]-[c:7] at cost of 0.0331   CC(C)=CCCI.Cc1ccccc1CC(=O)O
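The three output lines above share one layout (product, `using`, template, `at cost of`, cost, reactants), so they can be split into fields for further processing. This is a minimal sketch that assumes exactly that layout, which comes from the modified syn_route.py in this comment and not from stock retro_star; `parse_step` is a hypothetical helper.

```python
import re

# product  using TEMPLATE at cost of COST   reactant1.reactant2
# SMILES and SMARTS contain no whitespace, so \S+ is enough for each field.
STEP_RE = re.compile(
    r"^(?P<product>\S+)\s+using\s+(?P<template>\S+)"
    r"\s+at cost of\s+(?P<cost>[0-9.]+)\s+(?P<reactants>\S+)$"
)


def parse_step(line):
    """Split one serialized reaction line into its fields."""
    m = STEP_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognized step format: %r" % line)
    return {
        "product": m.group("product"),
        "template": m.group("template"),
        "cost": float(m.group("cost")),
        "reactants": m.group("reactants").split("."),
    }


line = (
    "CC(C)=CCCC(C(=O)O)c1ccccc1C  using "
    "([C:2]-[CH2;D2;+0:1]-[CH;D3;+0:6](-[c:7])-[C:4](=[O;D1;H0:3])-[O;D1;H1:5])"
    ">>I-[CH2;D2;+0:1]-[C:2].[O;D1;H0:3]=[C:4](-[O;D1;H1:5])-[CH2;D2;+0:6]-[c:7]"
    " at cost of 0.0331   CC(C)=CCCI.Cc1ccccc1CC(=O)O"
)
step = parse_step(line)
print(step["reactants"], step["cost"])
```

Having the template as a separate field is what makes a later lookup against the source reactions possible.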

Here the alkylation, chiral resolution, and reduction of the acid to the alcohol all make sense, but I'd really like to match these steps to the original reactions to get patent IDs. I ran a substructure (subreaction?) search in the 2001_Sep2016_USPTOapplications_smiles.rsmi file to verify that the templates work the way I understood from page 7 of the paper. Based on this, I think the information I'm looking for is lost from this output during the rdchiral template extraction process.

Unless I'm mistaken about this, I'll follow up by digging into rdchiral. Please feel free to close this issue!

binghong-ml commented 3 years ago

I see what you mean. We did not record which original chemical reactions match each template substructure (SMARTS). If you want to trace back to those reactions, I suggest re-running the template extraction and recording that information as you go. I am closing this issue. Thanks!
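A minimal sketch of that bookkeeping follows: while extracting templates, keep a map from each template string to the patent IDs of the reactions that produced it. The extractor is passed in as a callable so the sketch stays independent of any particular library; a wrapper around rdchiral's template extraction would be one candidate. The helper names and the assumption that the .rsmi file is tab-separated with the reaction SMILES in the first column and the patent number in the second are not confirmed in this thread.

```python
import csv
from collections import defaultdict


def read_rsmi(path):
    """Yield (reaction_smiles, patent_id) pairs from a tab-separated file.

    Assumes one header line, reaction SMILES in column 0, patent ID in column 1.
    """
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the assumed header line
        for row in reader:
            if len(row) >= 2:
                yield row[0], row[1]


def build_template_index(rows, extract_template):
    """Map each extracted template string to the patent IDs it came from."""
    index = defaultdict(list)
    for rxn_smiles, patent_id in rows:
        template = extract_template(rxn_smiles)
        if template is not None:  # extraction can fail for some reactions
            index[template].append(patent_id)
    return dict(index)


# Toy rows and a stub extractor, used only to show the data flow.
rows = [("A.B>>C", "US1"), ("D.E>>C", "US2"), ("F>>G", "US3")]
stub_extractor = lambda rxn: "tmpl:" + rxn.split(">>")[1]
index = build_template_index(rows, stub_extractor)
print(index)
```

A template reported in a route can then be looked up in `index` to list candidate source patents, provided the stored template strings are produced the same way as the ones the model was trained on.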