MCTS not giving what is expected

queliyong commented 1 year ago

Hi, I find a hidden bug in current MCTS. The two routes bellow are almost the same despite the templates at first step are different. 下载 (1) And by accident, I find a shorter route which can hardly be found by current MCTS. 下载 (2) The reason I guess is that despite after first iteration, the Q and N updated along the route, but in the second iteration, when different template is chosen, and the same expandable molecule is encountered, the experience got from the first iteration is gone. So, my suggestion is the Q and N should be inherited from past iteration when the same expandable molecule is encountered in the next iteration. By the way, I find several target molecules have the same problem when I check their routes.

SGenheden commented 1 year ago

Thanks so much for the feedback. This is very interesting, and I would like to look closer at this. Could you be so kind and provide details about your setup:

what is the input SMILES?
what is the stock, expansion, and filter models?
can you provide me a configuration file in case you have non-default settings?

SGenheden commented 1 year ago

I think in general that the first case, when two templates generate the same pre-cursors only one child node should be added to the search tree. This is something that we have seen before, and is on our to-do list to fix.

queliyong commented 1 year ago

The target is 'C#CC1=CC=CC(NC2=NC=NC3=CC(OCCOC)=C(OCCOC)C=C32)=C1'. The stock, expansion are regular default respectively: zinc, full_uspto, and none filter. The shorter route I found just sets up: cutoff_number:10, iteration_limit: 1000.

queliyong commented 1 year ago

I think in general that the first case, when two templates generate the same pre-cursors only one child node should be added to the search tree. This is something that we have seen before, and is on our to-do list to fix.

This is a better idea.

SGenheden commented 1 year ago

After reviewing this closely, I have three remarks:

Grouping near-identical precursors like in the first example, is a request for a new feature. This was never the intention of the MCTS to have that level of knowledge. We will implement grouping of identical precursors, but to group near-identical precursors might be or might not be what a user expects from the algorithm. How to generalize this to no. of precursors > 2 is also something that is not clear how to do.
The shorter routes are found after more than 200 iterations, because the template that generates this pre-cursors is ranked lower than the template leading to a longer route. The probability is roughly 0.51 vs 0.14 which is turned into priors for the MCTS algorithm. Thus this is not a bug either but an inherent feature of the algorithm + expansion policy. How to "fix" this is a research question.
I want to make you aware that there are newer USPTO models. I am lead to believe that you are using the old one, as you are referring to it as "full_uspto".

queliyong commented 1 year ago

Hi, SG: Thank u for your immediate reply. But I have no idea if you got what I meant. In your first remark, I wanna know if the meaning of "grouping" is "when two templates generate the same pre-cursors only one child node should be added to the search tree" you mentioned above? if so, I think grouping near-identical precursors like the first step of the two longer routes is meaningful, because it is the same expandable mol at first step that leads to the same steps afterwards in the second route compared with the first route. So it should be grouped when the same expandable mol is encountered from a different template at next iteration, then I believe the shorter route may be found within 100 iteration despite of its lower probability. By the way, I can not get the meaning of "How to generalize this to no. of precursors > 2 is also something that is not clear how to do", my idea is no matter how many number of precursors, as long as a same expandable mol is found within the precursors from a different template, it should be grouped. In your second remark, I agree it is an inherent feature of the algorithm + expansion policy. And I think what I mentioned above could fix this: illuminated by your comment "when two templates generate the same pre-cursors only one child node should be added to the search tree", my idea is that when two templates generate near-identical precursors, in which there is a same expandable mol, only one child node should be added to the search tree. Then the shorter route may be found easily within 100 iteration despite of its lower probability. In your third remark, I indeed used the older uspto expansion policy named "full_uspto_model" with 46695 unique templates. And I also find the newest USPTO model named "uspto_keras_model" is more powerful than the older one, despite the number of its unique templates fewer than 46695. So I wanna learn how to filter and curate it from the older templates, is there any publication to be followed? Thank u very much.

SGenheden commented 1 year ago

We will adress this in a future release: There will be an option controlling the grouping of sibling nodes in the MCTS algorithm. I will have three valid choices:

No grouping - this is the current behavior, and will continue to be the default
Group on identical - this will only group sibling nodes if all molecules / pre-cursors in the two nodes are the same
Group on expandable - this will only group sibling nodes if all expandable molecules in the two nodes are the same (the molecules / precursors in stock will be excluded from the comparison) The difference between 2. and 3. is a really minor implementation issue, so we will leave the choice to the user. Hopefully this will enable the search to identify the short route in your example.

SGenheden commented 1 year ago

Regarding your question on the USPTO dataset: I am not following what you mean with "learn how to filter and curate it from the older templates"? The new USPTO model is described in this publication: https://pubs.acs.org/doi/10.1021/acs.jcim.2c01486 and is not easily comparable to the old one because of the very different approaches to manipulate the reaction data.

queliyong commented 1 year ago

Thank you very much for your advice and discussion. I'll close it.

queliyong commented 9 months ago

We will adress this in a future release: There will be an option controlling the grouping of sibling nodes in the MCTS algorithm. I will have three valid choices:

No grouping - this is the current behavior, and will continue to be the default

Group on identical - this will only group sibling nodes if all molecules / pre-cursors in the two nodes are the same

Group on expandable - this will only group sibling nodes if all expandable molecules in the two nodes are the same (the molecules / precursors in stock will be excluded from the comparison) The difference between 2. and 3. is a really minor implementation issue, so we will leave the choice to the user. Hopefully this will enable the search to identify the short route in your example.

Hello, Samuel: Is this new release still on your agenda? or this plan has been cancelled? Thank u very much.

SGenheden commented 9 months ago

Hello. We are in the process of preparing a new release that will include these features It should be available in the beginning of December

SGenheden commented 9 months ago

Feature for grouping sibling nodes available in new release:

search:
  algorithm_config:
    mcts_grouping: partial

queliyong commented 9 months ago

Feature for grouping sibling nodes available in new release:
search:
  algorithm_config:
    mcts_grouping: partial

So kind of u! Does it work for retrostar after a little of modification?

SGenheden commented 9 months ago

We haven't introduce this for Retro*, our focus has been on the MCTS algorithm. The implemention would be completely different for Retro*. If you are really interested in a Retro* implementation, please make a new issue with a feature request.

MolecularAI / aizynthfinder

MCTS not giving what is expected #110