eleurent / rl-agents

Implementations of Reinforcement Learning and Planning algorithms
MIT License

stochastic transitions for tree search agents #44

Closed saArbabi closed 4 years ago

saArbabi commented 4 years ago

Eleurent, thanks for developing this great project and sharing it.

To my understanding, the MCTS agent currently transitions deterministically to new states during the planning phase. I was wondering which class you would modify to account for stochastic transitions, for instance if Gaussian noise were added to the actions executed by other agents?

Thank you in advance

eleurent commented 4 years ago

Hello @saArbabi. This was the case until very recently (see #43). The UCT algorithm (the "MCTS agent") now uses a different random seed for each rollout during the planning phase (see this line).
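To illustrate what seeding each rollout differently means, here is a minimal sketch. It is not the repository's code: `step`, `rollout_return`, the noise scale and the reward are all hypothetical and only stand in for a stochastic environment.

```python
import random


def step(state, action, rng):
    """Toy stochastic transition: the executed action is perturbed by Gaussian noise."""
    return state + action + rng.gauss(0.0, 0.1)


def rollout_return(initial_state, actions, seed):
    """Simulate a fixed action sequence with a rollout-specific random seed."""
    rng = random.Random(seed)  # a fresh generator per rollout
    state, total = initial_state, 0.0
    for action in actions:
        state = step(state, action, rng)
        total += -abs(state)  # hypothetical reward: stay close to 0
    return total


# The same action sequence is evaluated under several noise realisations,
# so its value estimate is an average rather than a single deterministic outcome.
returns = [rollout_return(1.0, [-0.5, -0.5], seed) for seed in range(5)]
estimate = sum(returns) / len(returns)
```

With a single shared seed every rollout of the same action sequence would see identical noise, and the planner would effectively be planning in a deterministic model.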

Now, note that there are two ways in which stochastic transitions can be handled: open-loop planning and closed-loop planning.

Open-loop algorithms are sub-optimal compared to closed-loop algorithms. However, the closed-loop planning algorithms implemented here only work when the support of the transition distribution is finite. Indeed, a new node is created for every random state encountered, and serves as the root of the resulting planning subtree. But with e.g. Gaussian noise, the same next state will never be encountered twice, which means that the algorithm is going to keep creating new nodes with a single visit each and won't be able to explore/exploit.
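The difference between finite and infinite support can be seen with a small counting experiment. This is only a sketch under assumed toy distributions; `count_visits` is a hypothetical helper that treats each distinct sampled next state as one tree node.

```python
import random
from collections import Counter


def count_visits(sample_next_state, n_samples=1000, seed=0):
    """Count how often each distinct next state (i.e. each child node) is sampled."""
    rng = random.Random(seed)
    return Counter(sample_next_state(rng) for _ in range(n_samples))


# Finite support: only three possible next states, so nodes are revisited many times.
finite = count_visits(lambda rng: rng.choice([-1, 0, 1]))

# Infinite support: Gaussian noise, so practically every sample is a brand-new state.
gaussian = count_visits(lambda rng: rng.gauss(0.0, 1.0))

print(len(finite), max(finite.values()))
print(len(gaussian), max(gaussian.values()))
```

In the finite case the samples concentrate on a handful of nodes whose statistics become meaningful, whereas in the Gaussian case essentially every node is visited once and no value estimate can build up below the first level.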

I am not familiar with any tree-based planning algorithm that handles stochastic transitions with infinite support (e.g. Gaussian). This would require the ability to aggregate similar next states together, based on some suitable criterion. I think you could consider two approaches:

Does that help?

saArbabi commented 4 years ago

Thanks @eleurent for the quick response!

I need to spend some time digging deeper into the code to make sure I fully understand your suggestion. Having done some research, I know that for infinite support (e.g. actions perturbed by Gaussian noise), UCT with progressive widening (PW) is used. PW is also used for handling continuous observations in the case of POMDPs. If I make any useful progress I will certainly share it with you or make a pull request. For now I will close the issue.
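For readers unfamiliar with progressive widening, the core idea can be sketched as follows. This is an illustrative sketch, not code from rl-agents: the `ChanceNode` class, the parameters `k` and `alpha`, and the uniform choice among existing children are all assumptions made for the example. A new next state is only added while the number of children is below `k * visits ** alpha`; otherwise an existing child is revisited.

```python
import random


class ChanceNode:
    """Hypothetical node holding the sampled next states of one (state, action) pair."""

    def __init__(self, k=1.0, alpha=0.5):
        self.k, self.alpha = k, alpha
        self.visits = 0
        self.children = {}  # next_state -> visit count

    def sample_child(self, sample_next_state, rng):
        self.visits += 1
        if len(self.children) < self.k * self.visits ** self.alpha:
            # Widen: draw a fresh next state from the (continuous) transition model.
            state = sample_next_state(rng)
        else:
            # Do not widen: revisit an existing child (uniformly, for simplicity).
            state = rng.choice(list(self.children))
        self.children[state] = self.children.get(state, 0) + 1
        return state


rng = random.Random(0)
node = ChanceNode()
for _ in range(1000):
    node.sample_child(lambda r: r.gauss(0.0, 1.0), rng)
print(len(node.children), max(node.children.values()))
```

The number of children grows sublinearly with the visit count, so each sampled next state is revisited and can serve as the root of a subtree with meaningful statistics, even though the noise is Gaussian.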