eleurent / rl-agents

Implementations of Reinforcement Learning and Planning algorithms
MIT License

Flawed management of internal search environments in tree search planning #43

Closed amarildolikmeta closed 4 years ago

amarildolikmeta commented 4 years ago

Some of the tree search algorithms implemented here can show misleadingly high performance because of how the environments are managed inside the tree search. Specifically, to conduct the search, the environment is copied and passed to the planners, but the environment's seed is copied as well. This results in a kind of "foreseeing the future": the planners optimize over the actual random realizations instead of in expectation. This happens in the OLOP planner and also in the deterministic planner (ODP). For the deterministic planner it is less serious, since it is designed for deterministic transitions, but in practice, if you run this planner on a stochastic environment, it performs amazingly well (because it can "predict" the exact future realizations). This can easily be fixed by setting a fresh random seed on the environments after copying them for the planners, e.g. by adding the seeding in the plan method of the planner.
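A minimal sketch of the proposed fix, assuming a gym-style environment that exposes a seed() method (the helper name make_search_env is illustrative, not from the repository):

```python
import copy

import numpy as np


def make_search_env(true_env):
    """Copy the true environment for planning, then re-seed the copy.

    Without re-seeding, the copy carries the true environment's random
    state, so sampled trajectories match the future realizations exactly
    and the planner "foresees the future" instead of optimizing in
    expectation.
    """
    search_env = copy.deepcopy(true_env)
    # Any seed differing from the true environment's works; drawing it from
    # numpy's global generator keeps experiments reproducible, provided that
    # generator is itself seeded.
    search_env.seed(np.random.randint(2**31))
    return search_env
```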

eleurent commented 4 years ago

Hi @amarildolikmeta, that's absolutely right.

I initially experimented a lot with deterministic MDPs, which is why I didn't really notice this issue. I eventually fixed it on a development branch (see e.g. a80279bb128006c1a4d263c805f10b7fcc17066b), and completely forgot to backport the fix to master.

I am planning to merge this branch soon, and will check then that everything is in order.

amarildolikmeta commented 4 years ago

Hi @eleurent, thanks for the quick response. Perfect, then I will close the issue when the branches are merged.

eleurent commented 4 years ago

It should be fine now: https://github.com/eleurent/rl-agents/search?q=%22state.seed%22&unscoped_q=%22state.seed%22

saArbabi commented 4 years ago

@amarildolikmeta / @eleurent, could you clarify how seeding the environment object prior to each tree search iteration creates randomness? I am not sure what is actually being random. Thanks.

amarildolikmeta commented 4 years ago

It depends on what you call randomness. What the re-seeding achieves is that the planner does not "see the future". Without changing the seed, the planner's environments have the same seed as the true environment, which means that the planner is really just choosing the best possible realization while already knowing what the outcome will be. If you change the seed, the realizations of the transitions inside the planning tree will differ from those in the "true" environment, and the planner will do what it is supposed to do: optimize the values in expectation. At the same time, this still keeps the results reproducible.

saArbabi commented 4 years ago

Thanks for the reply, really appreciate it!

You see, seeding the environment makes sense to me in an RL setting. During training, you want to seed the environment so that a) the agent is not trained on the same scenario all the time and b) you can reproduce experiments.

However, in a tree search setting, I still lack understanding. I do not understand how seeding the environment object will randomize the transitions during planning (which is what we want, to optimize values in expectation). Transitions result from the actions of agents in the scene, so unless there is randomness in those actions, I cannot see how the transitions become stochastic. I might be missing something very obvious/silly!



eleurent commented 4 years ago

@amarildolikmeta thanks for the answer! I will elaborate.

I do not understand how seeding the environment object will randomize the transitions during planning (which is what we want, to optimize values in expectation). Transitions result from the actions of agents in the scene, so unless there is randomness in those actions, I cannot see how the transitions become stochastic. I might be missing something very obvious/silly!

There are stochastic environments which, when stepped from a given state s with an action a, randomly transition to a next state s'. This randomness is inherent to the environment's transitions (it typically represents noise, perturbations, or unmodelled effects) and does not stem from randomness in the actions: a deterministic policy will still yield random trajectories. And this randomness is controlled by a seed for reproducibility.
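For concreteness, here is a toy sketch of such an environment (not from the repository): the transition adds noise drawn from the environment's own random generator, so the same action from the same state can land in different next states.

```python
import numpy as np


class NoisyChain:
    """Toy 1-D environment: the action moves the state, plus random noise."""

    def __init__(self, seed=None):
        self.rng = np.random.RandomState(seed)
        self.state = 0.0

    def seed(self, seed):
        self.rng = np.random.RandomState(seed)

    def step(self, action):
        # The transition itself is stochastic: even a deterministic policy
        # yields random trajectories.
        self.state += action + self.rng.normal(scale=0.1)
        return self.state
```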

Now, Monte Carlo tree search algorithms are based on the idea that it is possible to sample random trajectories from the current state s (you need a so-called generative model). In this repository, this is implemented by copying the full environment object, which contains its internal state, but also its seed. Thus, when sampling trajectories from copies of the current state at the root, you will always end up in the same next states, as if the environment were deterministic, since the RandomState/seed is fixed. Re-seeding the copied environments changes the future outcomes when trajectories are sampled, thus reproducing the stochasticity of the dynamics during planning.
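Reusing the NoisyChain sketch above, a short illustration of this coupling and of how re-seeding breaks it (again, just a sketch, not the repository's code):

```python
import copy

# Bug: the deep copy inherits the true environment's random state, so its
# sampled transitions replay exactly the realizations the true environment
# will later produce.
env = NoisyChain(seed=0)
clone = copy.deepcopy(env)
assert clone.step(1.0) == env.step(1.0)  # identical realizations

# Fix: re-seed the copy so its sampled transitions are independent draws.
env = NoisyChain(seed=0)
clone = copy.deepcopy(env)
clone.seed(42)  # any seed other than the true environment's
assert clone.step(1.0) != env.step(1.0)  # different realizations
```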

saArbabi commented 4 years ago

Thank you both for taking the time. This makes sense.