Closed simonsays1980 closed 2 years ago
You could approximate the optimal policy by training several agents for a long time on environment A. Then train an agent on environment B and ask it to generalize to environment A. Finally, compare the best policy trained specifically for A against the policy you trained on B and transferred to A.
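As a rough sketch of that comparison: assuming you have already collected episode returns on environment A for each agent, the gap could be computed as below. The function name generalization_gap and all numbers are hypothetical placeholders, not results from MiniGrid.

```python
from statistics import mean

def generalization_gap(specialist_returns, transferred_returns):
    """Gap between the best A-trained agent and the agent transferred from B.

    specialist_returns: one list of episode returns on A per A-trained agent.
    transferred_returns: episode returns on A of the agent trained on B.
    """
    # The best specialist's mean return stands in for the (unknown) optimum.
    best_specialist = max(mean(returns) for returns in specialist_returns)
    return best_specialist - mean(transferred_returns)

# Made-up episode returns, for illustration only.
specialists_on_a = [[0.90, 0.94, 0.92], [0.95, 0.97, 0.96]]
transferred_to_a = [0.80, 0.84, 0.82]
gap = generalization_gap(specialists_on_a, transferred_to_a)
```

Note that this only bounds the true gap from below, since even the best specialist may still be suboptimal.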
@maximecb Thanks for the answer! That is a nice way to avoid having to find the theoretical solution. How complicated do you think it is to construct the MDP for the MiniGrid environments (with that, value iteration etc. would be possible)? Especially when they are initialized randomly.
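To make the idea concrete, here is a minimal sketch of what I have in mind, assuming the environment is deterministic and small enough to enumerate. It uses a hypothetical 4x4 toy grid with its own step function rather than the MiniGrid API; for MiniGrid the state would presumably also have to include the agent's direction and any carried object.

```python
from collections import deque

# Hypothetical toy grid, NOT MiniGrid: states are (row, col), goal at (3, 3).
SIZE, GOAL = 4, (3, 3)
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def step(state, action):
    """Deterministic simulator: returns (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)  # walls clamp the move
    col = min(max(state[1] + dc, 0), SIZE - 1)
    nxt = (row, col)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def build_mdp(start):
    """Enumerate all reachable non-terminal states by breadth-first search."""
    transitions = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in transitions:
            continue
        transitions[state] = {}
        for action in ACTIONS:
            nxt, reward, done = step(state, action)
            transitions[state][action] = (nxt, reward, done)
            if not done and nxt not in transitions:
                queue.append(nxt)
    return transitions

def value_iteration(transitions, gamma=0.9, tol=1e-10):
    """Compute optimal Q-values for a deterministic tabular MDP."""
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    while True:
        delta = 0.0
        for s, acts in transitions.items():
            for a, (nxt, reward, done) in acts.items():
                # Terminal transitions do not bootstrap from the next state.
                target = reward if done else reward + gamma * max(q[nxt].values())
                delta = max(delta, abs(target - q[s][a]))
                q[s][a] = target
        if delta < tol:
            return q

def max_abs_deviation(q_learned, q_optimal):
    """Largest absolute difference between learned and optimal Q-values."""
    return max(abs(q_learned[s][a] - q_optimal[s][a])
               for s in q_optimal for a in q_optimal[s])

q_star = value_iteration(build_mdp((0, 0)))
```

With q_star in hand, max_abs_deviation(q_learned, q_star) would give the deviation of a learned Q-table from the optimal one, provided both use the same state and action keys.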
Honestly not sure. It's going to be a lot faster for smaller environments. You can also fix the starting configuration by calling env.seed(your_seed_value) each time you do env.reset(), if you want.
Thanks for the insights, @maximecb!
Regarding the randomness: is NonDeterministic a hyperparameter (to infuse randomness) or solely a description of the environment? I guess I need to take a deeper dive into the functionality of the environments.
There is something I noticed when training some agents on EmptyGrid-8x8-v0: from the rendered videos it appears that the agent sometimes stands still, as if there were also a wait or do-nothing action. Is this the case?
The environment is deterministic, but the position of the agent and the configuration of objects can vary randomly depending on the seed for each episode.
There are actions like toggle and drop which don't apply to all environments: https://github.com/maximecb/gym-minigrid/blob/master/gym_minigrid/minigrid.py#L638
Hi, and thanks for providing this environment to the community! I am planning to use MiniGrid for some research on generalization. For this I would like to compute the deviation of learned policies from the optimal one, of learned Q-values from the optimal ones, etc.
Is there any chance to do this with the information contained in the environment?
I mean, unlike the FrozenLake environment, the MiniGrid environments do not provide the transition probabilities. Do you have any idea?