In BSuite, DeepSea generates a random action mapping and keeps this mapping fixed across resets. The main purpose of a random action mapping is to ensure that a DQN agent cannot trivially solve the environment simply by having a bias towards the action "right".
Currently, Gymnax's DeepSea-bsuite implementation either:

- uses a deterministic action mapping if `deterministic` is True;
- randomly generates an action mapping on every reset if `sample_action_map` is True and `deterministic` is False;
- uses a default action map, which is itself deterministic, when `sample_action_map` is False and `deterministic` is False.
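For concreteness, the branching above amounts to something like the following minimal sketch (the grid size, function name, and mapping representation are assumptions for illustration, not Gymnax's actual code):

```python
import jax
import jax.numpy as jnp

SIZE = 8  # hypothetical grid size, for illustration only


def make_action_mapping(key, deterministic, sample_action_map):
    """Sketch of the three branches described above; names are assumptions."""
    if deterministic:
        # Fixed mapping: action 1 always means "right".
        return jnp.ones((SIZE, SIZE), dtype=jnp.int32)
    if sample_action_map:
        # A fresh random mapping drawn from the reset key, every episode.
        return jax.random.bernoulli(key, 0.5, (SIZE, SIZE)).astype(jnp.int32)
    # Default mapping, which is currently deterministic as well.
    return jnp.ones((SIZE, SIZE), dtype=jnp.int32)
```

Note that only the middle branch ever uses the key, so two of the three configurations collapse to the same deterministic mapping.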
This poses a few problems:
1. It is not possible to use a random mapping without making the transitions stochastic.
2. Getting the default behaviour of BSuite, i.e. a fixed random mapping, requires workarounds such as generating the mapping by hand and setting `env.action_mapping`, or resetting the environment with a fixed key, which is not ideal for general agent-environment loops.
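For reference, the fixed-key workaround boils down to something like this sketch (the `reset` signature and grid size are simplified assumptions; the point is only that the mapping key is decoupled from the per-episode keys):

```python
import jax
import jax.numpy as jnp

MAPPING_KEY = jax.random.PRNGKey(42)  # fixed key reserved for the mapping


def reset(episode_key, size=8):
    """Simplified reset: the mapping deliberately ignores episode_key."""
    # Hand-generated mapping from the fixed key, mimicking what one would
    # assign to env.action_mapping before running the agent-environment loop.
    action_mapping = jax.random.bernoulli(
        MAPPING_KEY, 0.5, (size, size)
    ).astype(jnp.int32)
    # Start at the top-left corner, as in DeepSea.
    start_state = (jnp.int32(0), jnp.int32(0))
    return start_state, action_mapping
```

This works, but it forces every training loop to know about DeepSea's internals, which is exactly what a general agent-environment interface should avoid.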
I think problem 1 is just a bug: the `deterministic` environment parameter should probably just distinguish between BSuite's "DeepSea" and "DeepSea Stochastic" environments.
Problem 2 could perhaps be fixed by changing the default `env.action_mapping`, or by adding the action mapping to `env_state`.
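One way to realize the second option is to thread the mapping through the state so that later resets can reuse it; a minimal sketch, where the `EnvState` fields and `reset` signature are hypothetical:

```python
from typing import NamedTuple, Optional

import jax
import jax.numpy as jnp


class EnvState(NamedTuple):
    """Hypothetical DeepSea state that carries its own action mapping."""
    row: jnp.ndarray
    column: jnp.ndarray
    action_mapping: jnp.ndarray


def reset(key, prev_state: Optional[EnvState] = None, size=8):
    # Sample the mapping only on the very first reset; afterwards it is
    # carried along in the state and reused unchanged.
    if prev_state is None:
        mapping = jax.random.bernoulli(key, 0.5, (size, size)).astype(jnp.int32)
    else:
        mapping = prev_state.action_mapping
    return EnvState(row=jnp.int32(0), column=jnp.int32(0),
                    action_mapping=mapping)
```

Keeping the mapping in the state also plays well with `jit`/`vmap`, since it becomes ordinary traced data rather than a mutable attribute on the environment object.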
Finally, the `randomize_actions` environment parameter is currently unused, and it is unclear to me why the `sample_action_map` option exists. Surely randomly generating the action mapping at the start of every episode makes the problem completely impossible to solve?