DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Why does DQN in Stable Baselines3 not support MultiDiscrete action spaces? #745

Closed · PBerit closed this issue 2 years ago

PBerit commented 2 years ago

Hi all,

I would like to use Deep Q-Learning, and I have a continuous state space and 3 actions (thus a 3-dimensional action space). What I found confusing is that the Stable Baselines3 version of DQN can't handle MultiDiscrete action spaces according to the official website https://stable-baselines3.readthedocs.io/en/master/guide/algos.html#.

As far as I understand it, a MultiDiscrete action space is just multiple discrete actions, each with its own number of possible values. Normally, an Artificial Neural Network, as used in Deep Q-Learning for the mapping, should be able to map any number of inputs to any number of outputs. So why can the Stable Baselines3 version of DQN only have 1 action variable?
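For reference, a MultiDiscrete space is just a vector of independent discrete choices. A minimal illustration with Gym (the dimension sizes below are made up for the example):

```python
from gym import spaces

# Hypothetical example: three sub-actions with 5, 3, and 2 choices each.
space = spaces.MultiDiscrete([5, 3, 2])
print(space.sample())  # e.g. array([4, 0, 1]) -- one choice per dimension
```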

Miffyli commented 2 years ago

Regular DQN only works with a Discrete action space, where it chooses one action from many, because it predicts the Q-value for every possible action. A straightforward (but crude) way to handle a MultiDiscrete space is to predict a Q-value for every possible combination of the individual Discrete choices, but the number of combinations quickly blows up.
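To make the blow-up concrete, here is a minimal sketch of that crude flattening approach as a Gym ActionWrapper (the wrapper name is made up; this is not part of stable-baselines3):

```python
import gym
import numpy as np

class FlattenMultiDiscrete(gym.ActionWrapper):
    """Expose a MultiDiscrete action space as one flat Discrete space
    by enumerating every combination of sub-actions."""

    def __init__(self, env: gym.Env):
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.MultiDiscrete)
        self.nvec = env.action_space.nvec
        # e.g. MultiDiscrete([4, 3, 2]) -> Discrete(4 * 3 * 2) = Discrete(24)
        self.action_space = gym.spaces.Discrete(int(np.prod(self.nvec)))

    def action(self, act):
        # Decode the flat index back into one index per sub-action.
        return np.array(np.unravel_index(act, self.nvec))
```

With such a wrapper, MultiDiscrete([4, 3, 2]) becomes Discrete(24), but MultiDiscrete([10, 10, 10]) is already Discrete(1000), which is why this approach does not scale.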

For a more sophisticated approach (and more details on the matter), see this paper on branching DQN.
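For context, the branching idea replaces the single output layer with one Q-head per action dimension, so the number of outputs grows additively rather than multiplicatively. A rough, hypothetical PyTorch sketch of that architecture (not stable-baselines3 code):

```python
import torch
import torch.nn as nn

class BranchingQNet(nn.Module):
    """Shared trunk plus one Q-head per action dimension:
    sum(branches) outputs instead of prod(branches)."""

    def __init__(self, obs_dim, branches):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        # One head per MultiDiscrete dimension, e.g. branches=[4, 3, 2].
        self.heads = nn.ModuleList(nn.Linear(64, n) for n in branches)

    def forward(self, obs):
        z = self.trunk(obs)
        return [head(z) for head in self.heads]

net = BranchingQNet(obs_dim=8, branches=[4, 3, 2])
q_per_branch = net(torch.randn(1, 8))
action = [q.argmax(dim=1).item() for q in q_per_branch]  # one sub-action per branch
```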

PBerit commented 2 years ago

@Miffyli: Thanks Miffyli for your answer. What do you mean by "Regular DQN only works with a Discrete action space, where it chooses one action from many, because it predicts the Q-value for every possible action"? Normally, a Neural Network can map any inputs to any outputs, and Deep Q-Learning uses a Neural Network. So why can the Stable Baselines3 DQN only have 1 action?

Miffyli commented 2 years ago

Normally, a Neural Network can map any inputs to any outputs, and Deep Q-Learning uses a Neural Network. So why can the Stable Baselines3 DQN only have 1 action?

I do not have time to explain the whole situation, and I recommend you read more on "Q-learning" (not DQN) and the original DQN paper. The gist is this:

- In Q-learning, the greedy policy picks the action with the highest Q-value: argmax_a Q(s, a).
- DQN implements this by having the network output one Q-value per possible action, so picking an action is just an argmax over the output vector. The output layer therefore hard-codes a single, flat Discrete set of actions.
- For a MultiDiscrete space you would need one output per combination of sub-actions, and that number is the product of the sizes of all dimensions, which grows very quickly.
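To illustrate the second point, here is a minimal, hypothetical Q-network for a Discrete(3) action space; the greedy action is simply an argmax over the three outputs, which is why the action set must be a fixed, flat list:

```python
import torch
import torch.nn as nn

# Hypothetical Q-network for Discrete(3): one output unit per action.
q_net = nn.Sequential(
    nn.Linear(4, 64),   # 4-dimensional (continuous) observation
    nn.ReLU(),
    nn.Linear(64, 3),   # one Q-value per discrete action
)

obs = torch.randn(1, 4)
q_values = q_net(obs)            # shape (1, 3): Q(s, a) for each action
action = q_values.argmax(dim=1)  # greedy action selection
```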

You may close this issue if you do not have enhancements/issues to report that are specific to stable-baselines :)

PBerit commented 2 years ago

Thanks Miffyli for your answer and help.