adysonmaia / sb3-plus

Additional DRL algorithms for the StableBaselines3 library

Dynamic action and observation sizes #1

Open prasuchit opened 10 months ago

prasuchit commented 10 months ago

Hi @adysonmaia,

I'm currently working on a problem where the observation and action sizes can change at every timestep, depending on the action and observation from the previous timestep. For instance, consider a game where one or more players die during a battle, or new players join as the game progresses.

For simplicity's sake, let's assume all players have the same type of observation and action spaces (say Box and Discrete) and that the maximum number of players that can play the game is fixed from the start. Is this something that can be solved using your MIMO version of PPO?

Say my observation at timestep 1 is concatenation(player1 state, player2 state) and the players take the action (call new player, call new player). At the next timestep the observation becomes concatenation(player1 state, player2 state, player3 state), and the action now has to be a combination of the actions available to all three players. How do you suggest I approach this problem, given that existing OpenAI Gym environments and Stable-Baselines algorithms can't handle variable-sized inputs and outputs?
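To make the size change concrete, here's a rough numeric sketch (the per-player state dimension is just made up for illustration):

```python
import numpy as np

PLAYER_OBS_DIM = 4  # hypothetical size of a single player's state vector

player1_state = np.zeros(PLAYER_OBS_DIM, dtype=np.float32)
player2_state = np.zeros(PLAYER_OBS_DIM, dtype=np.float32)

# timestep 1: two players -> observation of shape (8,), one action per player
obs_t1 = np.concatenate([player1_state, player2_state])
actions_t1 = ["call new player", "call new player"]

# timestep 2: a third player joined -> observation of shape (12,),
# and the joint action now needs three entries instead of two
player3_state = np.zeros(PLAYER_OBS_DIM, dtype=np.float32)
obs_t2 = np.concatenate([player1_state, player2_state, player3_state])
```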

Any advice would be appreciated. Thanks.

adysonmaia commented 10 months ago

Hi @prasuchit,

The MultiOutputPPO doesn't support variable-length observation and action spaces. Since each player has the same observation and action spaces in your example, you can define a fixed observation space whose size equals concatenation(player1 state, player2 state, player3 state, ..., player N state), where N is the maximum number of players. Then, at each timestep, you pad the states of the non-playing players. The same can be done for the action space. Alternatively, I recommend RLlib, which does this padding for you automatically: https://docs.ray.io/en/latest/rllib/rllib-models.html#variable-length-complex-observation-spaces
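For what it's worth, here is a minimal sketch of that padding idea in plain Gymnasium (not sb3-plus code; the player dimensions, action count, and `pad_observation` helper are all hypothetical):

```python
import numpy as np
from gymnasium import spaces

MAX_PLAYERS = 8      # N, the maximum number of players (hypothetical)
PLAYER_OBS_DIM = 4   # size of one player's state vector (hypothetical)
PLAYER_ACTIONS = 5   # discrete actions available to one player (hypothetical)

# Fixed-size spaces dimensioned for the worst case of MAX_PLAYERS players.
observation_space = spaces.Box(
    low=-np.inf, high=np.inf,
    shape=(MAX_PLAYERS * PLAYER_OBS_DIM,), dtype=np.float32,
)
action_space = spaces.MultiDiscrete([PLAYER_ACTIONS] * MAX_PLAYERS)

def pad_observation(player_states):
    """Concatenate the active players' states and zero-pad the remaining slots."""
    obs = np.zeros(MAX_PLAYERS * PLAYER_OBS_DIM, dtype=np.float32)
    if player_states:
        flat = np.concatenate(player_states).astype(np.float32)
        obs[:flat.size] = flat
    return obs

# e.g. two active players at this timestep, padded out to MAX_PLAYERS slots
obs = pad_observation([np.ones(PLAYER_OBS_DIM), np.ones(PLAYER_OBS_DIM)])
```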

prasuchit commented 10 months ago

Hi @adysonmaia, thanks for the prompt response. I've come across the RLlib approach and I'm familiar with the padding technique; however, from experience I've seen that it drives the algorithm into a local optimum when the maximum number of agents is high.

In cases where only a small subset of agents plays the game for the entire duration, we would effectively be solving a much larger model because of the padded observations. While the actions can be pruned using masking, training still takes a long time and is prone to local optima.
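For reference, this is roughly how I'd wire up the masking side with MaskablePPO from sb3-contrib (not this repo); the toy env, the `active_players` attribute, and all dimensions below are placeholders I made up:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

MAX_PLAYERS = 4      # hypothetical maximum number of players
PLAYER_OBS_DIM = 3   # hypothetical per-player state size
PLAYER_ACTIONS = 5   # hypothetical number of discrete actions per player

class PaddedGameEnv(gym.Env):
    """Toy placeholder env with padded observations and an active-player list."""

    def __init__(self):
        self.observation_space = spaces.Box(
            -1.0, 1.0, (MAX_PLAYERS * PLAYER_OBS_DIM,), np.float32
        )
        self.action_space = spaces.MultiDiscrete([PLAYER_ACTIONS] * MAX_PLAYERS)
        # only the first two player slots are in the game at the start
        self.active_players = [True, True] + [False] * (MAX_PLAYERS - 2)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(MAX_PLAYERS * PLAYER_OBS_DIM, dtype=np.float32), {}

    def step(self, action):
        # real game logic (players joining/dying, rewards) would go here
        obs = np.zeros(MAX_PLAYERS * PLAYER_OBS_DIM, dtype=np.float32)
        return obs, 0.0, False, False, {}

def mask_fn(env: gym.Env) -> np.ndarray:
    # One boolean per (player, action) pair, concatenated across player slots.
    mask = np.zeros(MAX_PLAYERS * PLAYER_ACTIONS, dtype=bool)
    for i, active in enumerate(env.active_players):
        # keep action 0 (a no-op) valid for inactive slots so sampling never fails
        mask[i * PLAYER_ACTIONS] = True
        if active:
            mask[i * PLAYER_ACTIONS:(i + 1) * PLAYER_ACTIONS] = True
    return mask

env = ActionMasker(PaddedGameEnv(), mask_fn)
model = MaskablePPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=1_000)
```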

adysonmaia commented 10 months ago

Hi @prasuchit, another option is to use multi-agent reinforcement learning, where each player is an agent with a fixed observation and action space.
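For example, a rough skeleton of that idea with a PettingZoo ParallelEnv (PettingZoo isn't part of sb3-plus; the class name, dimensions, and logic are only placeholders):

```python
import functools
import numpy as np
from gymnasium import spaces
from pettingzoo import ParallelEnv

PLAYER_OBS_DIM = 4   # hypothetical per-player state size
PLAYER_ACTIONS = 5   # hypothetical per-player action count

class GameParallelEnv(ParallelEnv):
    """Skeleton multi-agent env: each player keeps a fixed-size space,
    and the set of live agents can change between steps."""

    metadata = {"name": "variable_players_v0"}

    def __init__(self, max_players=8):
        self.possible_agents = [f"player_{i}" for i in range(max_players)]
        self.agents = []

    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        return spaces.Box(-np.inf, np.inf, (PLAYER_OBS_DIM,), np.float32)

    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return spaces.Discrete(PLAYER_ACTIONS)

    def reset(self, seed=None, options=None):
        self.agents = self.possible_agents[:2]  # start with two players
        observations = {a: np.zeros(PLAYER_OBS_DIM, dtype=np.float32) for a in self.agents}
        return observations, {a: {} for a in self.agents}

    def step(self, actions):
        # game logic would go here; players joining or leaving just means
        # updating self.agents and returning dicts keyed by the live agents
        observations = {a: np.zeros(PLAYER_OBS_DIM, dtype=np.float32) for a in self.agents}
        rewards = {a: 0.0 for a in self.agents}
        terminations = {a: False for a in self.agents}
        truncations = {a: False for a in self.agents}
        infos = {a: {} for a in self.agents}
        return observations, rewards, terminations, truncations, infos

env = GameParallelEnv()
observations, infos = env.reset()
```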