MagmaMultiAgent / MagMA

Scalable RL Solution for Lux AI S2

Modify architecture so that actions are generated per unit #56

Closed GergelyMagyar closed 6 months ago

GergelyMagyar commented 10 months ago
  1. Create observations per unit (obs_wrappers.py)
  2. Create actions per unit and calculate log probs (policies.py)
  3. Store new actions (buffers.py)
  4. Create action masks per unit (sb3_action_mask.py)
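As a rough illustration of step 1, a per-unit observation could look something like the sketch below; the feature layout and the entity representation are placeholders for illustration, not the actual obs_wrappers.py code.

```python
import numpy as np

NUM_FEATURES = 100  # placeholder feature count

def per_unit_observations(units, factories):
    """Build one feature vector per friendly entity.

    `units` and `factories` are dicts mapping entity id -> (x, y, power);
    the ids and features here are illustrative, not the real observation layout.
    """
    obs = {}
    for entity_id, (x, y, power) in {**units, **factories}.items():
        features = np.zeros(NUM_FEATURES, dtype=np.float32)
        features[0] = x
        features[1] = y
        features[2] = power
        # ... remaining local features (cargo, nearby resources, etc.) ...
        obs[entity_id] = features
    return obs
```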
GergelyMagyar commented 10 months ago

There are two directions we can go:

  1. The action is the joint action of the whole team
  2. The action is the action of a single entity

  1. The current implementation already follows (1.), and we could do the same here. The observation space would consist of per-entity observations, with a batch dimension so that the observations of all units/factories fit in a single tensor. The output would also be a tensor, where each element of the batch is the action for the corresponding entity.

    Pros:

    • similar to the current implementation

    Cons:

    • wasted memory: if we give global information to the entities, we have to duplicate it for every entity in the batch
    • no local reward, only global
    • we have to work with multiplied probabilities (the joint action probability is the product of the per-entity probabilities), which might be an issue for PPO because we have to allow larger policy updates
    • changing observation space size: in different steps we have a different number of units, so in one step we might have to save a (4, 100) matrix and in another a (13, 100) matrix, and the same problem applies to the action dimensions. One way to solve this is to always pad to the maximum possible number of entities and zero out the unused rows (see the padding sketch after this list), but that would waste an unnecessarily large amount of memory. We could instead set a minimum dimension and at each step only learn from N entities, ignoring the others, but this is also not ideal.

  2. Another option is to treat every single unit and factory as a separate observation-action pair. In one step we would collect observations from multiple entities and generate an action for each of them, which means that in one step we would collect multiple timesteps' worth of data. This is both a good and a bad thing. We would only have to modify how Stable Baselines saves observations and actions and how the rollout is collected. I think this is the better direction. Something similar is already implemented with PettingZoo and SuperSuit, so I will look into this. One problem we might have is that in our case the number of entities changes: in one step we would collect from N trajectories and in another from M. This could cause a problem for both the implementation and the training, since I'm not sure what effect changing trajectory counts would have on training.

    Pros:

    • can use a local reward (and if we go down this path we really should, because otherwise an entity's action could be reinforced only because a teammate did something, not because of anything the entity itself did)
    • no need to mess with multiplied probabilities and PPO KL divergence limits
    • more experience from the same number of training steps, so possibly faster training and lower resource requirements

    Cons:

    • have to implement it ourselves, or try to get the PettingZoo and SuperSuit implementation to work with our setup
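For reference, the padding workaround mentioned under the cons of (1.) boils down to something like the sketch below; MAX_ENTITIES and the feature count are made-up numbers for illustration, not values from the codebase.

```python
import numpy as np

MAX_ENTITIES = 64   # hypothetical cap on units + factories
NUM_FEATURES = 100  # placeholder feature count

def pad_entity_obs(entity_obs):
    """Pad an (N, NUM_FEATURES) matrix to (MAX_ENTITIES, NUM_FEATURES) plus a validity mask."""
    n = min(entity_obs.shape[0], MAX_ENTITIES)
    padded = np.zeros((MAX_ENTITIES, NUM_FEATURES), dtype=np.float32)
    mask = np.zeros(MAX_ENTITIES, dtype=bool)
    padded[:n] = entity_obs[:n]
    mask[:n] = True  # only these rows correspond to real entities
    return padded, mask
```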
GergelyMagyar commented 9 months ago

After experimenting a bit with the codebase I managed to modify Stable Baselines so that we can have observations with changing sizes. This was non-trivial, since originally the observations were stored as large PyTorch tensors or numpy arrays, with extra dimensions for the steps and environments. But if we want to save an observation vector (or matrix, if we want spatial information in the future) for every factory and unit, we run into the problem that (if, say, we have 100 features) sometimes we have 4 entities and a (4, 100) matrix, and sometimes 13 entities and a (13, 100) matrix. The difference can even occur within a single timestep, since we are running multiple environments simultaneously to speed up training.
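In essence, the storage change amounts to keeping Python lists instead of a preallocated fixed-shape array, so each (step, env) slot can hold a matrix of whatever size that step produced. This is a simplified sketch, not the actual modified Stable Baselines buffer:

```python
import numpy as np

class RaggedObsBuffer:
    """Stores one (num_entities, num_features) matrix per (step, env) pair."""

    def __init__(self, n_steps, n_envs):
        # lists instead of a fixed-shape tensor, so the entity count can vary
        self.observations = [[None] * n_envs for _ in range(n_steps)]

    def add(self, step, env_idx, entity_obs):
        # entity_obs can be (4, 100) in one call and (13, 100) in the next
        self.observations[step][env_idx] = np.asarray(entity_obs, dtype=np.float32)
```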

We decided not to pursue direction (2.), since it would be a computational nightmare: we would have to keep track of the trajectory of every entity separately. It is much easier to treat one timestep as a single action, summing up the log probabilities of the individual entity actions. We saw other teams from previous seasons doing the same thing with the probabilities in their implementations, so this approach should work.
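Concretely, treating the whole team's move as one action means the joint log probability is just the sum of the per-entity log probabilities (the log of the product of the probabilities), and that single scalar is what PPO's ratio uses. A minimal PyTorch sketch with hypothetical shapes:

```python
import torch
from torch.distributions import Categorical

def team_log_prob(entity_logits, entity_actions):
    """entity_logits: (num_entities, num_actions); entity_actions: (num_entities,).

    Returns one scalar log probability for the whole team's joint action.
    """
    dist = Categorical(logits=entity_logits)
    per_entity_log_probs = dist.log_prob(entity_actions)  # (num_entities,)
    return per_entity_log_probs.sum()                     # log of the product of probabilities
```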

Another important thing is that with this setup we can have both local (entity-level) and global observations. It is probably best to keep the actor and critic nets separate: the actor net should work with local information and the value net with global information. The motivation is simply that it would be hard to deal with the changing size of the local information when we want to produce a single value for the whole team. Maybe we could do it by averaging a per-entity value output, or something similar, but for now it's easier to keep them separate.
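A rough sketch of that split, with an actor mapping each entity's local features to action logits and a critic mapping one global feature vector to a single value; the layer sizes are placeholders, not the real networks:

```python
import torch
import torch.nn as nn

class EntityActor(nn.Module):
    """Maps per-entity local features to per-entity action logits."""
    def __init__(self, local_features=100, num_actions=12):
        super().__init__()
        self.net = nn.Linear(local_features, num_actions)

    def forward(self, entity_obs):      # (num_entities, local_features)
        return self.net(entity_obs)     # (num_entities, num_actions)

class GlobalCritic(nn.Module):
    """Maps one global feature vector to a single value for the whole team."""
    def __init__(self, global_features=64):
        super().__init__()
        self.net = nn.Linear(global_features, 1)

    def forward(self, global_obs):      # (global_features,)
        return self.net(global_obs)     # (1,)
```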

With a few simple observations and aggressive masking, the agent managed to learn to mine ice. The next goal is to learn more complex behaviors.
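For context, the masking simply pushes the logits of illegal actions to a very negative value before sampling, so the softmax assigns them effectively zero probability. A generic sketch, not the exact sb3_action_mask.py code:

```python
import torch
from torch.distributions import Categorical

def masked_action_distribution(logits, action_mask):
    """logits: (num_entities, num_actions); action_mask: same shape, True where the action is legal."""
    masked_logits = torch.where(action_mask, logits, torch.full_like(logits, -1e8))
    return Categorical(logits=masked_logits)
```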

GergelyMagyar commented 9 months ago

Update: Transfer action is working now. Units can mine ice and then give it to a factory. Unfortunately it doesn't happen frequently right now, which could be for a couple of reasons.

  1. Agents might be blocking each other. We set a 1-pixel exclusion radius around each agent that other agents can't enter, to stop them from colliding all the time. However, I discovered that the dimensions were flipped in the method that calculates the coordinates (see the indexing sketch at the end of this comment), so it might work now with a radius of 0. But there is currently no cap on how many agents a factory can produce, so there might be a lot of agents sabotaging each other.
  2. The net might not be complex enough. Right now it's only 1 layer (no hidden layers); it's basically logistic regression with a softmax. That was enough for mining ice, but it might not be enough anymore.
  3. Agents might be out of power.
  4. PPO hyperparameters could be tweaked as well; however, in my opinion the action should already happen more frequently just by random chance, so debugging should come first.

A bit of investigation is needed.
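For reference, the kind of indexing slip described in point 1 looks roughly like the sketch below: Lux positions are (x, y), while a numpy grid is indexed [row, col] = [y, x], so forgetting to swap them flips the exclusion zone. This is an illustrative reconstruction, not the actual method from the codebase.

```python
import numpy as np

def exclusion_mask(unit_positions, map_size=48, radius=1):
    """Mark cells within `radius` (Chebyshev) of any friendly unit as blocked."""
    blocked = np.zeros((map_size, map_size), dtype=bool)
    for x, y in unit_positions:                  # Lux positions are (x, y)
        x0, x1 = max(x - radius, 0), min(x + radius + 1, map_size)
        y0, y1 = max(y - radius, 0), min(y + radius + 1, map_size)
        blocked[y0:y1, x0:x1] = True             # numpy indexing is [row=y, col=x]
        # blocked[x0:x1, y0:y1] = True  <- the flipped-dimensions version of the bug
    return blocked
```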