Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Self-play environment / ghost trainer #2559

Closed: mbaske closed this issue 4 years ago

mbaske commented 5 years ago

Hi - I'd like to share a table football environment which I built for experimenting with adversarial self-play. https://github.com/mbaske/ml-table-football I was able to train a model using @LeSphax's ghost trainer modification. Looking forward to seeing this cool new feature in a future ml-agents release!

awjuliani commented 5 years ago

Thanks for sharing @mbaske! This is a really cool environment. Can you share a little about your experience using this feature? Did the two sets of agents reach an equilibrium you were happy with?

mbaske commented 5 years ago

Thanks @awjuliani. It looks like the two agents' ELO ratings converge over time. I need to interrupt and resume training (CPU) every couple of hours though, because of the Python memory leak @AcelisWeaven is describing in the PR thread. After doing so, the rewards and ELO ratings can change quite a bit. Sometimes the checkpoint file keeps the previous model checkpoint paths; sometimes they seem to disappear, and the trainer then isn't sampling past checkpoints as it did before interrupting. I'm also getting some TensorBoard glitches, so I can't always see the exact training progress.

Overall, the ghost trainer works much better than my manual approach to self-play though. I initially started with training an agent against a simple heuristic, then duplicated the model and pitted the two copies against each other. Whenever they diverged enough, I would pause, discard the weaker model, duplicate the stronger one and repeat.

So far I haven't seen more complex behaviour like passes, but that's hopefully a matter of training time. I'm currently at around 12M steps with my most promising-looking ghost-trained model, and will keep training it for a while longer. Also, I'm really just guessing with regard to the network size. Since the number of observations and actions is small (58 -> 8), I used 2x256 hidden units.
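For a rough sense of the scale those numbers imply, here is a standalone sketch (plain PyTorch, not the ML-Agents trainer itself, which builds its network from the YAML config) of a 58-observation, 8-action policy with 2x256 hidden units:

```python
import torch.nn as nn

# Illustrative only: a fully connected policy matching the dimensions
# mentioned above (58 observations -> 2x256 hidden units -> 8 actions).
policy = nn.Sequential(
    nn.Linear(58, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 8),
)

# Roughly 83k parameters (58*256 + 256*256 + 256*8 weights plus biases),
# so 2x256 is already generous for an input/output space this small.
print(sum(p.numel() for p in policy.parameters()))
```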

AcelisWeaven commented 5 years ago

Hi @mbaske! Yes, I still have that memory leak, and I'm not versed enough in Python/ML to fix it myself :) What I did was wrap my training command in an infinite bash loop. I keep TensorBoard running in the background and, until it crashes (about every 7 days), the graphs are fine. I mentioned it in a comment on the PR, but the brain order is important for getting more consistent graphs.
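A minimal sketch of that restart-the-trainer workaround, written in Python rather than bash; the mlagents-learn flags shown (--train / --load) are from the pre-1.0 CLI, and the config path and run-id are placeholders, so adjust them to your version and setup:

```python
import subprocess

# Relaunch training whenever the process dies (e.g. from the memory leak
# mentioned above). Assumes a first run with this run-id already produced
# a checkpoint, so --load can resume from it.
cmd = [
    "mlagents-learn", "config/trainer_config.yaml",
    "--run-id=self_play_run", "--train", "--load",
]
while True:
    subprocess.run(cmd)
```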

Anyway, I've been training many models since I started (almost non-stop), but I found it hard to get good results. In my game, if my rewards are sparse, the AI can score but can't figure out how to use the items on the map (and that was with 20+ days of training). You may have the same problem with your passes. The solution I've found is not satisfying, but it works fine for my game: I also reward item usage. For example, if an enemy is trapped, I give a +1/-1 reward to the players. If a 'magnet' item is used, you get a reward depending on the ball velocity relative to the other team's goal. Etc. Here's a video of a model that has been trained for a few days: https://youtu.be/MJwEhCY01kw Here's another one with a 12h-old model (~400k steps): https://youtu.be/RSEL708Inbc
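A sketch of that kind of item-usage shaping, with illustrative names and scales (in an actual ML-Agents environment this logic would live in the C# Agent and be applied with AddReward; the 0.1 magnet scale below is a guess):

```python
import numpy as np

def item_usage_rewards(trapped_enemy, magnet_used, ball_velocity, goal_direction):
    """Shaping terms added on top of the sparse goal reward."""
    reward_self, reward_opponent = 0.0, 0.0
    if trapped_enemy:
        # Zero-sum bonus/penalty when an item traps an opposing player.
        reward_self += 1.0
        reward_opponent -= 1.0
    if magnet_used:
        # Reward proportional to how fast the ball moves towards the
        # opponent's goal when the magnet item is triggered.
        towards_goal = np.dot(ball_velocity, goal_direction) / np.linalg.norm(goal_direction)
        reward_self += 0.1 * towards_goal
    return reward_self, reward_opponent
```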

In these models, I only have a single hidden layer of size 180. Training is a lot faster and gives better results than a two- or three-layer model in the same timeframe.

Also, a downside of the ELO rating is that it only works for 1v1. But I think this rating is not mandatory, and removing it could help with crafting multi-agent teams. OpenAI used self-play on a hide-and-seek game and got awesome results, and I'm not sure they actually used a rating system at all: https://openai.com/blog/emergent-tool-use/ (Paper)

Sorry for the long comment, I'm just sharing some thoughts! @LeSphax's fork actually gets the job done, and I'd be glad to see more development regarding self-play.

LeSphax commented 5 years ago

Hello everyone,

Glad to see this PR was useful to you; I agree, your environments look really cool :)

I've got some time to work on this again in the next few weeks, so I am planning to update the PR to the latest develop branch and then look into the multi-agent teams problem. I think something based on Elo rating could work, since competitive video games often use that. I am not sure if TrueSkill would be better; it feels like it just adds a parameter to tell us how uncertain we are about a player's skill level. Since we don't regularly add agents with unknown skill levels into the game, I feel like this uncertainty is always the same.
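For reference, the standard Elo update being discussed looks like this (a generic sketch; the ghost trainer's own bookkeeping and K-factor may differ):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: a 1200-rated agent beats a 1300-rated past snapshot.
print(elo_update(1200, 1300, 1.0))  # -> (~1220.5, ~1279.5)
```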

@AcelisWeaven You mentioned removing the rating. At the moment the rating doesn't affect behavior at all; it's just there to indicate whether the agents are improving or not. You should be able to disable it by removing use_elo_rating: true from the brain config. I would guess that OpenAI used a rating system there as well, even if they mostly talk about the behaviors. I don't really see a better indicator of skill in a zero-sum game. That said, adding a way to track specific behaviors with heuristics would be helpful, instead of just watching the agents play. For example, you could imagine a TensorBoard graph telling you how often the items are picked up in your game. Maybe it's possible to do that currently with custom protobuf messages.

@awjuliani It would be helpful to hear your thoughts on this feature as well. Do you feel like this is something that could be integrated into the repo? What improvements would we need to do that?

awjuliani commented 5 years ago

@LeSphax

We are still quite interested in the contribution! We are planning on taking a look at it as part of a wider look at multi-agent support, which we will be doing soon.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in the last 14 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had activity in the last 28 days. If this issue is still valid, please ping a maintainer. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.