Farama-Foundation / ViZDoom

Reinforcement Learning environments based on the 1993 game Doom :godmode:
https://vizdoom.farama.org/

Best way to organize self-play #391

Closed alex-petrenko closed 3 years ago

alex-petrenko commented 5 years ago

I am planning to experiment with population-based training and self-play, similar to DeepMind's recent Q3 CTF paper. The obvious requirement is the ability to train agents against copies of themselves on the same map at the same time.

I could probably wrap a multiplayer session into a single multi-agent interface and use ASYNC_PLAYER mode, maybe with an increased tickrate (https://github.com/mwydmuch/ViZDoom/issues/209). However, the optimal way to implement this would be to render multiple observations for different agents within the same tick, in the same process, in synchronous mode, similar to how it's done in single-player.

Any thoughts on what is the right course of action here? Does multi-agent SYNC mode seem feasible or would it require changing half the codebase?

Miffyli commented 5 years ago

We experimented a bit with self-play in the 2018 edition of the competition and didn't get anywhere with it (but that could just be our shabby agent code, which didn't work). At the moment I do not have access to the code/notes to comment further, but I recall things like training being much slower and running into deadlocks and whatnot (sometimes a player or two stopped responding to ViZDoom's commands, sometimes they segfaulted, etc.). But we did run longer experiments with smaller games, so this could be feasible as long as you are fine with the ASYNC_PLAYER way. I can give better comments once I get my hands on the notes/code again.

alex-petrenko commented 5 years ago

Hi, @Miffyli! Thanks a lot for the feedback! So, as far as I understand, you used a central server (game host) with a bunch of ASYNC_PLAYER clients connected to it? Did you run the host in synchronous or asynchronous mode?

My current plan is to host a game in sync (PLAYER) mode and run separate game instances for the clients, either in the same process or in different threads/processes. Then, after the host does its step(), it will send a signal (e.g. via a condition variable) to all other clients so that they can do their step. If this works, the game can proceed in the normal RL step-by-step fashion, provided that the clients and the server are able to communicate their state between steps. At this point I am not sure whether I should run the clients in sync or async mode. I expect this to be much slower than normal single-player gameplay, but at least in this setup all participants are equal, collect an equal amount of frames, and no one gets the host advantage. Not sure if that will work, though.

Miffyli commented 5 years ago

Actually, now I remember we also used SYNC mode, since nowadays SYNC mode is supported for multiplayer as well. In that sense you should be good to go.

And with the SYNC mode you should be fine with just step-ing every environment once to get "equal" observations from each agent. I will get hold of the code later/tomorrow, so I can give better comments on what worked and what didn't :)

alex-petrenko commented 5 years ago

That'd be awesome! I would love to chat more about that, so I'll be waiting for your feedback. There's also a relevant post on DoomWorld forums about this project: https://www.doomworld.com/forum/topic/106770-ai-research-and-doom-multiplayer/?tab=comments#comment-2000068

Miffyli commented 5 years ago

I went through our code, and indeed it does not seem to be anything more than setting Mode to PLAYER and then hosting/joining games as in the cig_multiplayer examples. There are a few quirks, though:

This project should be doable with ViZDoom, but you have to be careful with the networking, as it was a tad fragile. I would also manually check the states/observations from the ViZDoom envs to see if they make sense (all agents progressed by X timesteps, agents executed the correct actions, etc.).
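
For reference, a minimal sketch of that PLAYER-mode host/join setup, loosely in the spirit of the cig_multiplayer examples (the config path, game args, and random-action handling are assumptions, not the competition code, and may need adjusting for your ViZDoom version):

# Sketch: one host and one client in SYNC (Mode.PLAYER) multiplayer,
# each in its own process, stepped in lockstep by make_action().
from multiprocessing import Process
import random

import vizdoom as vzd

def play(game_args, steps=1000):
    game = vzd.DoomGame()
    game.load_config("cig.cfg")            # assumed scenario config
    game.add_game_args(game_args)
    game.set_mode(vzd.Mode.PLAYER)         # synchronous multiplayer mode
    game.init()
    n_buttons = game.get_available_buttons_size()
    for _ in range(steps):
        if game.is_episode_finished():
            break
        _state = game.get_state()          # this player's observation
        action = [random.random() < 0.5 for _ in range(n_buttons)]
        game.make_action(action)           # blocks until all players have acted
    game.close()

if __name__ == "__main__":
    host = Process(target=play, args=("-host 2 -port 5029 -deathmatch +name Host",))
    client = Process(target=play, args=("-join 127.0.0.1 -port 5029 +name Client",))
    host.start(); client.start()
    host.join(); client.join()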

alex-petrenko commented 5 years ago

Ok, I think I'm starting to understand how this all works. So when you do game.init() in Python, it actually spawns a separate Doom process. Therefore, regardless of whether the mode is sync or async, the host is always listening on the network port, independently of the "agent loop". This is what allows it to function in sync mode.

If you implement this "multi-agent" loop naively, I think there's nothing to guarantee that game state updates are broadcast to all the clients before the next game step is made. I want all my clients to have the latest information about the game without any lag, but if the sync-mode loop is too fast, there might be just no time for the changes to propagate to the clients between steps.

A proper way to fix this would be to introduce a piece of game state that is synchronized between server and clients and can be queried through the Python interface, something like a "state ID". Then the process is the following:
1) On the host: advance the game one step and change the state ID to something unique, e.g. increment a counter.
2) Broadcast the changed state ID to the Python client processes through some IPC mechanism, not through Doom networking.
3) On the clients: poll the state ID variable until it matches the latest state ID from the server. After that, advance the game one step and signal the host that you're done.

Edit: FYI, there's actually a simple example of how to do this in the repository: https://github.com/mwydmuch/ViZDoom/blob/1e00b8d05ed58f439f317a24fc7efdf77d6eedea/examples/python/multiple_instances_advance.py This one relies on sleeping for a few ms on the clients before advancing the state.

Miffyli commented 5 years ago

If you use PLAYER mode (sync) with multiplayer, every player will wait until the other players have made their actions before proceeding (the ViZDoom code syncs this). I asked @mihahauke: the sleeps in the code you linked are used to emulate the latency from running computations (which can take a random amount of time), and to demonstrate that the code works even with random delays. Apparently, if clients go out of sync, there should be an error.

alex-petrenko commented 5 years ago

@Miffyli thank you so much, your input was very helpful! I was able to write an implementation of a ViZDoom multi-agent environment following the RLlib interface: https://ray.readthedocs.io/en/latest/rllib-env.html#multi-agent-and-hierarchical

The only problem I have is that I am not able to use make_action(..., skip) where skip is > 1. I usually train with skip == 4, which gives great performance, but in the multi-agent scenario the game always gets stuck sooner or later. With skip == 2 it happens very rarely, with skip == 3 quite often, and with skip == 4 almost instantly. I believe one of the clients gets stuck somewhere in DoomController::waitForDoomWork() and then the entire thing hangs forever, unable to advance. I don't know enough about how Doom multiplayer works, but indeed it seems to be very brittle. Either the server gets ahead of the client by a couple of frames and can't handle it, or maybe a network packet is lost and never resent. Any thoughts on that? Do you think there's any chance this can work with skip_frames=4?

The other approach to multi-agent Doom, rendering multiple viewpoints within the same game instance, seems even more tempting now. I know that Doom has a split-screen feature, so it should already be supported by the engine on some level: https://www.youtube.com/watch?v=k-fjc8hZaJA Still, I think this is a big effort - basically you'd need to make an 8-way split screen in a multiplayer match.

alex-petrenko commented 5 years ago

Ok, I was able to partially work around this by manipulating the update_state flag. This saves a lot of time simply by not transmitting the screen buffers we won't use anyway due to frame skipping.

With the update_state trick I get around 3200 environment steps per second for 8 agents, so ~400 steps/sec per agent, or ~100 observations/sec per agent. Without the update_state trick it's about 1900 FPS max. I believe the old Doom networking is probably the bottleneck now. Another idea would be to dig into the Doom source code and replace the networking with something fast and local, like shared memory, pipes, etc.

Here's my current implementation: https://gist.github.com/alex-petrenko/5cf4686e6494ad3260c87f00d27b7e49 This example is not self-contained, but it should be enough to get the idea.
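
For illustration only, a rough approximation of the frame-skip-without-state-updates idea using the public advance_action API (this is not the code from the gist; the advance_action signature and the reward bookkeeping may differ between ViZDoom versions):

# Sketch: skip frames while avoiding state (screen buffer) updates on the
# intermediate tics. Reward accounting via get_total_reward() deltas is an
# assumption; verify it against your ViZDoom version.
def step_with_skip(game, action, skip=4):
    reward_before = game.get_total_reward()
    game.set_action(action)
    for _ in range(skip - 1):
        game.advance_action(1, False)   # advance one tic, do not update the state
        if game.is_episode_finished():
            break
    if not game.is_episode_finished():
        game.advance_action(1, True)    # last tic: update state to get a fresh observation
    done = game.is_episode_finished()
    obs = None if done else game.get_state()
    reward = game.get_total_reward() - reward_before
    return obs, reward, done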

alex-petrenko commented 5 years ago

https://youtu.be/dHGSZRFTnf0

Miffyli commented 5 years ago

I also recall networking being a bottleneck when we ran our experiments (something about processes waiting for messages). Unless I am terribly mistaken, ZDoom uses the original P2P networking, which does not cope well when you try to squeeze in many players or a lot of frames.

The implementation looks quite nifty and compact, though! With a bit of tidying up it could be a nice example. Perhaps also a small write-up on the challenges/requirements/limitations of multi-agent ViZDoom training, @mwydmuch?

alex-petrenko commented 5 years ago

Definitely happy to collaborate! I am planning to tidy it up and get rid of the "vizdoomgym" dependency at some point (or maybe we should actually make something like this in the ViZDoom repo, because for every project I end up using some kind of Gym wrapper). After that, it can probably be added to the examples.

Performance-wise, I was able to push pure experience collection to around ~11500 environment frames per second on a 10-core, 20-thread CPU. That is ~2850 observations/sec with 16 parallel workers, each running an 8-agent environment, so 128 Doom processes in total. Doom renders at 256x144 resolution, later downsampled to 128x72. The standard 160x120 -> 84x84 would give an additional ~5% improvement, but I am sticking to widescreen for now. For comparison, I get ~39000 FPS on the same machine with ~200 parallel Doom envs in single-player mode.

So it is evident that the old ZDoom P2P networking is a major bottleneck. I might look into ways to speed this up, but not right now.

Miffyli commented 5 years ago

"Official" gym environment in this repo would be nice indeed, but integrating it in the code is not too straight forward, since all ViZDoom code is done in C/C++. There has been separate repos for Gym envs, but they have died/quieted down. Easy way could be to add just an Python example that implements Gym API.

The numbers sound promising! Naturally Google had a lot of hardware to throw at their training, but I am willing to bet that on local machines (like your 10-core) using ViZDoom is much faster :). I would like to hear when/if you get any results!

alex-petrenko commented 5 years ago

Thanks for the encouragement, definitely will share results!

Side note: it looks like the reward mechanism behaves weirdly in multiplayer. If I give my agent a reward of +1 every tick, e.g. in the cig.acs script:

global int 0:reward;

script 2 ENTER
{
    while(1)
    {
        reward += 1.0;
        delay(1);
    }
}

then each of my agents also gets the reward for everyone else. E.g. if I have 8 agents in the environment, every single agent gets +8 reward every tick. I assume this is just how it works, because ACS variables are synchronized between all clients, but maybe this should be documented somewhere. Bottom line: the standard reward mechanism through an ACS script does not work with multiplayer. I will have to look at the game variables in the Python code to compute my rewards for kills, medkit pickups, etc.

Please correct me if I am wrong :)

Miffyli commented 5 years ago

Ah yes, forgot to mention this. Indeed, the ACS scripts behave wonky, quite possibly for the reason you said (each agent's script does the += 1, etc.) :). However, for deathmatch you can quite easily implement the reward system by tracking the game variables (see the sketch below), and with the label/map buffers you should be able to get most of the information for auxiliary rewards if you want to use them (e.g. whether the crosshair is over an enemy).
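
For concreteness, a small sketch of the game-variable approach (the game variables below exist in the ViZDoom API; the reward weights are purely illustrative, not values anyone reported using):

# Sketch: per-agent reward computed from ViZDoom game variables instead of ACS
# (per the discussion above, ACS global rewards leak across players).
import vizdoom as vzd

class VariableRewardTracker:
    def __init__(self, game):
        self.game = game
        self.prev = self._read()

    def _read(self):
        g = self.game
        return {
            "frags": g.get_game_variable(vzd.GameVariable.FRAGCOUNT),
            "deaths": g.get_game_variable(vzd.GameVariable.DEATHCOUNT),
            "health": g.get_game_variable(vzd.GameVariable.HEALTH),
        }

    def reward(self):
        cur = self._read()
        r = 0.0
        r += 1.0 * (cur["frags"] - self.prev["frags"])              # reward kills
        r -= 0.5 * (cur["deaths"] - self.prev["deaths"])            # penalize dying
        r += 0.01 * max(0.0, cur["health"] - self.prev["health"])   # e.g. medkit pickups
        self.prev = cur
        return r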

alex-petrenko commented 3 years ago

BTW, this worked out. This repo contains a full implementation of working multi-agent training in ViZDoom: https://github.com/alex-petrenko/sample-factory

Maxwell2017 commented 3 years ago

Excuse me, have you ever used RLlib to wrap your project "sample-factory" around the game ViZDoom? Or did you use the vizdoomgym project? @alex-petrenko

alex-petrenko commented 3 years ago

@Maxwell2017 I did not use any of that. All of the code is handwritten and is available at the URL above. This command should start the 1v1 duel training (multi-agent):

python -m algorithms.appo.train_appo --env=doom_duel --train_for_seconds=360000 --algo=APPO --gamma=0.995 --env_frameskip=2 --use_rnn=True --num_workers=72 --num_envs_per_worker=16 --num_policies=8 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --res_w=128 --res_h=72 --wide_aspect_ratio=False --benchmark=False --pbt_replace_reward_gap=0.5 --pbt_replace_reward_gap_absolute=0.35 --pbt_period_env_steps=5000000 --with_pbt=True --pbt_start_mutation=100000000 --experiment=doom_duel_full

Maxwell2017 commented 3 years ago

In fact, I want to use a reinforcement learning framework that is compatible with TF and PyTorch and can integrate various game environments, where I can either implement the algorithm myself or use a built-in one. Do you have any good suggestions? @alex-petrenko

Miffyli commented 3 years ago

@Maxwell2017 Implement an OpenAI Gym interface over ViZDoom (see the old example here). This will make it easy to use ViZDoom with existing libraries like stable-baselines, and easier to implement algorithms yourself. A rough sketch of the idea is below.
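
A very rough sketch of such a wrapper, assuming the old Gym API and a one-hot action scheme (both are simplifying assumptions; see the complete implementations mentioned later in this thread for a proper version):

# Sketch of a Gym-style wrapper around a ViZDoom scenario.
import gym
import numpy as np
from gym import spaces
import vizdoom as vzd

class DoomEnv(gym.Env):
    def __init__(self, config="basic.cfg", frame_skip=4):
        self.game = vzd.DoomGame()
        self.game.load_config(config)
        self.game.set_window_visible(False)
        self.game.init()
        self.frame_skip = frame_skip
        self.game.new_episode()
        sample = self.game.get_state().screen_buffer
        # Buffer shape/layout depends on the configured screen format
        self.observation_space = spaces.Box(0, 255, sample.shape, dtype=np.uint8)
        self.action_space = spaces.Discrete(self.game.get_available_buttons_size())

    def reset(self):
        self.game.new_episode()
        return self.game.get_state().screen_buffer

    def step(self, action):
        buttons = [0] * self.action_space.n
        buttons[action] = 1                 # press exactly one button per step
        reward = self.game.make_action(buttons, self.frame_skip)
        done = self.game.is_episode_finished()
        obs = (np.zeros(self.observation_space.shape, dtype=np.uint8)
               if done else self.game.get_state().screen_buffer)
        return obs, reward, done, {}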

alex-petrenko commented 3 years ago

@Maxwell2017 @Miffyli I guess there's no shame in advertising our codebase, so consider this: https://github.com/alex-petrenko/sample-factory/blob/master/envs/doom/doom_gym.py This is an OpenAI Gym wrapper for ViZDoom. There are other implementations out there, but most of them are not up to date, lack functionality, or have bugs. This one is battle-tested; you can train your agents to this level of performance.

SampleFactory also comes with a multi-agent VizDoom wrapper, which is a pretty non-trivial thing to implement.

Besides SampleFactory, there are not that many frameworks that support multi-agent training and self-play out of the box. One other option is RLlib, which you can configure for multi-agent training, but keep in mind that experiments will be a lot slower (a 3-10x difference depending on the environment). In fact, an attempt to train these bots with RLlib led to the development of SampleFactory, because RLlib was just a bit too slow for that (it is a pretty powerful codebase otherwise).

If you're not interested in multi-agent learning, you can try any other RL framework; stable-baselines or rlpyt are good examples. I would still consider using the wrappers from SampleFactory, as this will save you a lot of time.

Maxwell2017 commented 3 years ago

Does the ViZDoomGym repo (https://github.com/shakenes/vizdoomgym) use the same Doom version as gym-doom (https://github.com/ppaquette/gym-doom)? In fact, I don't know the difference between the new and the old version. Can you point it out? @Miffyli

Maxwell2017 commented 3 years ago

There is another question. I found in your paper (https://arxiv.org/pdf/2006.11751.pdf) that there is a ViZDoom experiment based on RLlib. I am a newcomer to reinforcement learning. I would like to know whether, using RLlib's built-in methods (such as PPO or DQN), I can train an agent with good results. But I don't know how to tell whether a result is good or bad - should I refer to the results in the ViZDoom paper? I can search for methods implemented by others on GitHub, but those seem to rely on tricks that are not part of the built-in algorithms. Looking forward to your reply, thank you very much! @alex-petrenko

alex-petrenko commented 3 years ago

@Maxwell2017 In our paper, we only used RLlib for performance measurements, but you definitely can train good policies in ViZDoom with RLlib, although it will be slower. In particular, I used the PPO, APPO and IMPALA methods; they all had similar performance. The only caveat is that I wasn't able to get good performance out of RNN-based (GRU/LSTM) policies; that could have been some bug which is probably fixed now. I didn't have such problems with SampleFactory.

Maxwell2017 commented 3 years ago

@alex-petrenko When using RLlib, I know that I need to wrap the game environment (use doom_gym from sample-factory and complete the registration), set the policy (use a built-in algorithm), and set the loss. Is there anything else I need to pay attention to?

alex-petrenko commented 3 years ago

Yeah, that's pretty much it. Just copy-paste the SampleFactory gym implementation (or install it as a local pip package from sources) and set up the training parameters, roughly as sketched below. You might need to modify the default convnet model as well, but that also should not be hard.
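
A rough sketch of the RLlib side, assuming a ray 1.x-era API (the config keys may differ in newer versions). DoomEnv and DoomCNN are placeholders for your own Gym wrapper and custom model:

# Sketch: register a ViZDoom gym wrapper and an optional custom model with
# RLlib, then launch PPO training via tune.
import ray
from ray import tune
from ray.tune.registry import register_env
from ray.rllib.models import ModelCatalog

register_env("doom_basic", lambda env_config: DoomEnv(**env_config))
ModelCatalog.register_custom_model("doom_cnn", DoomCNN)  # optional custom convnet

ray.init()
tune.run(
    "PPO",
    stop={"timesteps_total": 5_000_000},
    config={
        "env": "doom_basic",
        "env_config": {"config": "basic.cfg"},
        "model": {"custom_model": "doom_cnn"},
        "framework": "torch",
        "num_workers": 8,
    },
)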

Maxwell2017 commented 3 years ago

Thank you, I'll try it now:)

alex-petrenko commented 3 years ago

@Maxwell2017 First of all, what paper and what experiment in the paper are you referring to? Your question boils down to "why does my RL algorithm not work?" This is very hard to answer without knowing the details of the task, your reward function, the model architecture, the algorithm and the implementation you are using, the hyperparameters and the training schedule, etc.

In short, RL is still more art than mature technology. You generally can't just plug an environment into a learning system and expect it to work right away 100% of the time. Things need tuning.

If you're trying to reproduce a result from the SampleFactory paper, I suggest that you try to do it with SampleFactory first.

Maxwell2017 commented 3 years ago

In fact, the paper I refer to is "ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning". For the basic experiment, I follow the neural network architecture and learning settings from the paper; the policy uses DQN with RLlib. However, in this repo (ViZDoom) I can't find the complete network structure that matches the paper, and I don't know the stride of the conv layers, so I set it to 1 by default. @Miffyli
In RLlib, I use register_env and register_custom_model to define the game environment and the network structure. In today's experiment I found that sometimes the reward can reach 80 (close to the number in the paper), but then the reward declines and training is very unstable. I'm very sorry, I think I must have misconfigured something. @alex-petrenko

alex-petrenko commented 3 years ago

I see, so you're talking about the "basic" scenario. The full reward function subtracts 1 point for every action, so 80 is actually close to the maximum reward that can be expected (i.e. the monster is killed in only ~20 steps). I think ~80 is the performance of the optimal policy.

If you are using these rewards directly, the first thing I'd do is reduce the reward scale. I believe in SampleFactory we used a 0.01 scale for these rewards, i.e. 101 turns into 1.01, -5 turns into -0.05, etc. Neural networks typically have a much easier time learning from small quantities like that.
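
A minimal sketch of that reward scaling as a Gym wrapper (0.01 is just the scale suggested above; tune it for your scenario):

import gym

class ScaleReward(gym.RewardWrapper):
    def __init__(self, env, scale=0.01):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # Shrink rewards so the network sees values roughly in [-1, 1]
        return reward * self.scale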


Maxwell2017 commented 3 years ago

Good suggestion on scaling the reward! Strangely enough, after reaching 80 the mean reward gradually decreased, eventually down to around -280, which is very surprising. I also have a basic question: how much does the network structure affect DQN's results? I refer to the DQN implemented in PyTorch (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html), because I found that the network structure in the ViZDoom paper is not clear. I don't know if this is OK?

Miffyli commented 3 years ago

Increasing the network size usually leads to slower learning early on (but probably better results later). For the basic scenario, the network defined in the example file should eventually reach close to the optimal reward. Generally the DQN network (the one you linked) works well in simple scenarios (including the ViZDoom scenarios) and only becomes a bottleneck at larger scales.

I suggest you look at OpenAI Spinning Up for practical information on DRL.

Maxwell2017 commented 3 years ago

Sorry, I am only seeing your reply now, @Miffyli. By "the network defined in the example file", do you mean examples/python/learning_tensorflow.py#L232, examples/python/learning_pytorch.py#L152 and examples/python/test_pytorch.py#L51? What I understand is that any of them can work well in basic scenarios, although some use a dueling DQN architecture. One more question: is there any Medikit Collecting Experiment example (especially the Game Settings in Section IV-B2-b) that I can refer to? I saw that a shaping reward needs to be set.

Miffyli commented 3 years ago

What I understand is that any of them can work well in basic scenarios

Ah, I just recalled that some of those were updated quite recently... Yes, even a simple/small network should learn the basic environment (the original Theano code has such a small network). A larger network might actually make it much slower (more parameters to tune).

One more question: is there any Medikit Collecting Experiment example

Which paper are you referring to? You can find the scenario files in the scenarios directory. If you want example code that learns in that scenario, you can modify the example learning code to support it. Learning better policies in health gathering supreme requires providing the current health information to the network, which needs a bit more modification (see e.g. this comment). Some of the papers related to the competition ran experiments with these environments (see the references in this paper). This paper also used the health-gathering task.

Maxwell2017 commented 3 years ago

In fact, the paper I refer to is "ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning". I found the Game Settings in Section IV-B2-b, but I don't know how to set it in the DQN network. In addition, in the vizdoom_hgs_test.zip you gave, I found living_reward = 0.01 and death_penalty = 1 in health_gathering_supreme.cfg. Is that the method (reducing the reward scale) alex-petrenko proposed? So I don't need to modify my code, just the cfg? :) @Miffyli

Miffyli commented 3 years ago

but I don’t know how to set it in the DQN network.

Ah, it seems the only change compared to the other experiments is the use of RMSProp. Note that the example code in this repo is not the code used in the paper (I do not think it is available). I recommend you use an existing implementation of DQN for your experiments, e.g. the one from stable-baselines/stable-baselines3.

Is that the method (reducing the reward scale) alex-petrenko proposed?

Yup! This is exactly what he suggested.
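
For example, a tiny usage sketch with stable-baselines3's DQN. DoomEnv is a placeholder for your own Gym wrapper, the hyperparameters are illustrative, and CnnPolicy assumes image observations in the layout SB3 expects:

# Sketch: off-the-shelf DQN from stable-baselines3 on a Gym-wrapped ViZDoom env.
from stable_baselines3 import DQN

env = DoomEnv(config="health_gathering.cfg")   # hypothetical wrapper
model = DQN("CnnPolicy", env, learning_rate=1e-4, buffer_size=100_000, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("dqn_health_gathering")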

Maxwell2017 commented 3 years ago

I recommend you use an existing implementation of DQN for your experiments

Maybe I didn't explain it clearly. In fact, I want to know: in the paper I refer to, "The nonvisual inputs (health, ammo) were fed directly to the first fully-connected layer". How should they be combined with the visual inputs in DQN? This confuses me :(

Miffyli commented 3 years ago

Ah, right. The traditional way to do it is to concatenate such 1D features into the feature vector that comes out of the CNN (inside the network). In the example PyTorch code you would do something like this (around line 200):

x = self.conv3(x)
x = self.conv4(x)
x = x.view(-1, 192)
# Combine picture features and 1D features into one vector
# (concatenate along the feature dimension, dim=1, not the batch dimension)
x = th.cat((x, your_1d_features), dim=1)
# Note that these split indices would need changing as well...
x1 = x[:, :96]  # input for the net that calculates the state value
x2 = x[:, 96:]  # relative advantage of actions in the state

You can find cleaner implementations in the Unity ML-Agents code, RLlib, or the experimental stable-baselines3 PR for supporting so-called "dictionary observations" (see the comment I linked above and the related PR).

Maxwell2017 commented 3 years ago

x = th.cat((x, your_1d_features), dim=1)

For health in game_variables, it is a scalar (in observation_space it should be defined as spaces.Box(0, np.Inf, (1,)), or obtained through self.game.get_available_game_variables()), so here I can directly concat it with the features after the convolutions. Do I understand it right? :)

Miffyli commented 3 years ago

Yes, you can concatenate the health information in that spot (but remember to adjust the other code around it, as in the fuller sketch below). The example code is very hardcoded for basic.py, so, again, I recommend taking a look at established libraries.
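
To make the adjustment more explicit, a small self-contained sketch of a dueling head that takes both the image and a 1-D game-variable vector (all layer sizes are illustrative; conv_out must match the flattened conv output for your input resolution):

# Sketch: dueling Q-network combining CNN features with 1-D game variables.
import torch
import torch.nn as nn

class DuelQNet(nn.Module):
    def __init__(self, n_actions, n_game_vars=1, conv_out=192):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=6, stride=3), nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        combined = conv_out + n_game_vars
        self.value = nn.Sequential(nn.Linear(combined, 64), nn.ReLU(), nn.Linear(64, 1))
        self.advantage = nn.Sequential(nn.Linear(combined, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, screen, game_vars):
        x = self.conv(screen)                       # (batch, conv_out)
        x = torch.cat((x, game_vars), dim=1)        # append health etc. as extra features
        v = self.value(x)
        a = self.advantage(x)
        return v + a - a.mean(dim=1, keepdim=True)  # dueling combination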

Maxwell2017 commented 3 years ago

Sorry, I am only seeing your reply now.

x = th.cat((x, your_1d_features), dim=1)

As shown in the code above, I tried concatenating the 1-D health info with the convolutional features, using the neural network from the basic scenario, and it didn't seem to converge. Is this related to the input image size? I am using 64x64 input and I have reduced the reward scale. Is it possible that the problem is that I did not use shaping rewards, such as +100 and -100 points for collecting a medikit and a vial respectively? Is that necessary? Or should I just train longer? :)

Miffyli commented 3 years ago

@Maxwell2017

Health gathering supreme is a muuuuuch harder task than the basic one, especially if you do not add the said reward shaping (giving/negating reward upon picking up a medkit/vial). I recommend adding this reward shaping to see if it starts to learn anything (+1 reward for a medkit, -1 for a vial, everything else zero). Even with this it might take hundreds of thousands of steps to train.

Note that it is generally hard to say whether "method X should learn scenario Y", especially if it has not been done before. You might need to tune hyperparameters to get it working.

Maxwell2017 commented 3 years ago

@Miffyli In fact, I use health_gathering.cfg instead of health_gathering_supreme.cfg. Are these two both more difficult than basic.cfg? I referred to this code: https://github.com/glample/Arnold/blob/86af06d2fdb35c4bf552ecacfe8fe6ac1abd8cd4/src/doom/game.py#L191 Since I am using RLlib, does using reward shaping mean that I have to add similar operations (as in the code I referred to) to my own gym wrapper?

Miffyli commented 3 years ago

Yes, the best way to do reward shaping with Gym environments is through wrappers, as sketched below. health_gathering.cfg should be very easy to learn, even without the reward shaping; health_gathering_supreme is way more difficult and I recommend starting with shaping.
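
A sketch of what such a wrapper could look like for health gathering. It assumes the wrapped env exposes its DoomGame as .game and detects pickups via health deltas; the thresholds and the +1/-1 values follow the suggestion above and are assumptions to adjust per scenario:

# Sketch: reward shaping through a Gym wrapper using the HEALTH game variable.
import gym
import vizdoom as vzd

class HealthShapingWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.prev_health = None

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.prev_health = self.env.game.get_game_variable(vzd.GameVariable.HEALTH)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        health = self.env.game.get_game_variable(vzd.GameVariable.HEALTH)
        delta = health - self.prev_health
        if delta > 0:      # health went up: likely picked up a medkit
            reward += 1.0
        elif delta < -5:   # large drop: likely a poison vial (threshold is a guess)
            reward -= 1.0
        self.prev_health = health
        return obs, reward, done, info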

Trinkle23897 commented 3 years ago

Hi @alex-petrenko, I'm currently building a vectorized multi-agent ViZDoom env but encountered the same issues (the difference is that I used the C++ interface):

I dug into #417 and also sample-factory's source code, but still cannot find either a reason or a solution for the above cases. Do you have any other insights?

alex-petrenko commented 3 years ago

Hi! I fixed these in my fork: https://github.com/alex-petrenko/ViZDoom/commits/doom_bot_project This is why SampleFactory instructs users to install ViZDoom from this branch. Specifically, take a look at commits 2d54 and e451.

This was supposed to be merged into ViZDoom as a part of the Sound-RL project (https://github.com/mwydmuch/ViZDoom/pull/486/files), but it looks like we forgot to include it. @hegde95 @Miffyli I think we should totally merge that!

Alternatively, @Trinkle23897, I would really appreciate it if you could submit these as a PR if they fix your issue!

alex-petrenko commented 3 years ago

@hegde95 merged the changes in #486. You might want to consider switching to that branch, since it is newer and will soon be merged into master.