bzier / gym-mupen64plus

An OpenAI Gym environment wrapper for the Mupen64Plus N64 emulator

Always-increasing memory from emulator #80

Closed: roclark closed this issue 3 years ago

roclark commented 3 years ago

Spent some time on this project over the holiday period and noticed that I would run out of memory on my system after training an agent for several hours. I found that the mupen64plus emulator continually uses more system memory over time and eventually uses all available memory when running with multiple agents. Here's what I've found so far:

  1. Build a container using the steps in the README.
  2. Launch a container and update the verifyEnv.py script to run for 10000+ steps in one of the loops (a rough stand-in for the modified loop is sketched just after this list).
  3. Run the program with python verifyEnv.py.
  4. In another terminal, exec into the running container.
  5. Inside the container, run top and view the memory usage for mupen64plus. As the script runs, the memory usage will continue to increase. While it starts off small, the increases add up and can eventually exhaust resources for multi-agent training setups.
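
For reference, the loop I end up running is essentially equivalent to the sketch below. It's a rough stand-in for the modified verifyEnv.py, not the actual script; the env id and the five-value NOOP action follow the README example, so adjust both for whatever environment you're testing.

```python
import gym
import gym_mupen64plus  # noqa: F401 - importing registers the N64 environments

# Rough stand-in for the modified verifyEnv.py loop: keep the emulator busy
# for many steps while watching its memory usage from another terminal.
env = gym.make('Mario-Kart-Luigi-Raceway-v0')  # env id assumed from the README
env.reset()

NOOP = [0, 0, 0, 0, 0]  # [joystick X, joystick Y, A, B, RB], all released

for step in range(10000):
    obs, reward, done, info = env.step(NOOP)
    if done:
        env.reset()

env.close()
```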

I recognize this might not be specific to what's done in this project, and perhaps this is just a feature of mupen64plus, but I don't really know where else to begin looking. The closest thing I've found in some GitHub issues on the mupen64plus organization is that some debug logging could be allocating memory, but I have no idea how to debug this.
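
For what it's worth, the growth is easy to watch programmatically as well as with top. Something along these lines is enough to chart it (just a sketch; psutil is an extra dependency, not something this project uses):

```python
import time

import psutil  # extra dependency, only used here to sample memory usage

# Print the resident set size of every mupen64plus process every 30 seconds.
while True:
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] and "mupen64plus" in proc.info["name"]:
            rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
            print(f"pid={proc.pid} rss={rss_mb:.1f} MiB")
    time.sleep(30)
```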

Thanks again for this awesome project, BTW. I hope to spend some more time with it this year.

bzier commented 3 years ago

Thanks for this @roclark. I did some testing today and at this point, I believe the memory leak may be in the mupen64plus-input-bot.

I ran the emulator (not this project, not using any gym env) using the 'standard' mupen64plus-input-sdl plugin. This plugin is the one that would normally read from the keyboard/mouse or gamepad, if you were playing 'manually'. The memory usage did not increase by much over the course of ~10 minutes.

Then I tried running with the mupen64plus-input-bot plugin. In this case, the plugin connects to an HTTP server and requests the controller values. Since I wasn't running the full gym environment, I started a simple 'noop' Python server, which just returns controller values of all 0s for each request. Running this way, even for a short period of time (5-10 min), the memory utilization of the emulator process steadily increased to much higher numbers than I saw with the SDL plugin.
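
The noop server doesn't need to be anything more than something like the sketch below. This isn't the exact script I used; the controller field names and the port are assumptions here, since the plugin only needs a JSON object it can parse on every poll.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed controller field names; the plugin just needs a JSON object of values.
FIELDS = [
    "START_BUTTON", "A_BUTTON", "B_BUTTON", "Z_TRIG",
    "L_TRIG", "R_TRIG", "L_DPAD", "R_DPAD", "U_DPAD", "D_DPAD",
    "L_CBUTTON", "R_CBUTTON", "U_CBUTTON", "D_CBUTTON",
    "X_AXIS", "Y_AXIS",
]
NOOP_RESPONSE = json.dumps({field: 0 for field in FIELDS}).encode()

class NoopControllerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every poll from the input plugin with all-zero controller values.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(NOOP_RESPONSE)))
        self.end_headers()
        self.wfile.write(NOOP_RESPONSE)

    def log_message(self, *args):
        pass  # silence per-request logging; the plugin polls every frame

if __name__ == "__main__":
    # Port assumed to match the input plugin's configuration.
    HTTPServer(("", 8082), NoopControllerHandler).serve_forever()
```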

I'm taking a look at that code now, looking for memory leaks or other ways we can optimize the plugin. For a while, I've been thinking it would be a good idea to re-use the socket connection rather than opening and closing the socket for each request (the emulator polls the controller on each 'frame'). However, I'm not sure the socket connections have anything to do with the leak (we do call close on the file descriptor). Either way, it may be a decent optimization.

I've had another couple of ideas as well, but those are more fundamental changes and would be better discussed in an issue on that repository instead of here.

As for the memory leak itself, there is probably some allocation that isn't being cleaned up properly, and there isn't much code in the plugin to search through (see plugin.c and controller.c). Most of the code in plugin.c is only called once during startup/initialization; the code that executes repeatedly is the read_controller method, so that's where I'm focused.
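
For anyone following along, each poll amounts to roughly the cycle below. This is a loose Python illustration of what controller.c does per frame, not a transcription of the C code; the request format and port are assumptions.

```python
import json
import socket

def read_controller(host="127.0.0.1", port=8082):
    # Per-frame cycle: open a socket, issue an HTTP request, parse the JSON
    # body into controller values, and close the connection again.
    with socket.create_connection((host, port)) as sock:
        request = (f"GET / HTTP/1.1\r\nHost: {host}\r\n"
                   "Connection: close\r\n\r\n").encode()
        sock.sendall(request)
        response = b""
        while chunk := sock.recv(4096):
            response += chunk
    headers, _, body = response.partition(b"\r\n\r\n")
    return json.loads(body)
```

Re-using one persistent connection instead of repeating this open/close cycle is the optimization mentioned above.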

bzier commented 3 years ago

FYI, I've narrowed down the leak to the JSON parsing (line 119 here). I'm going to take a look and see if this can be improved.

bzier commented 3 years ago

Turns out the JSON object just needed to be marked to be freed. It is a one-line fix. I have run the emulator for 10+ minutes again, with negligible increase in memory utilization.

Since this change is in the input plugin, I will need to apply the fix there, and then update this repo with the new plugin version. Unfortunately, right now this repository is a bit of a mess. I need to update a couple branches and get some PRs merged. The latest input plugin has changes to support multiple controllers, which I've handled in a branch. I'm not sure when I'll get these things aligned, but I will at least update this issue with the link to the input plugin fix once it is done.

roclark commented 3 years ago

Awesome work @bzier, that's great! I see the pull request has already been accepted. I'll play around with that locally and test how things go. I'd be happy to help with some of the updates if you need it. Thanks again!

bzier commented 3 years ago

You're welcome. Thanks for pointing this out. I've had trouble with my training agents crashing on occasion, and I've never taken the time to figure out why, since it was infrequent and easy to restart/continue. I suspect this memory leak was likely the cause, so I'm excited to have it addressed.

Theoretically, you should just be able to update the Dockerfile with the sha of the latest input plugin commit: 0a1432035e2884576671ef9777a2047dc6c717a2, and then rebuild the docker image. Depending on which branch you're based on, this may or may not just work. Let me know if you have trouble with it.

As far as updating things goes, I just need to take an afternoon and finally do it. I've been putting it off for a long time and need to set aside the time to deal with it. I appreciate the offer to help, and I'll let you know if there's anything you can do.

roclark commented 3 years ago

Just pulled that hash into my Dockerfile and rebuilt, and my mupen64plus memory has stayed relatively constant in a new training session, so I'd say this solves my issue! Thanks again for the great work! 😃

I certainly know the feeling about putting things off - I'm guilty of that on a number of occasions with my own personal projects.