Kautenja / playing-mario-with-deep-reinforcement-learning

An implementation of (Double/Dueling) Deep-Q Learning to play Super Mario Bros.
MIT License

Training Killed by OS #29

Closed Kautenja closed 6 years ago

Kautenja commented 6 years ago

Sometimes the script dddqn_train.py is killed by Ubuntu. I'm not sure whether this is caused by memory limitations; there should be plenty of memory for this setup, but perhaps Ubuntu kills the process for some other reason. The other possibility is a rare edge case in the interaction between the Python and Lua scripts that is hard to reproduce.

Lua thread bombed out: ...ckages/gym_super_mario_bros/lua/super-mario-bros.lua:12: bad argument #1 to 'find' (string expected, got nil)
Emulation speed 100.0%
[1]    1819 killed    python3 dddqn_train.py SuperMarioBrosNaFrameskip results

Oddly, the command shown in the log doesn't match what was actually issued. This is a peculiar bug.
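One way to confirm whether the Linux OOM killer is responsible is to scan the kernel log for out-of-memory messages. The helper below is an illustrative sketch only (the function name and default process name are assumptions, not part of this repository), and dmesg may require elevated privileges:

import subprocess

def was_oom_killed(process_name: str = "python3") -> bool:
    """Return True if the kernel log records an out-of-memory kill of process_name."""
    # dmesg prints the kernel ring buffer; OOM kills appear as
    # "Out of memory: Kill process <pid> (<name>) ..." entries.
    result = subprocess.run(["dmesg"], capture_output=True, text=True)
    return any(
        "out of memory" in line.lower() and process_name in line
        for line in result.stdout.splitlines()
    )

if __name__ == "__main__":
    print("OOM kill detected:", was_oom_killed())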

Kautenja commented 6 years ago

The last training session confirms that this is the result of running out of memory on Ubuntu. During that session, Ubuntu failed to kill the process and the replay queue filled the entire 32 GB of available system memory, leaving the system completely unresponsive. This most likely stems from the downsampler output size being changed from (84, 84) to (100, 100); that change has now been reverted to the original (84, 84). It would be convenient to calculate the memory requirements of the replay queue on initialization and raise an error if they exceed some threshold of available system memory.
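A minimal sketch of such a check, assuming a replay queue of fixed capacity storing uint8 frames and using psutil to read total system memory (the function name, argument layout, and 75% threshold below are illustrative, not part of this repository):

import numpy as np
import psutil

def check_replay_memory(capacity: int,
                        frame_shape: tuple = (4, 84, 84),
                        frames_per_entry: int = 2,
                        threshold: float = 0.75) -> None:
    """Raise MemoryError if the replay queue would exceed `threshold` of system RAM.

    Assumes each entry stores `frames_per_entry` uint8 frame stacks (state and
    next state) and that actions, rewards, and done flags are negligible.
    """
    bytes_per_entry = frames_per_entry * int(np.prod(frame_shape))
    required = capacity * bytes_per_entry
    total = psutil.virtual_memory().total
    if required > threshold * total:
        raise MemoryError(
            f"replay queue needs ~{required / 1e9:.1f} GB, exceeding "
            f"{threshold:.0%} of the {total / 1e9:.1f} GB of system memory"
        )

if __name__ == "__main__":
    # One million transitions of stacked 84x84 frames is roughly 56 GB,
    # so this call raises on a 32 GB machine.
    try:
        check_replay_memory(capacity=1_000_000)
    except MemoryError as err:
        print(err)

Running the check when the agent is constructed would fail fast instead of letting the queue grow until the OOM killer (or a full swap) takes the system down.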

Kautenja commented 6 years ago

With the new nes-py back-end in use by the gym-super-mario-bros package, this issue should be resolved. Closing for now.