Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Develop Branch: Can't Run learn.py on Windows 10 (commit c1bad57) #1920

Closed. Phong13 closed this issue 5 years ago.

Phong13 commented 5 years ago

I have a 16-core Threadripper, so I was very excited to see that training a single brain with multiple environments is now possible on the develop branch, as described here:

https://github.com/Unity-Technologies/ml-agents/issues/828

I checked out the develop branch (commit c1bad57). First I noticed that mlagents-learn was not updated on this branch with the new --num-envs parameter. However, the parameter does exist in the learn.py file.

I tried to "pip install" from this develop branch but ran into some issues with a missing python class. Eventually I "pip uninstalled" both mlagents and mlagents-envs and set the PYTHONPATH too point to the develop checkout folders.

Now I can run learn.py. However, it fails at launch with an error:

Launch using:

python ml-agents\mlagents\trainers\learn.py config\trainer_config.yaml --run-id=BugTest --train

INFO:mlagents.trainers:{'--base-port': '5005', '--curriculum': 'None', '--debug': False, '--docker-target-name': 'None', '--env': 'None', '--help': False, '--keep-checkpoints': '5', '--lesson': '0', '--load': False, '--no-graphics': False, '--num-envs': '1', '--num-runs': '1', '--run-id': 'BugTest', '--save-freq': '50000', '--seed': '-1', '--slow': False, '--train': True, '<trainer-config-path>': 'config\trainer_config.yaml'}
Traceback (most recent call last):
  File "ml-agents\mlagents\trainers\learn.py", line 279, in <module>
    main()
  File "ml-agents\mlagents\trainers\learn.py", line 264, in main
    run_training(0, run_seed, options, Queue())
  File "ml-agents\mlagents\trainers\learn.py", line 81, in run_training
    env = SubprocessUnityEnvironment(env_factory, num_envs)
  File "F:\Workspace\Unity\ml-agents-develop-multi-train\ml-agents-envs\mlagents\envs\subprocess_environment.py", line 80, in __init__
    self.envs.append(self.create_worker(worker_id, env_factory))
  File "F:\Workspace\Unity\ml-agents-develop-multi-train\ml-agents-envs\mlagents\envs\subprocess_environment.py", line 89, in create_worker
    child_process.start()
  File "C:\Users\iande\Anaconda3\envs\ml-agents\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\iande\Anaconda3\envs\ml-agents\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\iande\Anaconda3\envs\ml-agents\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\iande\Anaconda3\envs\ml-agents\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\iande\Anaconda3\envs\ml-agents\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_environment_factory.<locals>.create_unity_environment'

(ml-agents) F:\Workspace\Unity\ml-agents-develop-multi-train>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\iande\Anaconda3\envs\ml-agents\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\iande\Anaconda3\envs\ml-agents\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I am using Windows 10 Pro, Python 3.6 with Anaconda, running on CPU (not GPU).

I tested the master branch to see if it also gets this error. It does not; the master branch runs learn.py as expected.

Confirmed that the error was introduced to the develop branch on April 3rd with commit e59eff4.

The develop branch prior to that commit works. The error appears to be related to starting a Process on line 89 of subprocess_environment.py: the local function create_environment_factory.<locals>.create_unity_environment cannot be pickled, and pickling the target is required to create the new Process on Windows. The EOFError in the second traceback looks like a secondary symptom: the parent fails while pickling the process object, so the spawned child runs out of input when it tries to read it from the pipe.
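For reference, the failure is easy to reproduce outside of ml-agents. Pickle can only serialize a function it can look up by its module-level qualified name, and "spawn" (the only multiprocessing start method on Windows) pickles the Process target to ship it to the child. A minimal standalone repro (my own sketch, not ml-agents code):

import multiprocessing

def create_environment_factory():
    # Nested function: pickle cannot resolve it by qualified name,
    # because it only exists inside the enclosing call's scope.
    def create_unity_environment():
        print("worker started")
    return create_unity_environment

if __name__ == "__main__":
    # "spawn" is the only start method on Windows; forcing it here
    # reproduces the error on any OS.
    multiprocessing.set_start_method("spawn")
    p = multiprocessing.Process(target=create_environment_factory())
    p.start()  # AttributeError: Can't pickle local object
               # 'create_environment_factory.<locals>.create_unity_environment'
    p.join()

On Linux the default "fork" start method copies the parent process instead of pickling the target, which is presumably why this wasn't caught before it hit Windows.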

harperj commented 5 years ago

Hi @Phong13 -- this is a known issue with the develop branch which we have now fixed but not yet merged into develop. I've added the fix to the release-v0.8 branch. Can you try the release branch and see if this issue is fixed for you? We'll merge this back into develop shortly.
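If anyone needs to patch locally before the merge lands, the usual fix for this class of error is to replace the closure with something pickle can resolve by name, for example a module-level function bound with functools.partial. This is a minimal sketch of the pattern only, not necessarily how the release-v0.8 branch implements it:

import multiprocessing
from functools import partial

def create_unity_environment(env_path, worker_id):
    # Module-level function: picklable by its qualified name.
    print("starting environment", worker_id, "from", env_path)

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    # A partial over a module-level function pickles cleanly, so spawn
    # can send it to the child. "3DBall.exe" is a placeholder path.
    factory = partial(create_unity_environment, "3DBall.exe")
    p = multiprocessing.Process(target=factory, args=(0,))
    p.start()
    p.join()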

Phong13 commented 5 years ago

Great news! Will try this later today when I am back at the office.

Phong13 commented 5 years ago

This appears to be working great! Definitely a big speed boost!

harperj commented 5 years ago

Glad to hear it @Phong13! We'd love to see / hear what kind of speedups you end up reaching using the parallel environments if you'd ever like to share. We've noticed the best speedup on real games :-)

I'm going to close this issue since it seems we've addressed the original problem, but feel free to reopen or create another issue if you have any more trouble. The official release + publish to PyPI will be coming soon.

Phong13 commented 5 years ago

The speedup in my case is very impressive because I have a 16-core machine. I have not made careful measurements, but training that used to take 48 hours to reach a particular reward is now taking 5-6 hours to reach the same reward.

On the downside, it doesn't seem as stable as it was with a single process. If I quit training using CTRL-C and try to resume, it sometimes hangs or, after a while, crashes my machine. It seems to try to save, but the save is corrupt, and if I try to continue later it hangs or crashes. If I launch a fresh run, it works. Is there a less heavy-handed way to terminate training and save a partial run than CTRL-C? I don't mind waiting a few minutes for the Python process to finish its current tasks and save its work in a way that it can safely resume later.
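What I have in mind is the usual flag-based SIGINT pattern, where the handler only requests a stop and the training loop checkpoints and exits at the next safe boundary. This is an illustrative sketch only; I don't know whether the ml-agents trainer exposes a hook like this, and the two step/save functions below are stand-ins:

import signal
import time

stop_requested = False

def request_stop(signum, frame):
    # Don't die mid-write; just note that a stop was requested so the
    # loop below can exit at the next step boundary.
    global stop_requested
    stop_requested = True

def run_one_training_step():
    time.sleep(0.1)            # stand-in for one real training step

def save_checkpoint():
    print("checkpoint saved")  # stand-in for a real, atomic save

signal.signal(signal.SIGINT, request_stop)
while not stop_requested:
    run_one_training_step()
save_checkpoint()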


lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.