Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

[RLlib] Train in clusters. #2364

Closed HanwGeek closed 5 years ago

HanwGeek commented 5 years ago

We wrote a custom UnityEnv to run our RL algorithm on a cluster of 8 machines. In our case, we use the ray library to set up and manage the cluster. However, when we run train.py, the log output is as follows:

ray.exceptions.RayTaskError: ray_RolloutWorker:sample() (pid=45151, host=host1)
  File "/home/racer/anaconda3/envs/py36/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 253, in __init__
    self.env = _validate_env(env_creator(env_context))
  File "/home/racer/anaconda3/envs/py36/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 788, in <lambda>
    register_env(name, lambda config: env_object(config))
  File "/home/racer/DeepCampEnv_ver1/UnityEnv.py", line 26, in __init__
    self.env = UnityEnvironment(file_name=env_config["env_directory"], worker_id=env_config["worker_id"], seed=1, no_graphics=env_config["no_graphics"])
  File "/home/racer/DeepCampEnv_ver1/ml-agents/ml-agents/mlagents/envs/environment.py", line 67, in __init__
    aca_params = self.send_academy_parameters(rl_init_parameters_in)
  File "/home/racer/DeepCampEnv_ver1/ml-agents/ml-agents/mlagents/envs/environment.py", line 527, in send_academy_parameters
    return self.communicator.initialize(inputs).rl_initialization_output
  File "/home/racer/DeepCampEnv_ver1/ml-agents/ml-agents/mlagents/envs/rpc_communicator.py", line 61, in initialize
    "The Unity environment took too long to respond. Make sure that :\n"
mlagents.envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
         The environment does not need user interaction to launch
         The Academy and the External Brain(s) are attached to objects in the Scene
         The environment and the Python interface have compatible versions.

I want to know whether ml-agents can be trained and deployed in a cluster.
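For reference, here is a minimal sketch of the kind of wrapper we use (simplified; the real UnityEnv.py derives the gym spaces from the brain parameters, which I have omitted here, and the env name is a placeholder):

```python
import gym
from ray.tune.registry import register_env
from mlagents.envs.environment import UnityEnvironment


class UnityEnv(gym.Env):
    """Gym-style wrapper around a Unity build, configured by RLlib."""

    def __init__(self, env_config):
        # The env_config keys below match the ones visible in the traceback.
        self.env = UnityEnvironment(
            file_name=env_config["env_directory"],
            worker_id=env_config["worker_id"],
            seed=1,
            no_graphics=env_config["no_graphics"],
        )
        # observation_space / action_space setup omitted in this sketch.

    def reset(self):
        ...  # translate Unity BrainInfo into a gym observation

    def step(self, action):
        ...  # send the action to Unity, return (obs, reward, done, info)


register_env("unity_env", lambda config: UnityEnv(config))
```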

harperj commented 5 years ago

Hi @HanwGeek -- you should be able to use ML-Agents within a Ray cluster / with RLlib. Unfortunately I can't give much advice on the best way to do so at this time. You can look at the Ray logs or the Unity Player logs on the workers to find out more specifically why your connection is failing.

HanwGeek commented 5 years ago

I checked the log, but there is only one line, shown below:

Desktop is 0 x 0 @ 0 Hz

By the way, I train my model on an Ubuntu server without a GUI.

harperj commented 5 years ago

Yep, looks like you're running into the issue that Unity requires xserver to render on Linux. You have a couple of options.

First off, if you're using only vector observations (no visual), you can use the "no graphics" mode to avoid the need for xserver. Alternatively, you can use xvfb to perform CPU rendering of your visual observations. If you're using the ML-Agents gym wrapper this is an option you can pass to the constructor. Finally, you could use a GPU and install drivers / xserver. This is a trickier option, but it will have the best performance.
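To make the first two options concrete, here is a minimal sketch (the build path below is a placeholder, and the xvfb-run arguments are illustrative rather than tuned for any particular setup):

```python
from mlagents.envs.environment import UnityEnvironment

# Option 1: vector observations only -- disable rendering entirely,
# so no X server is needed on the worker.
env = UnityEnvironment(file_name="./build/my_env", no_graphics=True)

# Option 2: visual observations on a CPU-only machine -- run the training
# script under a virtual framebuffer instead, e.g.:
#   xvfb-run --auto-servernum --server-args='-screen 0 640x480x24' python train.py
```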

HanwGeek commented 5 years ago

I already set no_graphics=True the first time, but it didn't seem to work. Could you tell me how to use a GPU to do the trick? Thanks a lot!

harperj commented 5 years ago

Hmm, no_graphics should have worked. If you aren't using visual observations, I'd highly recommend giving that another try (and reporting back if there are any other errors). That said, if you want to try to set up an X server, we don't have a general guide, but this might help: https://github.com/Unity-Technologies/obstacle-tower-env/blob/master/examples/gcp_training.md#set-up-xserver

I am not sure what kind of cluster setup you're using, but this was only tested on Google Cloud -- hope it helps.

HanwGeek commented 5 years ago

When I train the agent alone, it works very well. But when I train it in the cluster with the ray library (from UC Berkeley), the "took too long to respond" error comes up. I think it may be an inter-process communication problem.
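If it is a communication problem, could it be that two workers on the same machine try to use the same worker_id port? A sketch of what I might try, assuming the env_config RLlib passes in is an EnvContext with a worker_index attribute:

```python
from mlagents.envs.environment import UnityEnvironment


def make_env(env_config):
    # worker_index is unique per RLlib rollout worker, so every Unity
    # instance on a host would bind its own communication port.
    return UnityEnvironment(
        file_name=env_config["env_directory"],
        worker_id=env_config.worker_index,  # instead of a fixed worker_id
        seed=1,
        no_graphics=env_config["no_graphics"],
    )
```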

harperj commented 5 years ago

@HanwGeek There is no change in the communication when you use no_graphics -- I'd recommend trying it again and looking into the logs to see what went wrong.

HanwGeek commented 5 years ago

@harperj There seems to be no log at ~/.config/unity3d/company/product/. Where else can I look for logs?

harperj commented 5 years ago

Sorry @HanwGeek -- I'm not sure. This might depend on your setup and what component of the system failed. Unfortunately I'm unable to help much with how ML-Agents is used with external tools.

HanwGeek commented 5 years ago

That's OK. I think it might be a complex problem involving multiple libraries, which is hard to fix.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.