Hi @HanwGeek -- you should be able to use ML-Agents within a Ray cluster / with RLLib. Unfortunately I can't give much advice on the best way to do so at this time. You can look at the Ray logs or the Unity Player logs on the workers to try to find out more specifically why your connection is failing.
I checked the log, but there is only one line, shown below:
Desktop is 0 x 0 @ 0 Hz
By the way, I train my model on an Ubuntu server without a GUI.
Yep, looks like you're running into the issue that Unity requires xserver to render on Linux. You have a couple of options.
First off, if you're using only vector observations (no visual), you can use the "no graphics" mode to avoid the need for xserver. Alternatively, you can use xvfb to perform CPU rendering of your visual observations. If you're using the ML-Agents gym wrapper this is an option you can pass to the constructor. Finally, you could use a GPU and install drivers / xserver. This is a trickier option, but it will have the best performance.
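For illustration, here is a minimal sketch of launching a headless Linux build through the ML-Agents gym wrapper; the build path is a placeholder and the exact constructor signature depends on the gym_unity version in use:

```python
# Hedged sketch, not from this thread: open a Linux server build without
# rendering. This only works when the agents use vector observations.
from gym_unity.envs import UnityEnv

env = UnityEnv(
    "/path/to/build.x86_64",  # placeholder path to the Linux build
    worker_id=0,              # each concurrent environment needs a unique id
    use_visual=False,         # vector observations only
    no_graphics=True,         # skip the graphics device, so no xserver needed
)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```

If visual observations are required, the same build can instead be launched under a virtual framebuffer (for example with xvfb-run) so rendering happens on the CPU.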
I already set no_graphics=True the first time, but it didn't seem to work. Could you tell me how to use a GPU to do the trick? Thanks a lot!
Hmm, no_graphics should have worked. If you aren't using visual observations I'd highly recommend giving that another try (and reporting back if there are any other errors). That said, if you want to try to set up XServer we don't have a general guide, but this might help: https://github.com/Unity-Technologies/obstacle-tower-env/blob/master/examples/gcp_training.md#set-up-xserver
I am not sure what kind of cluster setup you're using, but this was only tested on Google Cloud -- hope it helps.
When I train the agent alone, it works very well. But when I train it in a cluster with the Ray library (from UC Berkeley), I get the "took too long to respond" problem. I think it may be a process communication problem.
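If it is a port or communication issue, one thing worth ruling out is that every Ray worker launches its Unity environment with a distinct worker_id, so the gRPC ports don't collide. Below is a hedged sketch of how that wiring might look with RLlib; the build path, environment name, and hyperparameters are placeholders, not the actual train.py from this issue, and the exact signatures depend on the ray and gym_unity versions installed:

```python
# Hedged sketch of registering a Unity build with RLlib so each rollout
# worker opens its own port. Names and paths are placeholders.
import ray
from ray import tune
from ray.tune.registry import register_env
from gym_unity.envs import UnityEnv

def unity_env_creator(env_config):
    # env_config.worker_index is filled in by RLlib (0 for the driver,
    # 1..num_workers for rollout workers); reusing it as the ML-Agents
    # worker_id keeps the environment ports distinct across processes.
    return UnityEnv(
        "/path/to/build.x86_64",           # placeholder Linux build path
        worker_id=env_config.worker_index,
        use_visual=False,
        no_graphics=True,
    )

register_env("unity_custom_env", unity_env_creator)

ray.init(address="auto")  # attach to the already-running Ray cluster
tune.run(
    "PPO",
    config={
        "env": "unity_custom_env",
        "num_workers": 8,  # e.g. one rollout worker per machine
    },
    stop={"timesteps_total": 1_000_000},
)
```

Giving each rollout worker its own worker_id (and therefore its own base port) is often the first thing to check when several environments start at once and one of them reports a connection timeout.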
@HanwGeek There is no change in the communication when you use no_graphics -- I'd recommend trying it again and looking into the logs to see what went wrong.
@harperj It seems there is no log at ~/.config/unity3d/company/product/. Where else can I look for logs?
Sorry @HanwGeek -- I'm not sure. This might depend on your setup and what component of the system failed. Unfortunately I'm unable to help much with how ML-Agents is used with external tools.
That's OK. I think it might be a complex problem involving multiple libraries, which is hard to fix.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
We customized a UnityEnv to run our RL algorithm on a cluster of 8 machines. In our case, we use the Ray library to set up and manage the cluster. However, when we run train.py, the log output is as follows:

I want to know whether ML-Agents can be trained and deployed in a cluster.