Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Doubt: Tensorflow computations in GPU in docker #1365

Closed icaro56 closed 6 years ago

icaro56 commented 6 years ago

Hi,

Does training ML-Agents on a Linux machine with Docker speed up training? If the machine is CUDA-capable, are the TensorFlow computations done on the GPU?

awjuliani commented 6 years ago

Hi @icaro56,

You will have to be sure to install tensorflow-gpu instead of tensorflow, which is installed by default due to our requirements for the mlagents package. Currently the implementation of PPO we use does not take great advantage of a GPU. Only if you were to use large visual observations would you typically find an advantage from using a GPU.
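As a quick way to check which build is actually present inside a container, here is a minimal Python sketch. It assumes Python 3.8+ (for `importlib.metadata`) and relies on `device_lib.list_local_devices()`, a TensorFlow helper that is not mentioned in this thread. The function names `installed_tensorflow_builds` and `visible_gpus` are made up for illustration.

```python
from importlib import metadata


def installed_tensorflow_builds(candidates=("tensorflow", "tensorflow-gpu")):
    """Return {distribution name: version} for each candidate pip reports as installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this distribution is not installed
    return found


def visible_gpus():
    """Return the GPU device names TensorFlow can see, or None if TensorFlow cannot be imported."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    # Note: listing devices initializes them, which may allocate GPU memory.
    return [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
```

Calling `installed_tensorflow_builds()` shows whether the default `tensorflow` package or `tensorflow-gpu` ended up in the image. An empty list from `visible_gpus()` would suggest that TensorFlow is running on the CPU only.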

icaro56 commented 6 years ago

@awjuliani ,

I created an image on Docker Hub for tensorflow-gpu: https://hub.docker.com/r/icaro56/ml-agents_images/tags/

But this image does not work. I changed tensorflow to tensorflow-gpu.

This error occurs when I try to run training:

root@1cdd415e4b12:/workspace/unity-volume# mlagents-learn ./trainer_config.yaml --env=Bomberman --run-id=bomberman_test --train
Traceback (most recent call last):
  File "/usr/local/bin/mlagents-learn", line 7, in <module>
    from mlagents.trainers.learn import main
  File "/usr/local/lib/python3.6/site-packages/mlagents/trainers/__init__.py", line 4, in <module>
    from .models import *
  File "/usr/local/lib/python3.6/site-packages/mlagents/trainers/models.py", line 5, in <module>
    import tensorflow.contrib.layers as c_layers
ModuleNotFoundError: No module named 'tensorflow.contrib.layers'
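The traceback above fails on a single import, so one way to narrow the problem down is to test each import step separately inside the container before launching `mlagents-learn`. The sketch below is only a diagnostic aid and does not fix the image. `try_import` is a hypothetical helper name, and the thread does not state the cause of the error.

```python
import importlib


def try_import(module_name):
    """Return (True, None) if the module imports, else (False, error message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
        return False, str(exc)
    return True, None


# Example use inside the container (module names taken from the traceback):
#   try_import("tensorflow")
#   try_import("tensorflow.contrib.layers")
```

If the first call succeeds and the second fails, the installed TensorFlow package is importable but does not provide the `contrib.layers` module that `mlagents/trainers/models.py` expects.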

mneilly commented 6 years ago

If you have an NVIDIA GPU you can give this Dockerfile a try and see if it works for you. It is derived from the Unity docker image but uses nvidia/cudagl/cudnn and nvidia-docker2. It will let you train using the GPU (in headless mode if desired).

https://github.com/mneilly/linux-unity-ml-agents-nvidia-docker

icaro56 commented 6 years ago

Thanks @mneilly . I will try to use this.

icaro56 commented 6 years ago

I am having a timeout problem. Look:

gpg: keyring `/tmp/tmp.M92x8F85ox/secring.gpg' created
gpg: keyring `/tmp/tmp.M92x8F85ox/pubring.gpg' created
gpg: requesting key AA65421D from hkp server keyserver.ubuntu.com
gpg: keyserver timed out
gpg: keyserver receive failed: keyserver error
gpg: requesting key AA65421D from hkp server ha.pool.sks-keyservers.net
gpg: keyserver timed out
gpg: keyserver receive failed: keyserver error
gpg: requesting key AA65421D from hkp server pgp.mit.edu
gpg: keyserver timed out
gpg: keyserver receive failed: keyserver error
gpg: requesting key AA65421D from hkp server keyserver.pgp.com
gpg: keyserver timed out
gpg: keyserver receive failed: keyserver error
The command '/bin/sh -c export GNUPGHOME="$(mktemp -d)" && ( gpg --keyserver keyserver.ubuntu.com --recv-keys "$GPG_KEY" || gpg --keyserver ha.pool.sks-keyservers.net --recv-keys "$GPG_KEY" || gpg --keyserver pgp.mit.edu --recv-keys "$GPG_KEY" || gpg --keyserver keyserver.pgp.com --recv-keys "$GPG_KEY" ) && gpg --batch --verify python.tar.xz.asc python.tar.xz && rm -rf "$GNUPGHOME" python.tar.xz.asc && mkdir -p /usr/src/python && tar -xJC /usr/src/python --strip-components=1 -f python.tar.xz && rm python.tar.xz && cd /usr/src/python && gnuArch="$(dpkg-architecture --query DEB_BUILD_GNU_TYPE)" && ./configure --build="$gnuArch" --enable-loadable-sqlite-extensions --enable-shared --with-system-expat --with-system-ffi --without-ensurepip && make -j "$(nproc)" && make install && ldconfig && apt-get purge -y --auto-remove $buildDeps && find /usr/local -depth \( \( -type d -a \( -name test -o -name tests \) \) -o \( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \) -exec rm -rf '{}' + && rm -rf /usr/src/python' returned a non-zero code: 2

mneilly commented 6 years ago

Well... if none of the key servers are responding then it looks like you are having an issue with the network and will need to try again later...

icaro56 commented 6 years ago

I changed the keyserver addresses to use port 80 and it works.

gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys "$GPG_KEY" \
|| gpg --keyserver hkp://ha.pool.sks-keyservers.net:80 --recv-keys "$GPG_KEY" \
|| gpg --keyserver hkp://pgp.mit.edu:80 --recv-keys "$GPG_KEY" \
|| gpg --keyserver hkp://keyserver.pgp.com:80 --recv-keys "$GPG_KEY" \
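The thread does not say why port 80 works where the default did not. One plausible explanation, offered here only as an assumption, is that the network blocks outbound traffic on 11371, the default HKP port. The Python sketch below builds the port-80 URLs used above and includes a small TCP probe to test that assumption. The names `hkp_url`, `can_connect`, and `probe` are hypothetical.

```python
import socket

# Keyserver hosts taken from the Dockerfile command quoted in this thread.
KEYSERVERS = [
    "keyserver.ubuntu.com",
    "ha.pool.sks-keyservers.net",
    "pgp.mit.edu",
    "keyserver.pgp.com",
]

HKP_DEFAULT_PORT = 11371  # port used for hkp:// when none is given


def hkp_url(host, port=80):
    """Build an hkp:// keyserver URL with an explicit port."""
    return f"hkp://{host}:{port}"


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def probe(hosts=KEYSERVERS, ports=(HKP_DEFAULT_PORT, 80)):
    """Map each (host, port) pair to whether it is reachable from this machine."""
    return {(host, port): can_connect(host, port) for host in hosts for port in ports}
```

If `probe()` reports port 80 as reachable and 11371 as unreachable for the same host, a firewall rule is a likely culprit. Some of these keyservers may no longer be online, so a failure on both ports says nothing about the local network.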

I hope the image works now! :)

icaro56 commented 6 years ago

The image is working now. But as @awjuliani said, "currently the implementation of PPO we use does not take great advantage of a GPU".

icaro56 commented 5 years ago

Hi. Try to use the tags gpu or cpu:

icaro56/ml-agents_images:cpu
icaro56/ml-agents_images:gpu

I am using these in my research.

@icaro56 Thanks but it does not work. It gave me:

Docker image path: index.docker.io/icaro56/ml-agents_images:latest
ERROR MANIFEST_UNKNOWN: manifest unknown
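The error quoted above shows the pull resolving to the `latest` tag, which Docker assumes when an image reference carries no tag, while the reply suggests the `cpu` and `gpu` tags. The Python sketch below illustrates that defaulting rule in a simplified form. It is not Docker's actual reference parser, and `with_default_tag` is a hypothetical name.

```python
def with_default_tag(image_ref):
    """Append ':latest' when the reference carries neither a tag nor a digest."""
    if "@" in image_ref:  # pinned by digest, e.g. name@sha256:...
        return image_ref
    last_component = image_ref.rsplit("/", 1)[-1]
    if ":" in last_component:  # explicit tag; a ':' earlier in the ref is a registry port
        return image_ref
    return image_ref + ":latest"
```

Under this rule, `icaro56/ml-agents_images` becomes `icaro56/ml-agents_images:latest`, which matches the path in the error, whereas `icaro56/ml-agents_images:gpu` is left unchanged.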

maystroh commented 5 years ago

Thanks for your quick reply @icaro56. I figured it out and deleted my message in a hurry. I'm sorry for the naive question. BTW, I'm currently building a Docker image that includes tensorflow-gpu, the NVIDIA driver, and an X server in order to train with visual observations on a server machine. Have you ever done this before? I'm running into issues setting up the X server while following the instructions in https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-on-Amazon-Web-Service.md. Here is the Dockerfile I have so far. I still can't figure it out. Can anyone please help?

icaro56 commented 5 years ago

Have you ever done it before?

No, I have not.

I use only vector observations. And the GPU version of the Docker image has practically the same speed as the CPU version. :(

Maybe @mneilly can help you.

maystroh commented 5 years ago

Ok @icaro56 Thanks.

icaro56 commented 5 years ago

@maystroh, I ran new training runs with ml-agents using both tensorflow-gpu and the CPU build of tensorflow, and what I'm seeing is that the CPU build trains faster than the GPU build.

The machine I use has 8 GPU cards in parallel.
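For anyone wanting to reproduce this kind of CPU versus GPU comparison on one machine, a minimal Python sketch follows. It relies on the `CUDA_VISIBLE_DEVICES` environment variable, which CUDA applications honor: an empty value hides every GPU, and a comma-separated list exposes only those device indices. The helper names `env_for_devices` and `timed_run` are hypothetical, and this is not how the measurements in this thread were taken.

```python
import os
import subprocess
import time


def env_for_devices(device_ids, base_env=None):
    """Copy the environment and expose only the given GPU indices to CUDA.

    An empty list sets CUDA_VISIBLE_DEVICES="" and hides every GPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


def timed_run(cmd, env):
    """Run cmd to completion and return (exit code, wall-clock seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, env=env)
    return result.returncode, time.perf_counter() - start


# Example use with the training command quoted earlier in this thread
# (the run-id values are placeholders):
#   cmd = ["mlagents-learn", "./trainer_config.yaml", "--env=Bomberman", "--train"]
#   timed_run(cmd + ["--run-id=cpu_only"], env_for_devices([]))
#   timed_run(cmd + ["--run-id=one_gpu"], env_for_devices([0]))
```

Running the same command with the same image under both environments keeps everything but device visibility constant, which makes the wall-clock numbers easier to compare.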

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.