GCP Tutorial may not be using the GPU

Unity-Technologies / obstacle-tower-env

Obstacle Tower Environment

Apache License 2.0


GCP Tutorial may not be using the GPU #52

Open Sohojoe opened 5 years ago

Sohojoe commented 5 years ago

When following the GCP tutorial, I see a TensorFlow warning that the installed TensorFlow build is not optimized for the CPU.

Given that the cloud instance does include the optimized version of TensorFlow, I wonder if installing obstacle-tower-env overrides it. If so, it may have installed the unoptimized, CPU-only TensorFlow, since ml-agents has the requirement 'tensorflow>=1.7,<1.8'.

The training speed seems slow: 56 steps per second, compared with 130 steps per second on my home PC.

awjuliani commented 5 years ago

@ervteng Can you speak to this? We had validated internally that we were using GCP/GPU.

ervteng commented 5 years ago

Hi @Sohojoe, the obstacle-tower-env doesn't have a TensorFlow requirement as it doesn't install ml-agents. You can check the GPU usage with nvidia-smi. What type of GPU are you running locally?
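For anyone who wants to log GPU usage during training instead of watching nvidia-smi by hand, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and uses its CSV query mode. The function and key names (`parse_gpu_stats`, `query_gpu_stats`, `utilization_pct`, `memory_used_mib`) are illustrative and not part of obstacle-tower-env.

```python
import subprocess

# nvidia-smi's CSV query mode prints one line per GPU, e.g. "10, 1543"
# (utilization in percent, memory used in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(output):
    """Parse nvidia-smi CSV output into one dict per GPU.

    Assumes both fields are numeric; some GPUs report "[N/A]" for a field,
    which this sketch does not handle.
    """
    stats = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        util, mem = (field.strip() for field in line.split(","))
        stats.append({"utilization_pct": int(util), "memory_used_mib": int(mem)})
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_stats(result.stdout)
```

Calling `query_gpu_stats()` periodically while training runs should show whether utilization stays near zero (GPU unused) or just low (GPU in use but the job is bottlenecked elsewhere).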

Sohojoe commented 5 years ago

@ervteng - I have a GTX 1080 locally. How many training steps per second do you see?

Running nvidia-smi shows that it is using the GPU, so I was wrong.

I guess the default TensorFlow build does not include CPU optimizations, and that is why it shows the warning.

kwea123 commented 5 years ago

@ervteng what do you mean by "the obstacle-tower-env doesn't have a TensorFlow requirement as it doesn't install ml-agents"? Then what does this mean in the README?

Requirements
The Obstacle Tower environment runs on Mac OS X, Windows, or Linux.

Python dependencies (also in setup.py):

Unity ML-Agents v0.6
OpenAI Gym
Pillow

Also, I remember that my TensorFlow version was overwritten with 1.7.1 when running pip install -e . from this repo. I then re-installed 1.9.0 and had no problem running the Obstacle Tower environment...
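To check whether an install step has changed the TensorFlow version, one option is to compare the installed version against the 'tensorflow>=1.7,<1.8' range quoted earlier in the thread. The sketch below is a simplified comparison and not pip's full version logic: it ignores pre-release suffixes, and the helper names are made up for illustration.

```python
import re


def parse_version(version):
    """Turn '1.7.1' into (1, 7, 1); suffixes such as 'rc0' are ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies_pin(version, lower=(1, 7), upper=(1, 8)):
    """True if lower <= version < upper (defaults mirror 'tensorflow>=1.7,<1.8')."""
    return lower <= parse_version(version) < upper


def installed_tensorflow_version():
    """Return the installed TensorFlow version string, or None if it cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.__version__
```

Running `installed_tensorflow_version()` before and after `pip install -e .` shows whether the version changed; `satisfies_pin("1.7.1")` is true and `satisfies_pin("1.9.0")` is false, matching the two versions mentioned above.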

Sohojoe commented 5 years ago

@kwea123 obstacle tower installs a special version of ml-agents that doesn't specify tensorflow in its install requirements file.

obstacle tower does need tensorflow to run.

The normal ml-agents specifies tensorflow 1.7.x, as this is required for running the trained models from within Unity. obstacle tower doesn't need this.

kwea123 commented 5 years ago

@Sohojoe Oh, I see. Sorry for the misunderstanding @ervteng

ervteng commented 5 years ago

@Sohojoe you are correct, the Readme is wrong (and we'll fix it). The newest versions of OTC no longer use ML-Agents in its entirety, and don't require TensorFlow. Dopamine does require TensorFlow, but as far as I know it will work with most recent versions.

I'm getting about 45.61 steps per second on a T4 on GCP, but it's using only about 10% of the GPU. In our past testing, we found that the OTC environment tends to be CPU-bound. What CPU do you have on your desktop machine? I'm curious to see how we can get the environments training faster.
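One way to tell a CPU-bound environment from a slow training loop is to time the environment alone, with no learning code attached. The sketch below assumes the classic Gym-style `reset()` / `step(action)` interface returning `(obs, reward, done, info)`. `StubEnv` is a hypothetical stand-in so the example runs without Unity; in practice the Obstacle Tower environment object would be passed in instead.

```python
import time


class StubEnv:
    """Hypothetical stand-in for a Gym-style environment (not the real Obstacle Tower)."""

    def __init__(self, episode_length=100):
        self.episode_length = episode_length
        self._t = 0

    def reset(self):
        self._t = 0
        return 0  # dummy observation

    def step(self, action):
        self._t += 1
        done = self._t >= self.episode_length
        return 0, 0.0, done, {}  # obs, reward, done, info


def measure_steps_per_second(env, num_steps=1000, policy=lambda obs: 0):
    """Step the environment num_steps times with a fixed policy and return steps/second."""
    obs = env.reset()
    start = time.perf_counter()
    for _ in range(num_steps):
        obs, reward, done, info = env.step(policy(obs))
        if done:
            obs = env.reset()
    elapsed = time.perf_counter() - start
    return num_steps / max(elapsed, 1e-9)  # guard against a zero-length interval


if __name__ == "__main__":
    print(f"{measure_steps_per_second(StubEnv()):.1f} steps per second")
```

If the environment-only rate is close to the rate seen during training, the environment (and therefore the CPU) is the likely bottleneck, which would fit the low GPU utilization reported above.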

Sohojoe commented 5 years ago

I have an i7-8700K @ 3.7GHz, which has 6 cores / 12 threads.

A big help to performance would be to support multiple instances of the environment within the Unity level. I regularly train with 128 concurrent agents, and I'm reading some papers where they go up to 2048. I made a modification to ml-agents in my dev branch of marathon-envs which enables one to set --num-agents=128 on the command line to specify the number of agents. I would be happy to work on a PR, but it does require the environment to work relative to its spawn position.

I have also been working on adapting large-scale-curiosity to work with obstacle tower, as it supports instancing via MPI. I have been able to get it training on Windows at 400-500 fps, but it is not learning yet. Also, MPI on Windows is not very stable and I've only been able to get 16-24 instances running (but this should not be a problem on Linux servers). My code is here