Open Sohojoe opened 5 years ago
@ervteng Can you speak to this? We had validated internally that we were using GCP/GPU.
Hi @Sohojoe, the obstacle-tower-env doesn't have a TensorFlow requirement as it doesn't install ml-agents. You can check the GPU usage with nvidia-smi. What type of GPU are you running locally?
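For anyone who wants to log GPU usage while training rather than eyeball it, here is a minimal sketch that wraps nvidia-smi's CSV query mode. The function names `parse_gpu_stats` and `query_gpu_stats` are made up for this example, and it assumes every queried field comes back as a plain integer (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import subprocess

# Fields understood by nvidia-smi's --query-gpu option.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV output (noheader, nounits) into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "utilization_pct": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `query_gpu_stats()` every few seconds during a training run gives a rough picture of whether the GPU is actually busy or mostly idle.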
@ervteng - I have a GTX 1080 locally. How many training steps per second do you see?
Running nvidia-smi shows that it is using the GPU, so I was wrong:
I guess the default TensorFlow build does not include CPU optimizations, and that is why it shows the warning:
@ervteng what do you mean by "the obstacle-tower-env doesn't have a TensorFlow requirement as it doesn't install ml-agents"? Then what does this mean in the README?
Requirements
The Obstacle Tower environment runs on Mac OS X, Windows, or Linux.
Python dependencies (also in setup.py):
Unity ML-Agents v0.6
OpenAI Gym
Pillow
Also, I remember that my TensorFlow version was overwritten with 1.7.1 when running pip install -e . from this repo. However, I re-installed 1.9.0 and found that there was no problem running the Obstacle Tower environment...
@kwea123 Obstacle Tower installs a special version of ml-agents that doesn't specify tensorflow in its install requirements file.
obstacle tower does need tensorflow to run.
The normal ml-agents specifies tensorflow 1.7.x, as this is required for running the trained models from within Unity. Obstacle Tower doesn't need this.
@Sohojoe Oh, I see. Sorry for the misunderstanding @ervteng
@Sohojoe you are correct, the README is wrong (and we'll fix it). The newest versions of OTC no longer use ML-Agents in its entirety and don't require TensorFlow. Dopamine does require TensorFlow, but as far as I know it will work with most recent versions.
I'm getting about 45.61 steps per second on a T4 on GCP, but it's using only about 10% of the GPU. In our past testing, we found that the OTC environment tends to be CPU-bound. What CPU do you have on your desktop machine? I'm curious to see how we can get the environments training faster.
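One way to check whether the environment or the learner is the bottleneck is to time the two halves of the loop separately. The sketch below is illustrative only: `profile_training_loop`, `env_step`, and `agent_step` are hypothetical names, and the callables stand in for whatever your training code does to step the environment and update the agent.

```python
import time


def profile_training_loop(env_step, agent_step, n_steps=1000):
    """Time environment stepping and agent updates separately.

    env_step and agent_step are zero-argument callables supplied by the
    caller (hypothetical stand-ins for the real training loop).
    """
    env_time = 0.0
    agent_time = 0.0
    for _ in range(n_steps):
        t0 = time.perf_counter()
        env_step()    # e.g. env.step(action) in a Gym-style loop
        t1 = time.perf_counter()
        agent_step()  # e.g. one forward/backward pass of the network
        t2 = time.perf_counter()
        env_time += t1 - t0
        agent_time += t2 - t1
    total = env_time + agent_time
    return {
        "steps_per_second": n_steps / total,
        "env_fraction": env_time / total,  # share of wall time spent in the env
    }
```

If `env_fraction` is close to 1.0, a faster GPU is unlikely to raise the steps-per-second figure, which would be consistent with low GPU utilization.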
I have an i7-8700K @ 3.7GHz, which has 6 cores / 12 threads
A big help to performance would be to support multiple instances of the environment within the Unity level. I regularly train with 128 concurrent agents and I'm reading some papers where they go up to 2048. I made a modification to ml-agents in my dev branch of marathon-envs which enables one to set --num-agents=128 in the command line to specify the number of agents. I would be happy to work on a PR. But, it does require the environment to work relative to its spawn position.
I have also been working on adapting large-scale-curiosity to work with Obstacle Tower, as it supports instancing via MPI. I have been able to get it training on Windows at 400-500 fps, but it is not learning yet. Also, MPI on Windows is not very stable and I've only been able to get 16-24 instances running (but this should not be a problem on Linux servers). My code is here
When following the GCP tutorial, I see a TensorFlow warning that the installed version of TensorFlow is not optimized for the CPU.
Given that the cloud instance does include the optimized version of TensorFlow, I wonder if installing obstacle-tower-env overrides the optimized version. If this is the case, it may have installed the unoptimized, CPU-only TensorFlow, as ml-agents has the requirement 'tensorflow>=1.7,<1.8'
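To see why a pin like this would cause pip to replace a newer TensorFlow, here is a simplified sketch of the range check. It only handles plain dotted numeric versions and is not a full PEP 440 implementation (pip's real resolver is more thorough); `version_tuple` and `in_range` are names invented for the example.

```python
def version_tuple(version):
    """Turn '1.7.1' into (1, 7, 1). Assumes purely numeric, dot-separated parts."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, lower, upper):
    """True if lower <= version < upper, mirroring a '>=lower,<upper' pin."""
    return version_tuple(lower) <= version_tuple(version) < version_tuple(upper)


# The pin quoted above is '>=1.7,<1.8'.
print(in_range("1.7.1", "1.7", "1.8"))  # a 1.7.x release satisfies the pin
print(in_range("1.9.0", "1.7", "1.8"))  # 1.9.0 falls outside it
```

Under that pin, an installed 1.9.0 does not satisfy the requirement, so pip would swap it for a 1.7.x release when installing a package that declares it.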
The training speed seems slow: 56 steps per second compared with 130 steps per second on my home pc: