Closed kncrane closed 4 years ago
Hello,
Linking the issues you mentioned: https://github.com/hill-a/stable-baselines/issues/308 and https://github.com/hill-a/stable-baselines/issues/201
I'm not that surprised for PPO: if I remember correctly, the default hyperparameters for HalfCheetahBulletEnv-v0
use a single environment with a small network ([64, 64]).
You have several possible bottlenecks:
the speed also depends on the low-level linear algebra libraries (MKL / OpenBLAS) installed on your machine.
However, if you use SAC/TD3 with a bigger network ([256, 256] or [400, 300]), you will see the benefit of having a GPU.
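To see why a [64, 64] network barely uses a GPU, a back-of-the-envelope FLOP count helps. This is only a sketch: the observation and action dimensions below are assumptions for illustration, not something stated in the thread.

```python
# Rough multiply-add count for one forward pass of a [64, 64] MLP policy.
# obs_dim/act_dim are illustrative values assumed for HalfCheetahBulletEnv-v0.
obs_dim, act_dim = 26, 6
hidden = [64, 64]
dims = [obs_dim] + hidden + [act_dim]

# ~2 * n_in * n_out operations per fully connected layer
flops = sum(2 * n_in * n_out for n_in, n_out in zip(dims, dims[1:]))
print(flops)  # -> 12288
```

A forward pass on the order of ten thousand FLOPs is far too small to saturate a GPU; the per-batch launch and host-to-device transfer overheads dominate, so a CPU can easily come out ahead.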
Thanks for the fast reply, good to know all those things; will read up a bit more.
I wasn't particularly bothered about the training taking 2 hr 30 min, but I had assumed I would benefit from (or indeed be forced into) moving to the cloud once I start working with more complex environments/models, and was surprised/concerned that I wasn't seeing a speed-up with this simple example.
Thanks again
If you want to make the training faster (but not more sample efficient), you can use more envs (cf doc).
You may also take a look at MPI for ppo: https://twitter.com/hardmaru/status/1260852988475658242
Describe the question
Hi, would anyone be able to clarify something for me? I have run the script below; it took 1 hr 45 min to run locally on my laptop and 2 hr 30 min on a GPU-enabled Azure VM, which is the opposite of what I was expecting: I thought I would get a speed-up. Why is that?

I have had a look at issues #201 and #308 to try to answer this myself, but am not confident enough in my understanding to come to a conclusion. I understand that low GPU utilisation is normal when using a powerful GPU, and that only the gradient updates are performed on the GPU; environment simulation is done on the CPU and can therefore create a bottleneck.

I am trying to figure out whether this outcome is expected (for some reason such as an overhead I'm not aware of, or because the CPUs on my laptop are better than the virtual ones, so I'm not seeing the benefit of the GPU), whether it's a VM configuration problem (I appreciate I'd need to get help elsewhere in that case), or whether I am missing something in my script / am not using rl-baselines-zoo properly.
Code Example
System Info My laptop is a Lenovo Thinkpad T480s with Intel® Core™ i7-8550U Processor (4 cores, 8 threads, 1.80 GHz processor base frequency) running 64-bit Ubuntu 18.04.4 LTS.
The Azure VM is a Standard_NC6 from the NC-series, with 6 vCPUs and one NVIDIA Tesla K80 GPU. I installed CUDA 10.0 and libcudnn 7.4.2.24-1+cuda10.0 and checked GPU availability with
tf.config.experimental.list_physical_devices('GPU')
On both I have tensorflow 1.14.0 (tensorflow-gpu on VM) and python 3.6.8.
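One way to separate slow virtual CPUs from GPU overhead (a diagnostic suggestion, not something confirmed in the thread) is to re-run the exact same training on the VM with the GPU hidden, giving a CPU-only baseline on identical hardware:

```python
import os

# Hide all CUDA devices from TensorFlow. This must be set before
# TensorFlow is imported, otherwise TF has already claimed the GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# After this, tf.config.experimental.list_physical_devices('GPU')
# returns an empty list and training falls back to the CPU.
```

If the CPU-only run on the VM is still slower than the laptop, the virtual CPUs are the likely culprit rather than any GPU-related overhead.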
Additional During training on the VM, I ran the command
nvidia-smi
which showed 5 Python processes using the GPU and typically 42% under Volatile GPU-Util.