araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

Why is PPO training slower on VM with GPU #83

Closed kncrane closed 4 years ago

kncrane commented 4 years ago

Describe the question
Hi, would anyone be able to clarify something for me? I ran the script below and it took 1 hr 45 min locally on my laptop, but 2 hr 30 min on a GPU-enabled Azure VM, which is the opposite of what I was expecting; I thought I would get a speed-up. Why is that?

I have had a look at issues #201 and #308 to try to answer this myself, but I am not confident enough in my understanding to come to a conclusion. I understand that low GPU utilisation is normal when using a powerful GPU, and that only the gradient updates are performed on the GPU; environment simulation is done on the CPU and can therefore create a bottleneck.

I am trying to figure out whether this outcome is expected (for some reason such as an overhead I am not aware of, or because the CPUs on my laptop are better than the virtual ones, so I am not seeing the benefit of the GPU), whether it is a VM configuration problem (I appreciate I would need to get help elsewhere in that case), or whether I am missing something in my script / am not using rl-baselines-zoo properly.

Code Example

import datetime
import multiprocessing as mp
import subprocess

seeds = [64, 42, 96, 33, 11]

def training_run(seed):
    print(seed)
    # Launch one rl-baselines-zoo training run for this seed.
    command = ("python train.py --algo ppo2 --env HalfCheetahBulletEnv-v0 "
               "-n 2000000 --tensorboard-log /home/b3024896/tb_events/stable-baselines/ "
               "--verbose 1 --seed {}".format(seed))
    subprocess.call(command, shell=True)
    print("Finished training run with seed {}".format(seed))

if __name__ == "__main__":
    time_start = datetime.datetime.now()
    print("Starting training at {}".format(time_start))

    # Run all seeds in parallel, leaving one core free.
    pool = mp.Pool(mp.cpu_count() - 1)
    pool.map_async(training_run, seeds).get()
    pool.close()
    pool.join()

    time_end = datetime.datetime.now()
    time_spent = time_end - time_start
    # total_seconds() also counts whole days, unlike the .seconds attribute.
    time_spent_mins = time_spent.total_seconds() / 60

    print("Finished training")
    print("Training took {} minutes".format(time_spent_mins))

System Info My laptop is a Lenovo Thinkpad T480s with Intel® Core™ i7-8550U Processor (4 cores, 8 threads, 1.80 GHz processor base frequency) running 64-bit Ubuntu 18.04.4 LTS.

The Azure VM is a Standard_NC6 from the NC-series, with 6 vCPUs and one NVIDIA Tesla K80 GPU. I installed CUDA 10.0 and libcudnn 7.4.2.24-1+cuda10.0 and checked GPU availability with tf.config.experimental.list_physical_devices('GPU')

On both I have tensorflow 1.14.0 (tensorflow-gpu on VM) and python 3.6.8.

Additional context
During training on the VM, I ran nvidia-smi, which showed 5 python GPU processes and typically 42% under Volatile GPU-Util.

araffin commented 4 years ago

Hello, Linking the issues you mentioned: https://github.com/hill-a/stable-baselines/issues/308 and https://github.com/hill-a/stable-baselines/issues/201

I'm not that surprised for PPO; if I remember correctly, the default hyperparameters for HalfCheetahBulletEnv-v0 use one environment with a small network ([64, 64]). You have several different bottlenecks:

- the speed also depends on the low-level libraries (MKL / OpenBLAS) installed on your machine.
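As an aside on that last point: a quick way to check which BLAS/LAPACK backend (MKL vs OpenBLAS) a NumPy build links against is NumPy's own build-config printer, e.g.:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was compiled against;
# MKL vs OpenBLAS can noticeably change CPU throughput.
np.show_config()
```

Comparing this output on the laptop and the VM would show whether the two machines are even using the same low-level math libraries.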

However, if you use SAC/TD3 with a bigger network ([256, 256] or [400, 300]), you will see the benefit of having a GPU.
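To get a rough feel for why a [64, 64] network barely exercises a GPU, here is a small parameter-count sketch (the observation/action dimensions 26 and 6 for HalfCheetahBulletEnv-v0 are my assumption, and value/log-std heads are ignored):

```python
def mlp_params(in_dim, hidden, out_dim):
    """Count weights + biases of a fully-connected MLP."""
    dims = [in_dim] + list(hidden) + [out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

small = mlp_params(26, [64, 64], 6)     # PPO default-size policy
big = mlp_params(26, [256, 256], 6)     # SAC/TD3-style policy
print(small, big, round(big / small, 1))  # -> 6278 74246 11.8
```

With only ~6k parameters, each gradient step is tiny, so the per-step overhead of shipping data to the GPU can easily outweigh the compute it saves; the ~12x larger network tips that balance the other way.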

kncrane commented 4 years ago

Thanks for the fast reply; good to know all those things, I will read up a bit more.

I wasn't particularly bothered about the training taking 2 hr 30 min, but I had assumed I would benefit from (or indeed be forced into) moving to the cloud once I start working with more complex environments / models, and was surprised/concerned that I wasn't seeing a speed-up with this simple example.

Thanks again

araffin commented 4 years ago

If you want to make the training faster (but not more sample efficient), you can use more environments (cf. the doc).
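If I recall the zoo's train.py correctly (an assumption worth double-checking against the repo), its `-params` / `--hyperparams` option accepts `key:value` overrides on top of the tuned yaml, so using more environments for this run might look like:

```shell
# Override the number of parallel environments for this run,
# keeping the other tuned hyperparameters from the zoo's yaml.
python train.py --algo ppo2 --env HalfCheetahBulletEnv-v0 -n 2000000 \
    -params n_envs:8
```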

You may also take a look at MPI for ppo: https://twitter.com/hardmaru/status/1260852988475658242