How to take advantage of GPU training on a Linux Server (AWS) and Docker

meiemari commented 2 years ago

Hi team,

I am running a training via Docker on a Linux Server (headless), like this:

mlagents-learn trainer.yml --env=env-dist/UnityApp.x86_64 --run-id=$RUN_ID --force --debug --no-graphics --num-envs=10

Training takes significantly longer (sometimes 2-3 times as long) than training locally on a Windows machine (no Docker).

I am training in the cloud on AWS G instances, at minimum a g4dn.xlarge (1 GPU, 4 vCPUs, 16 GB memory). My local laptop beats the server every time!

Am I missing something?

My Unity setup:
Unity version: 2020.3.20f1
ML Agents version: 1.0.8

Thanks in advance!

meiemari commented 2 years ago

Additional information: I am using vector observations only, but training in the cloud is still much slower than on my local laptop. Any hints as to how I can solve this riddle are appreciated :) I am running headless, using xvfb (Linux Server with Docker). I have read through older posts, and it seems that using xvfb could be a bottleneck: https://github.com/Unity-Technologies/ml-agents/issues/1846.

Is there an alternative to xvfb that I can use now?

kenminglee commented 2 years ago

Hey @meiemari, from my little experience with mlagents, I would like to suggest a few potential improvements to your setup that could improve training speed:

- The use of Docker could be a severe bottleneck in the setup. Even though Docker is fairly lightweight, containerization still adds noticeable overhead in my experience. Since you are running your experiments on AWS, why not run mlagents directly on the instance (i.e., install mlagents through pip/conda on the AWS instance itself)?
- On xvfb: since you are running your environment in no-graphics mode, you do not need xvfb. Xvfb is usually used when agents in the environment require visual observations (i.e., rendering). Moreover, even if you plan to train visual-observation-based agents in the future, you could install xvfb directly on the AWS instance, without needing Docker.
- It may not be advisable to have the number of parallel environments far exceed the number of cores. There are usually diminishing returns as the number of parallel envs grows, and too many could hurt performance.
- It might also be advisable to check that the GPU is being utilized (e.g., check nvidia-smi while training), and perhaps to try SAC instead if you are using PPO.
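
As a rough illustration of the last two suggestions, here is a minimal shell sketch. The function names `suggest_num_envs` and `gpu_status` are hypothetical helpers, not part of ML-Agents, and capping the environment count at the core count is only an assumed rule of thumb based on the comment above, not a documented recommendation.

```shell
#!/usr/bin/env bash

# Hypothetical helper: return the requested number of environments,
# capped at the number of CPU cores (assumed heuristic, not an ML-Agents rule).
# Usage: suggest_num_envs REQUESTED [CORES]
suggest_num_envs() {
  local requested="$1"
  # Fall back to nproc (GNU coreutils) when no core count is given.
  local cores="${2:-$(nproc)}"
  if [ "$requested" -gt "$cores" ]; then
    echo "$cores"
  else
    echo "$requested"
  fi
}

# Hypothetical helper: print GPU name, utilization and memory in use,
# or a notice when the NVIDIA driver tools are not installed.
gpu_status() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv
  else
    echo "nvidia-smi not found"
  fi
}
```

On a g4dn.xlarge with 4 vCPUs, `suggest_num_envs 10` would yield 4, which could then be passed to `--num-envs`. Running `gpu_status` in a second terminal while training shows whether the trainer is using the GPU at all. Treat the cap as a starting point to experiment from rather than a tuned value.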

I have limited experience in mlagents, and in machine learning in general, so please take these suggestions with a grain of salt!

doctorpangloss commented 2 years ago

Your Docker container must be started with GPU access enabled. Processes started with xvfb-run cannot use the GPU for graphics.
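
One way to start a container with GPU access is Docker's `--gpus` flag, sketched below. This assumes Docker 19.03 or newer with the NVIDIA Container Toolkit installed on the host. The function name `run_with_gpus`, the `DRY_RUN` variable, and the image name in the usage note are hypothetical and not taken from the thread.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: run a command in a container with all host GPUs exposed.
# Usage: run_with_gpus IMAGE [COMMAND...]
# Set DRY_RUN=1 to print the docker command instead of executing it.
run_with_gpus() {
  local image="$1"
  shift
  # Rebuild the argument list as the full docker invocation.
  # --gpus all needs the NVIDIA Container Toolkit on the host (assumption).
  set -- docker run --rm --gpus all "$image" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `run_with_gpus my-mlagents-image nvidia-smi` (with a placeholder image name) should list the GPU if the container can see it. If that check fails, the trainer inside the container is most likely running on the CPU only.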

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 28 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue has been automatically closed because it has not had activity in the last 42 days. If this issue is still valid, please ping a maintainer. Thank you for your contributions.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Unity-Technologies / ml-agents

How to take advantage of GPU training on a Linux Server (AWS) and Docker #5677