Closed meiemari closed 2 years ago
Additional information: I am using vector observations only, but training in the cloud is still much slower than on my local laptop. Any hints on how to solve this riddle are appreciated :) I am running headless, using xvfb (Linux server with Docker). I have read through older posts, and it seems that xvfb could be a bottleneck: https://github.com/Unity-Technologies/ml-agents/issues/1846.
Is there an alternative to xvfb that I can use now?
Hey @meiemari, from my limited experience with mlagents, I would like to suggest a few changes to your setup that could improve training speed:
- In no-graphics mode, you do not need xvfb. Xvfb is usually needed when agents in the environment require visual observations (i.e., rendering). Moreover, even if you plan to train agents on visual observations in the future, you could install xvfb directly on the AWS instance, without the need to use Docker.
- Check GPU usage (e.g., with nvidia-smi while training), and perhaps try SAC instead if you were using PPO.

I have limited experience with mlagents, and with machine learning in general, so please take these suggestions with a grain of salt!
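As a rough sketch of the two suggestions above (no xvfb wrapper, and watching the GPU during training), something like the following could be used. The training flags are copied from the original post; the helper names `train_cmd` and `watch_gpu` and the run id are made up for illustration, and the 5-second sampling interval is an arbitrary choice.

```shell
# Print the training command from the original post, without an xvfb-run wrapper.
# With --no-graphics the Unity player does not render, so no X server should be needed.
train_cmd() {
  run_id="$1"  # run id supplied by the caller
  printf 'mlagents-learn trainer.yml --env=env-dist/UnityApp.x86_64 --run-id=%s --force --debug --no-graphics --num-envs=10\n' "$run_id"
}

# Sample GPU utilization and memory use every 5 seconds (requires the NVIDIA driver).
watch_gpu() {
  nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5
}

# Usage on the training host (left commented out so that sourcing this file has no side effects):
# watch_gpu &                     # sample in the background
# eval "$(train_cmd my-run-01)"   # hypothetical run id
```

If utilization stays near zero during training, the trainer is probably not using the GPU at all, which would be worth ruling out before changing anything else.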
Your Docker container must be started with GPUs. Processes started with xvfb-run cannot use the GPU for graphics.
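For reference, starting a container with GPU access might look like the sketch below. It assumes Docker 19.03 or newer with the NVIDIA Container Toolkit installed on the host. The image name is a placeholder and not something mentioned in this thread, and `docker_gpu_cmd` is a hypothetical helper that only prints the command.

```shell
# Print a docker command that exposes all host GPUs to the container and runs
# nvidia-smi inside it as a quick visibility check.
docker_gpu_cmd() {
  image="$1"  # placeholder image name, not from the thread
  printf 'docker run --rm --gpus all %s nvidia-smi\n' "$image"
}

# Usage (commented out so that sourcing this file does not start a container):
# eval "$(docker_gpu_cmd my-mlagents-image)"
```

If `nvidia-smi` lists the GPU from inside the container, the same `--gpus all` flag can be added to the command that starts the training container. This only confirms that the GPU is visible for compute; it does not change the point above about xvfb-run and graphics.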
Hi team,
I am running a training via Docker on a Linux Server (headless), like this:
mlagents-learn trainer.yml --env=env-dist/UnityApp.x86_64 --run-id=$RUN_ID --force --debug --no-graphics --num-envs=10
Training takes significantly longer (sometimes 2-3 times as long) than training locally on a Windows machine (no Docker).
I am training in the cloud on AWS G instances, at minimum a g4dn.xlarge (1 GPU, 4 vCPUs, 16 GB memory). My local laptop beats the server every time!
Am I missing something?
My Unity setup:
Unity version: 2020.3.20f1
ML Agents version: 1.0.8

Thanks in advance!