Closed jayyoung0802 closed 1 year ago
Hey - we're trying to test this on an A100 SXM setup (PCIe works on our end), but there are a few questions whose answers would help, too:
$ nvidia-smi --query-gpu=index,driver_version,name --format=csv,noheader
$ find /usr/lib -iname '*egl*' | sort
$ find /usr/share/glvnd | sort
$ eglinfo
Okay, rendering is confirmed to work on a GCP instance with Ubuntu 20.04.
We'll need the answers above to diagnose further - here are example outputs from a setup where rendering is known to work:
Setup:
$ apt-get update
$ apt-get install nvidia-driver-515-server libopengl-dev mesa-utils-extra
$ reboot
Tests:
$ nvidia-smi --query-gpu=index,driver_version,name --format=csv,noheader
0, 515.65.01, NVIDIA A100-SXM4-40GB
$ find /usr/lib -iname '*egl*' | sort
/usr/lib/x86_64-linux-gnu/libEGL.so.1
/usr/lib/x86_64-linux-gnu/libEGL.so.1.1.0
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
...
$ find /usr/share/glvnd | sort
/usr/share/glvnd
/usr/share/glvnd/egl_vendor.d
/usr/share/glvnd/egl_vendor.d/10_nvidia.json
...
$ eglinfo
...
Device platform:
EGL API version: 1.5
EGL vendor string: NVIDIA
EGL version string: 1.5
EGL client APIs: OpenGL_ES OpenGL
...
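To compare another machine against the known-working output above, a minimal sketch like the following can check whether an NVIDIA EGL vendor file is registered with glvnd. The function name `has_nvidia_egl_vendor` is hypothetical, and the check assumes the vendor file name contains `nvidia` (as in `10_nvidia.json` above). It is only a quick sanity check, not a full diagnosis.

```shell
# Return 0 if an NVIDIA EGL vendor file exists under the given glvnd directory.
# Assumption: the vendor file name contains "nvidia" (e.g. 10_nvidia.json).
has_nvidia_egl_vendor() {
    glvnd_dir="${1:-/usr/share/glvnd}"
    for f in "$glvnd_dir"/egl_vendor.d/*nvidia*.json; do
        # If the glob matched nothing, it stays literal and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

if has_nvidia_egl_vendor; then
    echo "NVIDIA EGL vendor file found"
else
    echo "No NVIDIA EGL vendor file found under /usr/share/glvnd"
fi
```

If the second message is printed, the `find /usr/share/glvnd | sort` output will most likely list only other vendors, such as Mesa.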
Hey - we're trying to test this on an A100 SXM setup (PCIe works on our end), but there are a few questions whose answers would help, too:
- Is this running on AWS?
- Are you using Docker on the host, or running the code directly?
- What's the Linux distro?
- Are all of the dependencies from this section installed?
- Can you provide the outputs of the following commands? (Both on the host and inside Docker if using Docker.)
$ nvidia-smi --query-gpu=index,driver_version,name --format=csv,noheader
$ find /usr/lib -iname '*egl*' | sort
$ find /usr/share/glvnd | sort
$ eglinfo
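Since the outputs are needed both on the host and inside Docker, it may be easier to capture all four commands into one file per environment. The sketch below is one way to do that; the function name `collect_diag` and the output file names are hypothetical, and a command that is missing or fails is recorded with its exit status instead of aborting the run.

```shell
# Run the four diagnostic commands and write each command line plus its
# output (stdout and stderr) to the file given as the first argument.
collect_diag() {
    out_file="$1"
    : > "$out_file"
    for cmd in \
        "nvidia-smi --query-gpu=index,driver_version,name --format=csv,noheader" \
        "find /usr/lib -iname '*egl*' | sort" \
        "find /usr/share/glvnd | sort" \
        "eglinfo"
    do
        printf '$ %s\n' "$cmd" >> "$out_file"
        # Record a non-zero exit status instead of stopping at the first failure.
        sh -c "$cmd" >> "$out_file" 2>&1 \
            || printf '(exit status %s)\n' "$?" >> "$out_file"
    done
}

# Example: collect_diag host_diag.txt       (on the host)
#          collect_diag docker_diag.txt     (inside the container)
```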
Hi, I use Ubuntu 18.04.5 LTS (server version, not desktop) and run the code directly. I installed all of the dependencies except libegl-dev, libnvidia-gl and libopengl-dev.
$ nvidia-smi --query-gpu=index,driver_version,name --format=csv,noheader
0, 460.32.03, A100-SXM4-40GB
$ find /usr/lib -iname '*egl*' | sort
/usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0
/usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0
/usr/lib/x86_64-linux-gnu/libEGL.so
/usr/lib/x86_64-linux-gnu/libEGL.so.1
/usr/lib/x86_64-linux-gnu/libEGL.so.1.0.0
/usr/lib/x86_64-linux-gnu/libwayland-egl.so.1
/usr/lib/x86_64-linux-gnu/libwayland-egl.so.1.0.0
$ find /usr/share/glvnd | sort
/usr/share/glvnd
/usr/share/glvnd/egl_vendor.d
/usr/share/glvnd/egl_vendor.d/50_mesa.json
$ eglinfo
EGL client extensions string:
EGL_EXT_device_base EGL_EXT_device_enumeration EGL_EXT_device_query EGL_EXT_platform_base EGL_KHR_client_get_all_proc_addresses EGL_EXT_client_extensions EGL_KHR_debug EGL_EXT_platform_wayland EGL_EXT_platform_x11 EGL_MESA_platform_gbm EGL_MESA_platform_surfaceless EGL_EXT_platform_device
GBM platform:
eglinfo: eglInitialize failed
Wayland platform:
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
eglinfo: eglInitialize failed
X11 platform:
eglinfo: eglInitialize failed
Device platform:
eglinfo: eglInitialize failed
When I try to install libopengl-dev, the package cannot be found:
$ sudo apt-get install libopengl-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package libopengl-dev
Also, my GPU driver version is fixed (I cannot change it), so I cannot install nvidia-driver-515-server either.
Hi, I use Docker to train PPO on an A100, but it doesn't work.
I installed all of the dependencies except libegl-dev, libnvidia-gl and libopengl-dev.
Is there any specific reason installing libnvidia-gl is not an option?
That's a hard requirement for being able to do headless rendering - this set of requirements got us working on 18.04 in the past:
apt-get install mesa-utils-extra libglvnd-dev libnvidia-gl-460 --no-install-recommends
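When the driver version is pinned, the GL package has to match it. The sketch below derives the package name from the driver version, assuming the `libnvidia-gl-<major>` naming seen in the command above (460.32.03 maps to `libnvidia-gl-460`). The helper name `libnvidia_gl_package` is hypothetical, and whether that package is actually available in the configured repositories is not guaranteed.

```shell
# Map a driver version such as 460.32.03 to a package name such as
# libnvidia-gl-460. Assumption: the package suffix is the driver's major version.
libnvidia_gl_package() {
    driver_version="$1"
    # ${var%%.*} strips everything from the first dot onwards.
    printf 'libnvidia-gl-%s\n' "${driver_version%%.*}"
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    echo "Suggested package: $(libnvidia_gl_package "$driver_version")"
fi
```

The suggested name can then be substituted for `libnvidia-gl-460` in the install command.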
I use Docker to train PPO on an A100, but it doesn't work.
Great - this is actually a bug we've seen before, but were unable to reproduce - can you send us some details about your host system and the Nvidia driver version? (Basically, Godot is crashing when trying to exit due to trying to free some memory twice.)
Hi, I can successfully use Docker to train PPO with RLlib. As written in the notes, RLlib is slower to train. When I use the Avalon RL library to train PPO, I hit the bug shown above. My host system is:
$ cat /proc/version
Linux version 3.10.0-957.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Thu Nov 8 23:39:32 UTC 2018
I built the Docker image from the Dockerfile you provide. The NVIDIA driver version is 460.32.03.
As @bawr mentioned above, installing the GL packages is required for Avalon to run. As for the error in the screenshot - please open another issue with more details about your host system and the NVIDIA driver version. Thanks!
Hi, my GPU is an 'A100-SXM4-40GB', and it cannot render. How can I run the training code? I tried 'LIBGL_ALWAYS_SOFTWARE=True python -m avalon.agent.train_ppo_avalon', but it doesn't work.