Avalon-Benchmark / avalon

A 3D video game environment and benchmark designed from scratch for reinforcement learning research
https://generallyintelligent.com/avalon/
GNU General Public License v3.0

How to train on GPU which does not support any of the supported OpenGL versions? #8

Closed jayyoung0802 closed 1 year ago

jayyoung0802 commented 1 year ago

Hi, my GPU is an 'A100-SXM4-40GB', and it cannot render. How can I run the training code? I tried 'LIBGL_ALWAYS_SOFTWARE=True python -m avalon.agent.train_ppo_avalon', but it doesn't work.

bawr commented 1 year ago

Hey - we're trying to test this on an A100 SXM setup (PCIe works on our end), but there are a few questions whose answers would help, too:

  1. Is this running on AWS?
  2. Are you using Docker on the host, or running the code directly?
  3. What's the Linux distro?
  4. Are all of the dependencies from this section installed?
  5. Can you provide the outputs of the following commands? (Both on the host and inside Docker if using Docker.)
    $ nvidia-smi --query-gpu=index,driver_version,name --format=csv,noheader
    $ find /usr/lib -iname '*egl*' | sort
    $ find /usr/share/glvnd | sort
    $ eglinfo
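
A minimal sketch for gathering the four outputs requested above in one go, assuming Python 3.7+ is available on the machine. The names `DIAGNOSTICS`, `run_diagnostic` and `collect_report` are hypothetical helpers, not part of Avalon. The `| sort` step is done in Python rather than through a shell pipe.

```python
import shutil
import subprocess

# The four diagnostics requested above, written as argument lists.
DIAGNOSTICS = [
    ["nvidia-smi", "--query-gpu=index,driver_version,name", "--format=csv,noheader"],
    ["find", "/usr/lib", "-iname", "*egl*"],
    ["find", "/usr/share/glvnd"],
    ["eglinfo"],
]


def run_diagnostic(argv, timeout=60):
    """Run one command and return its combined stdout/stderr as text."""
    if shutil.which(argv[0]) is None:
        # A missing tool is itself useful information for the report.
        return f"<{argv[0]} not found on PATH>"
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            errors="replace",
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"<{argv[0]} timed out after {timeout}s>"
    lines = result.stdout.splitlines()
    if argv[0] == "find":
        lines.sort()  # stands in for `| sort`
    return "\n".join(lines)


def collect_report():
    """Build one text report with each command followed by its output."""
    sections = []
    for argv in DIAGNOSTICS:
        sections.append("$ " + " ".join(argv))
        sections.append(run_diagnostic(argv))
    return "\n".join(sections)
```

Calling `print(collect_report())` on the host, and again inside the container when using Docker, should give output that can be pasted into a reply as-is.
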

bawr commented 1 year ago

Okay, rendering is confirmed to work on a GCP instance with Ubuntu 20.04.

We'll need the answers above to diagnose further - here are example outputs from a setup where rendering is known to work:

Setup:

$ apt-get update
$ apt-get install nvidia-driver-515-server libopengl-dev mesa-utils-extra
$ reboot

Tests:

$ nvidia-smi --query-gpu=index,driver_version,name --format=csv,noheader
0, 515.65.01, NVIDIA A100-SXM4-40GB
$ find /usr/lib -iname '*egl*' | sort
/usr/lib/x86_64-linux-gnu/libEGL.so.1
/usr/lib/x86_64-linux-gnu/libEGL.so.1.1.0
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
...
$ find /usr/share/glvnd | sort
/usr/share/glvnd
/usr/share/glvnd/egl_vendor.d
/usr/share/glvnd/egl_vendor.d/10_nvidia.json
...
$ eglinfo
...
Device platform:
EGL API version: 1.5
EGL vendor string: NVIDIA
EGL version string: 1.5
EGL client APIs: OpenGL_ES OpenGL
...
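
In the working setup above, glvnd has an NVIDIA vendor file (`10_nvidia.json`) registered under `/usr/share/glvnd/egl_vendor.d`. A rough way to check for that programmatically is sketched below. It assumes the vendor files follow the usual glvnd layout of an `ICD` object with a `library_path` key; `egl_vendor_libraries` and `has_nvidia_egl` are hypothetical helper names, not Avalon functions.

```python
import json
from pathlib import Path

DEFAULT_VENDOR_DIR = "/usr/share/glvnd/egl_vendor.d"


def egl_vendor_libraries(vendor_dir=DEFAULT_VENDOR_DIR):
    """Map each vendor JSON file name to the library path it declares (or None)."""
    libraries = {}
    directory = Path(vendor_dir)
    if not directory.is_dir():
        return libraries
    for path in sorted(directory.glob("*.json")):
        try:
            data = json.loads(path.read_text())
            library = data["ICD"]["library_path"]
        except (OSError, ValueError, KeyError, TypeError):
            # Unreadable or unexpected file: record it, but without a library.
            library = None
        libraries[path.name] = library if isinstance(library, str) else None
    return libraries


def has_nvidia_egl(vendor_dir=DEFAULT_VENDOR_DIR):
    """Return True if any registered EGL vendor library looks like NVIDIA's."""
    return any(
        library is not None and "nvidia" in library.lower()
        for library in egl_vendor_libraries(vendor_dir).values()
    )
```

If `has_nvidia_egl()` returns `False`, only non-NVIDIA vendors (for example Mesa) are registered, which would be consistent with `eglinfo` failing on the Device platform. This is a heuristic for narrowing things down, not a full diagnosis.
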

jayyoung0802 commented 1 year ago

Hi, I use Ubuntu 18.04.5 LTS (server version, not desktop) and run the code directly. I installed all of the dependencies except libegl-dev, libnvidia-gl and libopengl-dev.

$ nvidia-smi --query-gpu=index,driver_version,name --format=csv,noheader
0, 460.32.03, A100-SXM4-40GB

$ find /usr/lib -iname '*egl*' | sort
/usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0
/usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0.0.0
/usr/lib/x86_64-linux-gnu/libEGL.so
/usr/lib/x86_64-linux-gnu/libEGL.so.1
/usr/lib/x86_64-linux-gnu/libEGL.so.1.0.0
/usr/lib/x86_64-linux-gnu/libwayland-egl.so.1
/usr/lib/x86_64-linux-gnu/libwayland-egl.so.1.0.0

$ find /usr/share/glvnd | sort
/usr/share/glvnd
/usr/share/glvnd/egl_vendor.d
/usr/share/glvnd/egl_vendor.d/50_mesa.json

$ eglinfo
EGL client extensions string:
EGL_EXT_device_base EGL_EXT_device_enumeration EGL_EXT_device_query
EGL_EXT_platform_base EGL_KHR_client_get_all_proc_addresses
EGL_EXT_client_extensions EGL_KHR_debug EGL_EXT_platform_wayland
EGL_EXT_platform_x11 EGL_MESA_platform_gbm
EGL_MESA_platform_surfaceless EGL_EXT_platform_device
GBM platform:
eglinfo: eglInitialize failed
Wayland platform:
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
eglinfo: eglInitialize failed
X11 platform:
eglinfo: eglInitialize failed
Device platform:
eglinfo: eglInitialize failed

jayyoung0802 commented 1 year ago

When I try to install libopengl-dev:

$ sudo apt-get install libopengl-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package libopengl-dev

My GPU driver version is fixed (I cannot change it), so I cannot install nvidia-driver-515-server either.

jayyoung0802 commented 1 year ago

Hi, I tried using Docker to train PPO on the A100, but it doesn't work:

[image: screenshot of the error]

bawr commented 1 year ago

> I installed all of the dependencies except libegl-dev, libnvidia-gl and libopengl-dev.

Is there any specific reason installing libnvidia-gl is not an option?

That's a hard requirement for being able to do headless rendering - this set of requirements got us working on 18.04 in the past:

apt-get install mesa-utils-extra libglvnd-dev libnvidia-gl-460 --no-install-recommends
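
A small sketch of picking the matching package when the driver version cannot be changed. It assumes the `libnvidia-gl` package suffix is the driver's major version, as with `libnvidia-gl-460` for driver 460.32.03 in this thread. Server-branch packages and other distros may be named differently and are not handled. Both function names are hypothetical.

```python
def first_driver_version(nvidia_smi_csv: str) -> str:
    """Extract driver_version from the first 'index, driver_version, name' line."""
    lines = nvidia_smi_csv.strip().splitlines()
    if not lines:
        raise ValueError("empty nvidia-smi output")
    fields = lines[0].split(",")
    if len(fields) < 2:
        raise ValueError(f"unexpected nvidia-smi line: {lines[0]!r}")
    return fields[1].strip()


def libnvidia_gl_package(driver_version: str) -> str:
    """Guess the libnvidia-gl package name for a driver version string.

    Assumption: the suffix equals the driver's major version number.
    """
    major = driver_version.strip().split(".")[0]
    if not major.isdigit():
        raise ValueError(f"unrecognized driver version: {driver_version!r}")
    return f"libnvidia-gl-{major}"


# Using the nvidia-smi line reported earlier in this thread:
print(libnvidia_gl_package(first_driver_version("0, 460.32.03, A100-SXM4-40GB")))
# prints: libnvidia-gl-460
```

Whether that package is actually available for a given distro release still needs to be confirmed with the package manager.
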

> I tried using Docker to train PPO on the A100, but it doesn't work.

Great - this is actually a bug we've seen before, but were unable to reproduce - can you send us some details about your host system and the Nvidia driver version? (Basically, Godot is crashing when trying to exit due to trying to free some memory twice.)

jayyoung0802 commented 1 year ago

Hi, I successfully used Docker to train PPO with RLlib. As the notes say, RLlib is slower to train. When I use the Avalon RL library to train PPO, I hit the bug shown above. My host system is:

$ cat /proc/version
Linux version 3.10.0-957.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Thu Nov 8 23:39:32 UTC 2018

I built the Docker image from the Dockerfile you provide. The Nvidia driver version is 460.32.03.

mx781 commented 1 year ago

As @bawr mentioned above, installing the GL packages is required for Avalon to run. As for the error in the screenshot - please open another issue with more details about your host system and the Nvidia driver version. Thanks!