Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

SLURM : vglrun create a segmentation fault when running glxspheres64 #1232

Open xpillons opened 1 year ago

xpillons commented 1 year ago

When running glxspheres64 with vglrun on a shared node, a segmentation fault is generated, while running it without vglrun works.

```
[adminuser@largeviz3d-1 ~]$ vglrun /opt/VirtualGL/bin/glxspheres64
Polygons in scene: 62464 (61 spheres * 1024 polys/spheres)
GLX FB config ID of window: 0x396 (8/8/8/0)
Visual ID of window: 0x21
Segmentation fault (core dumped)
[adminuser@largeviz3d-1 ~]$ /usr/bin/vglrun /opt/VirtualGL/bin/glxspheres64
Polygons in scene: 62464 (61 spheres * 1024 polys/spheres)
GLX FB config ID of window: 0x7d (8/8/8/0)
Visual ID of window: 0x21
Segmentation fault (core dumped)
[adminuser@largeviz3d-1 ~]$ /opt/VirtualGL/bin/glxspheres64
Polygons in scene: 62464 (61 spheres * 1024 polys/spheres)
GLX FB config ID of window: 0x119 (8/8/8/0)
Visual ID of window: 0x2da
Context is Direct
OpenGL Renderer: llvmpipe (LLVM 7.0, 256 bits)
36.331608 frames/sec - 40.546075 Mpixels/sec
35.739210 frames/sec - 39.884958 Mpixels/sec
35.694546 frames/sec - 39.835113 Mpixels/sec
36.036118 frames/sec - 40.216308 Mpixels/sec
```
xpillons commented 6 months ago

The reason is that Slurm restricts the number of GPUs based on the `--gpus` job option, meaning that the number of GPUs visible inside a job context can be smaller than the number of GPU devices on the node. The `vglrun` alias uses the count of NVIDIA device files to pick a GPU index, which is wrong under Slurm.
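A minimal sketch of the failure mode and a safer alternative, assuming a hypothetical wrapper (the real az-hop alias may differ): instead of counting `/dev/nvidia*` device files, which ignores Slurm's GPU restriction, derive the GPU index from what Slurm actually granted the job (e.g. `CUDA_VISIBLE_DEVICES`), falling back to device counting only outside a job.

```shell
#!/bin/bash
# Hypothetical sketch of a Slurm-aware GPU selection for a vglrun wrapper.
# Names and env vars other than CUDA_VISIBLE_DEVICES handling are assumptions.

pick_gpu() {
  # Inside a Slurm job, CUDA_VISIBLE_DEVICES lists only the allocated GPUs
  # (e.g. "2,3" on an 8-GPU node when --gpus=2 was requested).
  local visible="${CUDA_VISIBLE_DEVICES:-}"
  if [ -n "$visible" ]; then
    # Use the first GPU Slurm granted to this job.
    echo "${visible%%,*}"
  else
    # Fallback outside a job: count device files, as the old alias did.
    # This is exactly what breaks under Slurm, since devices may exist
    # on the node that the job is not allowed to touch.
    local n
    n=$(ls /dev/nvidia[0-9]* 2>/dev/null | wc -l)
    if [ "$n" -gt 0 ]; then
      echo $(( RANDOM % n ))
    else
      echo 0
    fi
  fi
}

gpu=$(pick_gpu)
echo "selected GPU index: ${gpu}"
```

The key design point is that the wrapper never indexes past what the job can see: when Slurm constrains the job to a subset of GPUs, a random index computed from the full device count can land on a GPU outside the job's cgroup, which is consistent with vglrun crashing only in the job context.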