BNLNPPS / esi-shell

Apache License 2.0

Problem with portability of esi-shell images built on dedicated GPU #87

Closed: plexoos closed 3 months ago

plexoos commented 3 months ago

Currently, the official images are built on a system with NVIDIA GPUs:

npps0

npps0$ nvidia-smi   
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:16:00.0 Off |                  Off |
|  0%   38C    P8             18W /  450W |     132MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:34:00.0 Off |                  Off |
|  0%   30C    P8             10W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

$ docker --version
Docker version 26.1.3, build b72abbb

However, running the images fails on the following test systems:

lambda

$ nvidia-smi  
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro GV100        On   | 00000000:01:00.0 Off |                  Off |
| 30%   42C    P2    25W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

$ docker --version
Docker version 26.1.3, build b72abbb

onyx

$ nvidia-smi    
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro M4000                   Off | 00000000:03:00.0 Off |                  N/A |
| 46%   36C    P8              11W / 120W |    639MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

$ docker --version
Docker version 24.0.5, build 24.0.5-0ubuntu1~20.04.1

plexoos commented 3 months ago

For CUDA and NVIDIA driver compatibility see https://docs.nvidia.com/deploy/cuda-compatibility/
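
As a rough sanity check, the "CUDA Version" field printed by nvidia-smi is the newest CUDA runtime the installed driver supports, so a container should be able to use the GPU when that value is at least the CUDA version of the image. The sketch below compares the values reported in this issue against the 11.8.0 printed by the image banner. The function name `driver_supports` is hypothetical, and the check ignores the forward-compatibility packages described on the linked page.

```shell
# Succeeds when the driver's maximum CUDA version (from nvidia-smi)
# is at least the CUDA version of the container image.
# Only major.minor are compared; a patch component is ignored.
driver_supports() (
  driver_max=$1          # e.g. 12.2
  image_cuda=$2          # e.g. 11.8.0
  d_major=${driver_max%%.*}
  d_rest=${driver_max#*.}
  d_minor=${d_rest%%.*}
  i_major=${image_cuda%%.*}
  i_rest=${image_cuda#*.}
  i_minor=${i_rest%%.*}
  [ "$d_major" -gt "$i_major" ] ||
    { [ "$d_major" -eq "$i_major" ] && [ "$d_minor" -ge "$i_minor" ]; }
)

# Host names and CUDA versions are taken from the nvidia-smi output in this issue.
for entry in npps0:12.5 lambda:11.8 onyx:12.2; do
  host=${entry%%:*}
  max=${entry#*:}
  if driver_supports "$max" 11.8.0; then
    echo "$host: ok"
  else
    echo "$host: driver too old"
  fi
done
```

All three hosts pass this check, which suggests that the driver version alone may not explain the failures.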

plexoos commented 3 months ago

Downgraded the NVIDIA driver on npps0 from 555.42.02 to 530.30.02:

$ nvidia-smi
Tue Jun 18 12:10:48 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off| 00000000:16:00.0 Off |                  Off |
|  0%   35C    P8               12W / 450W|    108MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090         Off| 00000000:34:00.0 Off |                  Off |
|  0%   27C    P8               10W / 450W|      1MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

plexoos commented 3 months ago

How to reproduce:

./esi-shell -v 1.0.0-beta.13 "opticks-full-prepare && opticks-t"

...
=== opticks-setup-geant4- : sourcing /opt/spack/opt/spack/linux-ubuntu22.04-sapphirerapids/gcc-11.4.0/geant4-11.1.2-z4zrvgbct5pf2uhyrxf7xlo5mjalfiwf/./bin/geant4.sh
=== om-test-one : okconf          /esi/opticks/okconf                                          /usr/local/opticks/build/okconf                              
Wed Jun 19 00:33:44 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:44 UTC 2024
=== om-test-one : sysrap          /esi/opticks/sysrap                                          /usr/local/opticks/build/sysrap                              
Wed Jun 19 00:33:44 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:44 UTC 2024
=== om-test-one : ana             /esi/opticks/ana                                             /usr/local/opticks/build/ana                                 
Wed Jun 19 00:33:44 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:44 UTC 2024
=== om-test-one : analytic        /esi/opticks/analytic                                        /usr/local/opticks/build/analytic                            
Wed Jun 19 00:33:45 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:45 UTC 2024
=== om-test-one : bin             /esi/opticks/bin                                             /usr/local/opticks/build/bin                                 
Wed Jun 19 00:33:45 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:45 UTC 2024
=== om-test-one : CSG             /esi/opticks/CSG                                             /usr/local/opticks/build/CSG                                 
Wed Jun 19 00:33:45 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:45 UTC 2024
=== om-test-one : qudarap         /esi/opticks/qudarap                                         /usr/local/opticks/build/qudarap                             
Wed Jun 19 00:33:45 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:45 UTC 2024
=== om-test-one : gdxml           /esi/opticks/gdxml                                           /usr/local/opticks/build/gdxml                               
Wed Jun 19 00:33:45 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:45 UTC 2024
=== om-test-one : u4              /esi/opticks/u4                                              /usr/local/opticks/build/u4                                  
Wed Jun 19 00:33:45 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:45 UTC 2024
=== om-test-one : CSGOptiX        /esi/opticks/CSGOptiX                                        /usr/local/opticks/build/CSGOptiX                            
Wed Jun 19 00:33:45 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:45 UTC 2024
=== om-test-one : g4cx            /esi/opticks/g4cx                                            /usr/local/opticks/build/g4cx                                
Wed Jun 19 00:33:45 UTC 2024
ctest --interactive-debug-mode 0 --output-on-failure
Wed Jun 19 00:33:45 UTC 2024
...

The tests appear to be skipped when running on lambda or onyx: in the log above, each ctest call returns within the same second and prints no test results.

plexoos commented 3 months ago

Another, more targeted test runs just the cmake command:

dsmirnov@lambda1:~/test$ ./esi-shell -v 1.0.0-beta.13 "cmake --help"
==> Using esi-shell image: ghcr.io/bnlnpps/esi-shell:1.0.0-beta.13

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

dsmirnov@lambda1:~/test$ 

The command produces no output. Running cmake interactively gives:

dsmirnov@lambda1:~/test$ ./esi-shell -v 1.0.0-beta.13
==> Using esi-shell image: ghcr.io/bnlnpps/esi-shell:1.0.0-beta.13

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

root@e1c3e924fafd:~# cmake
Illegal instruction (core dumped)
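
An "Illegal instruction" crash usually means the binary uses CPU instructions that the host does not implement. The Spack install path in the log above contains `sapphirerapids`, so the image was likely built for that microarchitecture. The sketch below lists which of a few instruction-set flags are absent on the current host. The helper name `missing_flags` is hypothetical, and the flag list is an illustrative assumption rather than the exact set Spack enables for this target.

```shell
# Print the flags from "$2..." that do not appear in the
# space-separated flag string "$1".
missing_flags() (
  have=" $1 "
  shift
  missing=""
  for flag in "$@"; do
    case "$have" in
      *" $flag "*) ;;                                   # present on the host
      *) missing="${missing:+$missing }$flag" ;;        # absent
    esac
  done
  printf '%s\n' "$missing"
)

# On Linux, the host's CPU flags are listed in /proc/cpuinfo.
host_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2 || true)

# Assumed sample of flags associated with Sapphire Rapids class CPUs.
echo "missing: $(missing_flags "$host_flags" avx2 avx512f avx512_vnni amx_tile)"
```

Running this on the build host and on a failing test host would show whether the two CPUs differ in the instructions they support.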

plexoos commented 3 months ago

We need to check the effect of setting the Spack target to a generic microarchitecture. See https://spack.readthedocs.io/en/latest/build_settings.html
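
One way to try this is a small Spack configuration scope that requires a generic target for every package. This is a minimal sketch, not necessarily the change that was eventually adopted: the scope directory name is hypothetical, and `x86_64_v3` is an illustrative choice (plain `x86_64` is the most conservative option).

```shell
# Hypothetical scope directory; any writable path works.
scope_dir="${SCOPE_DIR:-./spack-generic-scope}"
mkdir -p "$scope_dir"

# Require a generic x86_64 microarchitecture level for all packages
# instead of the build host's native target.
cat > "$scope_dir/packages.yaml" <<'EOF'
packages:
  all:
    require: 'target=x86_64_v3'
EOF

cat "$scope_dir/packages.yaml"

# Usage on a machine with Spack installed (not run here):
#   spack arch --known-targets                        # list targets Spack knows
#   spack --config-scope "$scope_dir" spec geant4     # inspect the concretized target
```

The same constraint can also be given per command, for example by appending `target=x86_64_v3` to a spec.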

plexoos commented 3 months ago

Fixed by #90

plexoos commented 3 months ago

Our test nodes belong to the Maxwell (sm_52), Volta (sm_70), and Ada (sm_89) generations.
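
If CUDA code in the image has to run on all three generations, one option is to compile device code for each compute capability. The sketch below only builds the corresponding nvcc `-gencode` flags. The helper name `gencode_flags` is hypothetical, and whether the esi-shell or Opticks build passes flags this way is an assumption not confirmed in this thread.

```shell
# Build one "-gencode arch=compute_XX,code=sm_XX" pair per
# compute capability given as an argument (e.g. 52 70 89).
gencode_flags() (
  flags=""
  for cc in "$@"; do
    flags="${flags:+$flags }-gencode arch=compute_${cc},code=sm_${cc}"
  done
  printf '%s\n' "$flags"
)

# Compute capabilities of the Maxwell, Volta, and Ada test nodes.
gencode_flags 52 70 89
```

With CMake, the equivalent setting would be a list such as `CMAKE_CUDA_ARCHITECTURES="52;70;89"`.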