conda-forge / libhwloc-feedstock

A conda-smithy repository for libhwloc.
BSD 3-Clause "New" or "Revised" License

rocm support enabled in nocuda builds and installed in non-AMD machines? #64

Closed traversaro closed 10 months ago

traversaro commented 1 year ago

Solution to issue cannot be found in the documentation.

Issue

Since https://github.com/conda-forge/libhwloc-feedstock/pull/62 was merged, every program that uses hwloc prints an error message about a failed rocm initialization when it is installed and run on a machine without any cuda or rocm graphics card.

See for example hwloc-ls:

(libhwloc) traversaro@IITICUBLAP257:~/mambaforge/envs/libhwloc/include$ hwloc-ls
Exception caught: rsmi_init.
hwloc/rsmi: Failed to initialize with rsmi_init(): RSMI_STATUS_INIT_ERROR: An error occurred during initialization, during monitor discovery or when when initializing internal data structures
Machine (15GB total)
  Package L#0
    NUMANode L#0 (P#0 15GB)
    L3 L#0 (12MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
  HostBridge
    PCI 022f:00:00.0 (3D)
  HostBridge
    PCI 1fcc:00:00.0 (3D)
  HostBridge
    PCI 2564:00:00.0 (SCSI)
  HostBridge
    PCI 50c2:00:00.0 (SCSI)
  HostBridge
    PCI 968a:00:00.0 (SCSI)
  HostBridge
    PCI e0bd:00:00.0 (SCSI)
  Block(Disk) "sdb"
  Block(Disk) "sdc"
  Block(Disk) "sda"
  Net "eth0"

or hwloc-info:

 (libhwloc) traversaro@IITICUBLAP257:~$ hwloc-info
Exception caught: rsmi_init.
depth 0:           1 Machine (type #0)
 depth 1:          1 Package (type #1)
  depth 2:         1 L3Cache (type #6)
   depth 3:        6 L2Cache (type #5)
    depth 4:       6 L1dCache (type #4)
     depth 5:      6 L1iCache (type #9)
      depth 6:     6 Core (type #2)
       depth 7:    12 PU (type #3)
Special depth -3:  1 NUMANode (type #13)
Special depth -4:  6 Bridge (type #14)
Special depth -5:  6 PCIDev (type #15)
Special depth -6:  4 OSDev (type #16)

The return code of the program is still 0 (i.e. success), but I was wondering whether this is intended behaviour, as it may be confusing for users.
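To check this from a script rather than by eye, something like the following rough sketch could be used. It runs an hwloc tool, records the exit code, and collects any output lines that mention the rsmi failure. The helper names `rsmi_noise` and `check_tool` are made up for illustration, and since I have not checked whether the messages go to stdout or stderr, the sketch simply scans both.

```python
import shutil
import subprocess

# Substrings taken from the messages shown in the logs above.
RSMI_MARKERS = ("rsmi_init", "hwloc/rsmi")


def rsmi_noise(output: str) -> list[str]:
    """Return the lines of `output` that mention the rsmi initialization failure."""
    return [
        line
        for line in output.splitlines()
        if any(marker in line for marker in RSMI_MARKERS)
    ]


def check_tool(tool: str = "hwloc-ls"):
    """Run `tool` and return (exit code, rsmi-related lines).

    Returns None if the tool is not on PATH. Both stdout and stderr are
    scanned, because which stream carries the message is an assumption
    that has not been verified here.
    """
    exe = shutil.which(tool)
    if exe is None:
        return None
    proc = subprocess.run([exe], capture_output=True, text=True)
    return proc.returncode, rsmi_noise(proc.stdout + proc.stderr)
```

On an affected environment, `check_tool("hwloc-ls")` would be expected to return an exit code of 0 together with a non-empty list of lines, which is exactly the confusing combination described above.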

Installed packages

(libhwloc) traversaro@IITICUBLAP257:~/mambaforge/envs/libhwloc/include$ conda list
# packages in environment at /home/traversaro/mambaforge/envs/libhwloc:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
icu                       72.1                 hcb278e6_0    conda-forge
libgcc-ng                 13.1.0               he5830b7_0    conda-forge
libgomp                   13.1.0               he5830b7_0    conda-forge
libhwloc                  2.9.2           nocuda_h7313eea_1008    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
libstdcxx-ng              13.1.0               hfd8a6a1_0    conda-forge
libxml2                   2.11.5               h0d562d8_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
rocm-smi                  5.6.0                h59595ed_1    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Environment info

(libhwloc) traversaro@IITICUBLAP257:~/mambaforge/envs/libhwloc/include$ conda info

     active environment : libhwloc
    active env location : /home/traversaro/mambaforge/envs/libhwloc
            shell level : 1
       user config file : /home/traversaro/.condarc
 populated config files : /home/traversaro/mambaforge/.condarc
                          /home/traversaro/.condarc
          conda version : 23.3.1
    conda-build version : 3.25.0
         python version : 3.10.10.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.2=0
                          __glibc=2.35=0
                          __linux=5.15.90.2=0
                          __unix=0=0
       base environment : /home/traversaro/mambaforge  (writable)
      conda av data dir : /home/traversaro/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/traversaro/mambaforge/pkgs
                          /home/traversaro/.conda/pkgs
       envs directories : /home/traversaro/mambaforge/envs
                          /home/traversaro/.conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.31.0 CPython/3.10.10 Linux/5.15.90.2-microsoft-standard-WSL2 ubuntu/22.04.2 glibc/2.35
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

jan-janssen commented 1 year ago

@isuruf Should we add a separate rocm build and have the default variant built without rocm?

traversaro commented 1 year ago

A side effect of this is that in some cases downstream projects link the rocm_smi64 library, because the .pc file for nocuda builds is:

prefix=/home/traversaro/mambaforge/envs/libhwloc
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: hwloc
Description: Hardware locality detection and management library
Version: 2.9.2
Requires.private: libxml-2.0
Cflags: -I${includedir}
Libs: -L${libdir} -lhwloc
Libs.private: -lm  -lrocm_smi64 -L/home/traversaro/mambaforge/envs/libhwloc/lib -lxml2    -lpthread

That said, this .pc file is correct for rocm-enabled builds. The underlying problem is https://github.com/conda-forge/conda-forge.github.io/issues/1880, and the proper solution is to start using pkgconf in place of pkg-config.
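To make the distinction in the .pc file easier to see, here is a small sketch that splits the `Libs` and `Libs.private` fields into the libraries they name. By pkg-config convention, `Libs` lists what consumers link against, while `Libs.private` lists libraries the package itself needs and that are normally only relevant for static linking. The parser below is a deliberately simplified illustration (the function names are hypothetical, and it does not expand `${...}` variables or follow `Requires`); it is not how pkg-config or pkgconf are implemented.

```python
import shlex


def parse_pc_fields(text: str) -> dict[str, str]:
    """Collect the 'Key: value' fields of a .pc file, skipping 'var=value' lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Variable definitions such as 'prefix=/some/path' have no field key.
        if sep and key.strip() and "=" not in key:
            fields[key.strip()] = value.strip()
    return fields


def linked_libs(flags: str) -> list[str]:
    """Return the library names requested via -l flags, ignoring -L search paths."""
    return [tok[2:] for tok in shlex.split(flags) if tok.startswith("-l")]


# Abbreviated copy of the hwloc.pc content quoted above.
SAMPLE_PC = """\
prefix=/home/traversaro/mambaforge/envs/libhwloc
libdir=${exec_prefix}/lib

Name: hwloc
Libs: -L${libdir} -lhwloc
Libs.private: -lm  -lrocm_smi64 -L/home/traversaro/mambaforge/envs/libhwloc/lib -lxml2    -lpthread
"""

fields = parse_pc_fields(SAMPLE_PC)
public_libs = linked_libs(fields["Libs"])
private_libs = linked_libs(fields["Libs.private"])
print("public:", public_libs)
print("private:", private_libs)
```

With the file above, `rocm_smi64` only appears among the private libraries, so a downstream project should only end up linking it if its build asks for the private flags as well.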

isuruf commented 1 year ago

Yeah, a separate build with rocm makes sense

traversaro commented 1 year ago

fyi @fl-ferr

traversaro commented 1 year ago

In some internal workflows, @fl-ferr and I started seeing errors like:

root@0b32781a2be8:/# python -m rl_zoo3.train --algo sac --env Pendulum-v1 --track
2023-08-29 13:47:39.614246: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-29 13:47:39.649210: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-29 13:47:39.649565: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-29 13:47:40.329235: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
========== Pendulum-v1 ==========
Seed: 2454984228
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
cat: /sys/module/amdgpu/initstate: No such file or directory
ERROR:root:Driver not initialized (amdgpu not found in modules)

That went away as soon as rocm-smi was uninstalled. I am not sure what actually triggered this error, but since the consensus was that a separate rocm variant made sense anyway, I implemented it in https://github.com/conda-forge/libhwloc-feedstock/pull/66 .
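For anyone who wants to check whether one of their environments is affected, a rough sketch along these lines may help. It reads the JSON output of `conda list` and reports the libhwloc build string together with whether rocm-smi is installed. The helper names are hypothetical, and the sketch assumes the `name` and `build_string` keys that `conda list --json` emits for each package record.

```python
import json
import subprocess


def hwloc_rocm_status(records: list[dict]):
    """Return (libhwloc build string or None, whether rocm-smi is installed)."""
    build = None
    has_rocm_smi = False
    for rec in records:
        if rec.get("name") == "libhwloc":
            build = rec.get("build_string")
        elif rec.get("name") == "rocm-smi":
            has_rocm_smi = True
    return build, has_rocm_smi


def list_env(env_name: str) -> list[dict]:
    """Return the package records of a conda environment (requires conda on PATH)."""
    out = subprocess.run(
        ["conda", "list", "--name", env_name, "--json"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return json.loads(out)
```

For the environment listed earlier in this issue, `hwloc_rocm_status(list_env("libhwloc"))` would be expected to report a `nocuda_...` build string alongside an installed rocm-smi, which is the combination that produced the messages.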

traversaro commented 10 months ago

Fixed by https://github.com/conda-forge/libhwloc-feedstock/pull/66 .