traversaro closed this issue 10 months ago
@isuruf Should we add a separate rocm build and have the default version build without rocm?
A side effect of this is that in some cases downstream projects are linking the rocm_smi64 library, as the .pc file for nocuda builds is:
prefix=/home/traversaro/mambaforge/envs/libhwloc
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
Name: hwloc
Description: Hardware locality detection and management library
Version: 2.9.2
Requires.private: libxml-2.0
Cflags: -I${includedir}
Libs: -L${libdir} -lhwloc
Libs.private: -lm -lrocm_smi64 -L/home/traversaro/mambaforge/envs/libhwloc/lib -lxml2 -lpthread
Anyhow, this .pc file is actually correct for rocm-enabled builds. The actual problem is https://github.com/conda-forge/conda-forge.github.io/issues/1880, and the actual solution is to start using pkgconf in place of pkg-config.
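To make the Libs vs. Libs.private distinction concrete: in a .pc file, Libs is meant for every consumer, while Libs.private is only meant to be added for static linking (what `pkg-config --static --libs` reports). The following is a minimal Python sketch, a toy parser written for illustration and not the actual pkg-config or pkgconf logic, that reads the relevant lines of the file above and shows that -lrocm_smi64 only shows up in the static set. The names parse_pc and link_flags are hypothetical helpers, and how the private flags end up in downstream link lines is covered by the linked issue, not reproduced here.

```python
import re

# Relevant lines of the .pc file quoted above, copied verbatim.
PC_TEXT = """\
prefix=/home/traversaro/mambaforge/envs/libhwloc
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
Name: hwloc
Version: 2.9.2
Requires.private: libxml-2.0
Cflags: -I${includedir}
Libs: -L${libdir} -lhwloc
Libs.private: -lm -lrocm_smi64 -L/home/traversaro/mambaforge/envs/libhwloc/lib -lxml2 -lpthread
"""


def parse_pc(text):
    """Toy .pc parser (hypothetical helper): returns the 'Key: value' fields
    with ${var} references expanded from earlier 'name=value' lines."""
    variables, fields = {}, {}

    def expand(value):
        # Replace ${name} with the variable's value; unknown names become "".
        return re.sub(r"\$\{(\w+)\}", lambda m: variables.get(m.group(1), ""), value)

    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        var = re.match(r"^(\w+)=(.*)$", line)
        field = re.match(r"^([\w.]+):\s*(.*)$", line)
        if var:
            variables[var.group(1)] = expand(var.group(2))
        elif field:
            fields[field.group(1)] = expand(field.group(2))
    return fields


def link_flags(fields, static=False):
    """Hypothetical helper: Libs only for shared linking,
    Libs plus Libs.private for static linking."""
    flags = fields.get("Libs", "").split()
    if static:
        flags += fields.get("Libs.private", "").split()
    return flags


if __name__ == "__main__":
    fields = parse_pc(PC_TEXT)
    print("shared:", " ".join(link_flags(fields)))
    print("static:", " ".join(link_flags(fields, static=True)))
```

With this file, the shared set contains only the hwloc flags, so a consumer that links dynamically has no reason to mention rocm_smi64 itself.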
Yeah, a separate build with rocm makes sense.
fyi @fl-ferr
In some internal workflows with @fl-ferr, we started seeing errors like:
root@0b32781a2be8:/# python -m rl_zoo3.train --algo sac --env Pendulum-v1 --track
2023-08-29 13:47:39.614246: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-29 13:47:39.649210: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-29 13:47:39.649565: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-29 13:47:40.329235: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
========== Pendulum-v1 ==========
Seed: 2454984228
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
cat: /sys/module/amdgpu/initstate: No such file or directory
ERROR:root:Driver not initialized (amdgpu not found in modules)
That went away as soon as rocm-smi was uninstalled. I am not sure what actually triggered this error, but since the consensus was that a separate variant for rocm made sense anyway, I implemented it in https://github.com/conda-forge/libhwloc-feedstock/pull/66 .
Solution to issue cannot be found in the documentation.
Issue
Since https://github.com/conda-forge/libhwloc-feedstock/pull/62 was merged, all programs that use hwloc, when installed and run on a machine without any cuda or rocm graphics card, print an error message related to a failure in rocm initialization.
See, for example, the output of hwloc-ls or hwloc-info.
The return code of the program is still 0 (i.e. success), but anyhow I was wondering if this was an intended behaviour, as it may be confusing for users.
Installed packages
Environment info