apptainer / apptainer

Apptainer: Application containers for Linux
https://apptainer.org

`--rocm` includes unnecessary libraries #1621

Open upsj opened 9 months ago

upsj commented 9 months ago

Version of Apptainer

What version of Apptainer (or Singularity) are you using? main branch

Expected behavior

Containers should provide libraries like rocBLAS, rocFFT, etc. themselves; the host libraries should not be forwarded in this case.

Actual behavior

The host libraries get loaded into .singularity.d/libs.

When running a binary that tries to use rocBLAS, this leads to the following error:

rocBLAS error: Cannot read /.singularity.d/libs/rocblas/library/TensileLibrary.dat: Illegal seek

To my knowledge, both rocBLAS and rocFFT rely on additional files, probably for JIT compilation. It would be safest to let the container provide these files to avoid incompatibilities.

Steps to reproduce this behavior

Run `apptainer shell --rocm` on a container that contains and uses librocblas.so. The host-provided library will be used instead of the container's copy. Trying to run a binary that uses rocBLAS will then fail with the above error.
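One way to confirm which copy of a library a binary picks up is to inspect `ldd` output from inside the container. The following is only a minimal sketch: the function name `host_injected_libs`, the sample `ldd` lines, and the soname `librocblas.so.0` are hypothetical illustrations, and the only path taken from this report is `/.singularity.d/libs`.

```python
def host_injected_libs(ldd_output, prefix="/.singularity.d/libs"):
    """Return {soname: path} for libraries that ldd resolved under `prefix`."""
    injected = {}
    for line in ldd_output.splitlines():
        # Typical ldd line: "\tlibfoo.so.1 => /path/libfoo.so.1 (0x...)"
        if "=>" not in line:
            continue  # e.g. the vdso line, which has no "=>"
        name, _, rest = line.partition("=>")
        fields = rest.split()
        if not fields:
            continue
        path = fields[0]
        if path == prefix or path.startswith(prefix + "/"):
            injected[name.strip()] = path
    return injected


# Hypothetical ldd output, for illustration only.
SAMPLE = (
    "\tlinux-vdso.so.1 (0x00007ffd0b5f2000)\n"
    "\tlibrocblas.so.0 => /.singularity.d/libs/librocblas.so.0 (0x00007f1c2a000000)\n"
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c29c00000)\n"
)

if __name__ == "__main__":
    for soname, path in host_injected_libs(SAMPLE).items():
        print(f"{soname} comes from {path}")
```

Inside the container, the real `ldd` output for the affected binary could be captured (for example with `subprocess.run(["ldd", binary], capture_output=True, text=True).stdout`) and passed to the function in place of `SAMPLE`.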

What OS/distro are you running

Ubuntu 22.04.2 LTS

How did you install Apptainer

from source

GodloveD commented 4 months ago

We discussed this issue and the associated PR (#1622) in the community meeting today. @DrDaveD asked me to create a new issue, but I thought it better to continue on this existing issue instead.

As of now, we believe that the original list of libraries in rocmliblist.conf is flawed, but we are still trying to arrive at a more comprehensive list that covers the majority of use cases. @upsj has determined a different list of libs that works for their use case and has proposed #1622 to update the file accordingly. I think this is an improvement over the current state, but I'm concerned that it is still not a comprehensive, general list that will work across most use cases. I've proposed the following list instead, based on the version numbers in the library names and their correspondence with the compiled kernel module:

libamd_comgr.so
libamdhip64.so
libhiprtc-builtins.so
libhiprtc.so
libhsa-runtime64.so
librocm-core.so

We are hoping that someone with an AMD GPU can test this list with some workloads. Failing that, I will propose a new PR that simply comments out the existing list and adds this one in its place. Hopefully, this will not be too disruptive since users can comment/uncomment libraries if they find that their GPU-enabled workflows are failing.
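The "comment out the existing list and add this one" change could be sketched roughly as follows. This is an illustration, not the actual PR: the helper `swap_lib_list` is hypothetical, and it assumes rocmliblist.conf holds one library name per line with `#` marking comments. The six library names are the ones proposed above.

```python
# Library names proposed in this comment thread.
PROPOSED_LIBS = [
    "libamd_comgr.so",
    "libamdhip64.so",
    "libhiprtc-builtins.so",
    "libhiprtc.so",
    "libhsa-runtime64.so",
    "librocm-core.so",
]


def swap_lib_list(conf_text, new_libs):
    """Comment out every active entry in conf_text, then append new_libs.

    Assumes one library per line and '#' as the comment marker.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            # Keep the old entry visible so users can re-enable it later.
            out.append("# " + stripped)
        else:
            out.append(line)  # blank lines and existing comments pass through
    out.extend(new_libs)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # "libold.so" is a placeholder, not an entry from the real file.
    example = "# ROCm libraries\nlibold.so\n"
    print(swap_lib_list(example, PROPOSED_LIBS), end="")
```

Keeping the old entries as comments rather than deleting them matches the intent stated above: users whose workflows break can uncomment individual libraries.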

DrDaveD commented 4 months ago

I didn't remember there was an existing issue because it wasn't marked with the 1.3.0 milestone. That's fixed now.