FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU accelerated agent based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

Python wheel `libnvrtc-builtins.so` #1193

Closed ptheywood closed 3 weeks ago

ptheywood commented 3 months ago

Originally identified as part of #1191. Although nvrtc sonames are now major version only (since 11.3, using .so.11.2 or .so.12.0), libnvrtc-builtins.so is still explicitly versioned, and is depended upon by the generic libnvrtc.so which we are linking against.

In practice, this means that currently you must have the exact version of the CTK installed and available at runtime as was used to build the python wheel, rather than a "compatible" version.

This was not noticed when testing locally, as CMake causes a RPATH/RUNPATH to be set pointing at the exact location of the shared object, so if installed in the same location it is found even if LD_LIBRARY_PATH does not point to it.


A workaround has been added to the Google Colab notebook in https://github.com/FLAMEGPU/FLAMEGPU2-tutorial-python/commit/47cc1c119654c79509934164228f52b4bf59713b, which installs the nvidia-cuda-nvrtc-cu12==12.0.140 Python package to bring in the matching version of libnvrtc-builtins.so, and then explicitly loads it via ctypes.CDLL so that it is available at runtime.
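
A minimal sketch of that kind of preload, not the notebook's actual code. It assumes the NVIDIA wheel unpacks its libraries under `nvidia/cuda_nvrtc/lib` inside site-packages, which is an assumption about the wheel layout rather than something stated in this thread. `find_packaged_builtins` and `preload_builtins` are hypothetical helper names.

```python
import ctypes
import glob
import os
import site


def find_packaged_builtins(search_roots=None):
    """Return paths of libnvrtc-builtins shared objects found under the given roots.

    Assumption: the NVIDIA wheel places its libraries in
    <site-packages>/nvidia/cuda_nvrtc/lib/.
    """
    if search_roots is None:
        search_roots = list(site.getsitepackages()) + [site.getusersitepackages()]
    matches = []
    for root in search_roots:
        pattern = os.path.join(
            root, "nvidia", "cuda_nvrtc", "lib", "libnvrtc-builtins.so*"
        )
        matches.extend(sorted(glob.glob(pattern)))
    return matches


def preload_builtins(search_roots=None):
    """Load the first packaged libnvrtc-builtins found; return the handle or None."""
    for path in find_packaged_builtins(search_roots):
        # Loading by full path here, before pyflamegpu is imported, is the
        # point of the workaround: the object is already in the process when
        # libnvrtc later asks for it.
        return ctypes.CDLL(path)
    return None
```

The preload would have to run before `import pyflamegpu`, since the aim is for the library to already be in the process when NVRTC first needs it.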

A more robust version of this fix would be to add the exact version of nvidia-cuda-nvrtc-cu12 (or nvidia-cuda-nvrtc-cu11) required to our Python packages' extra_install_requires in setup.py when the packages are intended to be distributed (i.e. only in our CI, not in all local builds, to avoid binary bloat).

In __init__.py, we would then need to ensure that the appropriate .so is loaded at runtime if it is not found implicitly, i.e. try to load it system-wide, catch errors, and if the error is appropriate, explicitly load the exact libnvrtc.so provided by the Python package.
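
The try-then-fall-back logic described above might look roughly like the following. This is a sketch under assumptions, not pyflamegpu's actual `__init__.py`: `load_first_available` is a hypothetical helper, and the candidate list (a bare soname first, then an absolute path inside the installed NVIDIA package) would have to be supplied by the caller.

```python
import ctypes


def load_first_available(candidates, loader=ctypes.CDLL):
    """Try each candidate name or path in order and return (candidate, handle).

    Bare sonames go through the dynamic loader's normal search
    (RPATH/RUNPATH, LD_LIBRARY_PATH, ld.so cache); absolute paths are
    loaded directly. Raises OSError if no candidate could be loaded.
    """
    errors = []
    for candidate in candidates:
        try:
            return candidate, loader(candidate)
        except OSError as exc:
            # ctypes raises OSError when dlopen fails; record why and move on.
            errors.append(f"{candidate}: {exc}")
    raise OSError("could not load any candidate; tried:\n" + "\n".join(errors))
```

Putting the system-wide name first keeps existing CUDA installations in charge, so the packaged copy is only used when the normal search fails.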

Unfortunately, nvidia-cuda-nvrtc-cu11 is only provided on pypi for 11.7.99 and 11.8.89, so this does not work for our 11.2 wheels...
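
To illustrate the constraint, here is a small sketch of how a build script could derive the pinned requirement and detect the unsupported case. `nvrtc_requirement` is a hypothetical helper, and the set of cu11 versions is simply the two listed in this thread, not a live query of PyPI. How the result would be passed into setup.py is not shown.

```python
# cu11 versions this thread reports as available on PyPI. cu12 versions are
# not checked, because the thread only mentions 12.0.140.
KNOWN_CU11_NVRTC_VERSIONS = {"11.7.99", "11.8.89"}


def nvrtc_requirement(nvrtc_version):
    """Return a pinned requirement string, or None if no matching cu11 wheel is known."""
    major = nvrtc_version.split(".")[0]
    if major == "11" and nvrtc_version not in KNOWN_CU11_NVRTC_VERSIONS:
        return None
    return f"nvidia-cuda-nvrtc-cu{major}=={nvrtc_version}"
```

For example, `nvrtc_requirement("12.0.140")` yields the pin used in the Colab workaround, while any 11.2.x version string returns `None`, which is the gap described above.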

Alternative options include:

More detailed investigation notes can be found in #1191

ptheywood commented 3 months ago

Running readelf -d /path/to/_pyflamegpu.so on local Ubuntu builds shows that they do not appear to depend on a versioned libnvrtc-builtins.so explicitly, just libnvrtc.so.12.

The version installed from the distributed wheel however depends on libnvrtc-builtins.so.12.0 explicitly.

CMake 3.28.3 was used locally on Ubuntu 22.04, while the manylinux2014 container from January 2024 used CMake 3.28.1 and is CentOS 7 based.

During pyflamegpu linking, libnvrtc-builtins is present in the command in both cases, but it is unversioned in both, so the source of the difference is still unclear.

ptheywood commented 3 months ago

Adding the following readelf step to the Ubuntu and manylinux CI builds in the readelf-test branch confirms that Ubuntu CI builds do not depend on libnvrtc-builtins.so, but CentOS builds (and Alma 8 builds) do.

find . -name "_pyflamegpu.so" | head -n 1 | xargs readelf -d
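
The same check could be scripted so that CI fails when the explicitly versioned dependency appears. The sketch below parses the text output of `readelf -d`. The helper names are hypothetical, and the regular expressions assume the output layout shown elsewhere in this thread.

```python
import re
import subprocess

# One NEEDED line looks like:
#  0x0000000000000001 (NEEDED)   Shared library: [libnvrtc.so.12]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")
# Sonames carrying both major and minor version, e.g. libnvrtc-builtins.so.12.0
VERSIONED_BUILTINS_RE = re.compile(r"^libnvrtc-builtins\.so\.\d+\.\d+$")


def needed_libraries(readelf_output):
    """Extract NEEDED sonames, in order, from `readelf -d` text output."""
    return NEEDED_RE.findall(readelf_output)


def versioned_builtins(readelf_output):
    """Return any NEEDED entries that pin libnvrtc-builtins to major.minor."""
    return [
        name
        for name in needed_libraries(readelf_output)
        if VERSIONED_BUILTINS_RE.match(name)
    ]


def readelf_dynamic(path):
    """Run `readelf -d` on a shared object and return its stdout."""
    result = subprocess.run(
        ["readelf", "-d", path], check=True, capture_output=True, text=True
    )
    return result.stdout
```

A CI step could then call `versioned_builtins(readelf_dynamic(path))` and treat a non-empty result as a failure.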

This could be a difference in gcc, binutils, some environment variable, platform-specific CMake behaviour, platform-specific CUDA packages or something else entirely, and I am unsure where to start looking to actually pin this down.

If we can encourage CentOS/Alma not to add the explicit link, then we wouldn't need any other workarounds (although adding the dependency on libnvrtc.so via a Python package might still be a good idea, but without also dlopening it, it won't help with manylinux compliance).

ptheywood commented 3 months ago

Checking linker defaults via gcc -Q -v

local Ubuntu

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04) 

alma 8

gcc -Q -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/gcc-toolset-12/root/usr/libexec/gcc/x86_64-redhat-linux/12/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/gcc-toolset-12/root/usr --mandir=/opt/rh/gcc-toolset-12/root/usr/share/man --infodir=/opt/rh/gcc-toolset-12/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-libstdcxx-backtrace --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-12.2.1-20221121/obj-x86_64-redhat-linux/isl-install --enable-offload-targets=nvptx-none --without-cuda-driver --enable-offload-defaulted --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.2.1 20221121 (Red Hat 12.2.1-7) (GCC) 

ptheywood commented 3 months ago

Linker command for _pyflamegpu.so from CI

Ubuntu 22.04 CI

2024-04-02T15:44:21.7533132Z /usr/bin/g++-12 -fPIC -O3 -DNDEBUG -shared -Wl,-soname,_pyflamegpu.so -o _pyflamegpu.so CMakeFiles/pyflamegpu_swig.dir/pyflamegpu/flamegpuPYTHON_wrap.cxx.o CMakeFiles/pyflamegpu_swig.dir/cmake_device_link.o   -L/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs  -L/usr/local/cuda-12.3/targets/x86_64-linux/lib  -Wl,-rpath,/usr/local/cuda-12.3/targets/x86_64-linux/lib ../../lib/Release/libflamegpu.a ../../lib/Release/libtinyxml2.a /usr/local/cuda-12.3/targets/x86_64-linux/lib/libnvrtc.so /usr/local/cuda-12.3/targets/x86_64-linux/lib/libnvrtc-builtins.so /usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/librt.a -ldl -lcudadevrt -lcudart_static -lrt -lpthread -ldl

Centos 7 (manylinux2014 ci)

2024-04-02T15:56:39.2275049Z /opt/rh/devtoolset-10/root/usr/bin/g++ -fPIC -O3 -DNDEBUG -shared -Wl,-soname,_pyflamegpu.so -o _pyflamegpu.so CMakeFiles/pyflamegpu_swig.dir/pyflamegpu/flamegpuPYTHON_wrap.cxx.o CMakeFiles/pyflamegpu_swig.dir/cmake_device_link.o   -L/usr/local/cuda-12.0/targets/x86_64-linux/lib/stubs  -L/usr/local/cuda-12.0/targets/x86_64-linux/lib  -Wl,-rpath,/usr/local/cuda-12.0/targets/x86_64-linux/lib ../../lib/Release/libflamegpu.a ../../lib/Release/libtinyxml2.a /usr/local/cuda-12.0/targets/x86_64-linux/lib/libnvrtc.so /usr/local/cuda-12.0/targets/x86_64-linux/lib/libnvrtc-builtins.so /usr/local/cuda-12.0/targets/x86_64-linux/lib/stubs/libcuda.so /usr/lib64/librt.so -ldl -lpthread -lcudadevrt -lcudart_static -lrt -lpthread -ldl

ptheywood commented 3 months ago

Readelf output from CI:

Notably, the CentOS output includes libnvrtc-builtins.so.12.0, which is the problem.

Ubuntu

Dynamic section at offset 0x443a580 contains 32 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [_pyflamegpu.so]
 0x000000000000001d (RUNPATH)            Library runpath: [/usr/local/cuda-12.3/targets/x86_64-linux/lib]
 0x000000000000000c (INIT)               0x152000
 0x000000000000000d (FINI)               0x9b9cd4
 0x0000000000000019 (INIT_ARRAY)         0x4430000
 0x000000000000001b (INIT_ARRAYSZ)       856 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x4430358
 0x000000000000001c (FINI_ARRAYSZ)       16 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x298
 0x0000000000000005 (STRTAB)             0x36b88
 0x0000000000000006 (SYMTAB)             0xca38
 0x000000000000000a (STRSZ)              678289 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x443f000
 0x0000000000000002 (PLTRELSZ)           78816 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x13e078
 0x0000000000000007 (RELA)               0xdffc8
 0x0000000000000008 (RELASZ)             385200 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0xdfd38
 0x000000006fffffff (VERNEEDNUM)         6
 0x000000006ffffff0 (VERSYM)             0xdc51a
 0x000000006ffffff9 (RELACOUNT)          11609
 0x0000000000000000 (NULL)               0x0

Centos

Dynamic section at offset 0x42ac8a0 contains 36 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc-builtins.so.12.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [_pyflamegpu.so]
 0x000000000000000f (RPATH)              Library rpath: [/usr/local/cuda-12.0/targets/x86_64-linux/lib]
 0x000000000000000c (INIT)               0x150000
 0x000000000000000d (FINI)               0x9a58ac
 0x0000000000000019 (INIT_ARRAY)         0x42a2000
 0x000000000000001b (INIT_ARRAYSZ)       864 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x42a2360
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x298
 0x0000000000000005 (STRTAB)             0x3dbe0
 0x0000000000000006 (SYMTAB)             0xd940
 0x000000000000000a (STRSZ)              636786 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x42b1000
 0x0000000000000002 (PLTRELSZ)           88344 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x13a100
 0x0000000000000007 (RELA)               0xdd5b0
 0x0000000000000008 (RELASZ)             379728 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0xdd390
 0x000000006fffffff (VERNEEDNUM)         9
 0x000000006ffffff0 (VERSYM)             0xd9352
 0x000000006ffffff9 (RELACOUNT)          11329
 0x0000000000000000 (NULL)               0x0

mondus commented 2 months ago

Suggestion. Try a build of a simple example on Sheffield HPC system (Cent 7) to see if we can replicate this.

ptheywood commented 2 months ago

A possible fix is to use patchelf to remove the libnvrtc-builtins.so.MM.mm dependency at pyflamegpu build time.

Patchelf is available in manylinux, so this is doable, but getting the command into the right place in our CMake while building _pyflamegpu.so is non-trivial.
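
For reference, the patchelf step itself is small. A hedged sketch of a wrapper a build script might call once the shared object has been linked is shown below. `remove_needed_command` and `remove_needed` are hypothetical names, and where this would hook into the CMake build is deliberately left open.

```python
import subprocess


def remove_needed_command(shared_object, soname):
    """Build the patchelf invocation that drops one DT_NEEDED entry."""
    return ["patchelf", "--remove-needed", soname, shared_object]


def remove_needed(shared_object, soname):
    """Run patchelf in place on shared_object; raises CalledProcessError on failure."""
    subprocess.run(remove_needed_command(shared_object, soname), check=True)
```

For example, `remove_needed("_pyflamegpu.so", "libnvrtc-builtins.so.12.0")` would edit the file in place. This only makes sense if _pyflamegpu.so does not itself reference symbols from the removed library, which has not been verified in this thread.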

ptheywood commented 2 months ago

A build on Stanage of current master (b5173e78765d03be8d1b393f8a037499b064b676) using:

module load CUDA/12.0.0 GCC/12.3.0 Python/3.11.3-GCCcore-12.3.0 CMake/3.26.3-GCCcore-12.3.0
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90" -DFLAMEGPU_BUILD_PYTHON=ON
cmake --build . --target pyflamegpu -j `nproc`

Did not result in libnvrtc-builtins.so.12 being a dependency.

$ readelf -d ./lib/Release/python/src/pyflamegpu/_pyflamegpu.so 

Dynamic section at offset 0x79be2e0 contains 36 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libnvJitLink.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [_pyflamegpu.so]
 0x000000000000000c (INIT)               0x14c000
...

So this is not universal on CentOS 7; however, the version of gcc/ld on Stanage is provided by EasyBuild (i.e. /opt/apps/testapps/el7-znver3/software/staging/binutils/2.40-GCCcore-12.3.0/bin/ld), not the CentOS devtoolset packages.

ptheywood commented 2 months ago

Confirmed via a CI run that manylinux_2_28 also results in libnvrtc-builtins.so.12.0 being linked, so it's not CentOS 7 specific, but either manylinux specific or devtoolset specific?

https://github.com/FLAMEGPU/FLAMEGPU2/actions/runs/8850510611/job/24304952167

Dynamic section at offset 0x42745a8 contains 36 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc-builtins.so.12.0]

ptheywood commented 1 month ago

We could look into other libraries that distribute binaries linked against NVRTC and are built on CentOS using devtoolset compilers, to see if they have similar issues with builtins (e.g. PyTorch, though they might handle it entirely differently for manylinux compliance). This could even just be downloading a wheel and checking the readelf output.
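
Since a wheel is a zip archive, pulling out its shared objects for inspection needs only the standard library. The sketch below assumes the wheel has already been downloaded. `extract_shared_objects` is a hypothetical helper, and matching on `.so` in the file name is a simple heuristic rather than a proper ELF check.

```python
import os
import zipfile


def extract_shared_objects(wheel_path, dest_dir):
    """Extract every *.so / *.so.N member of a wheel into dest_dir.

    Returns the list of extracted file paths, ready to be passed to
    `readelf -d`.
    """
    extracted = []
    with zipfile.ZipFile(wheel_path) as wheel:
        for member in wheel.namelist():
            filename = os.path.basename(member)
            if filename.endswith(".so") or ".so." in filename:
                extracted.append(wheel.extract(member, dest_dir))
    return extracted
```

Each returned path can then be inspected with `readelf -d` to see whether a versioned libnvrtc-builtins entry appears among the NEEDED libraries.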

Alternatively, a build on an EL-derived system with a devtoolset-provided host compiler could be worth trying, to narrow down whether the cause is EL itself or a manylinux container difference. This could be done in a Docker container, or on an HPC system with EL-provided compilers (Bede with the native host compiler perhaps, though platform-specific differences might also be a factor then).