alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

A binary linked against cudart/curand shuts down immediately on Windows if no CUDA is available on the system #2341

Open ichinii opened 3 months ago

ichinii commented 3 months ago

I am not sure if this is a real issue. Feel free to close it if it's not of interest.

We had a nasty situation where, in our GitHub CI (windows-latest), the binary immediately shut down with exit code 1 before reaching the main function. The reason is that the binary links against cudart/curand, because alpaka interfaces those libraries in CMake. Using the flag `-Dalpaka_DISABLE_VENDOR_RNG=ON` fixed the issue.

This problem does not occur when using github CI (ubuntu-latest).

You might ask why one would compile with CUDA but not use it. The reason is that we want to ship a release with the ability to run code on the GPU, without forcing users to have an NVIDIA video card.

The 'issue' with this is mainly how nasty it is to track down.

fwyzard commented 3 months ago

Hi @ichinii, is the binary supposed to use the CUDA backend, or not?

If yes, then linking dynamically with libcudart.so or statically with libcudart_static.a is necessary.

If you want to build a binary that uses CUDA only if it's available, I think you need to split the functionality into separate shared libraries and load them at runtime, allowing the load to fail if CUDA is not available.

This is not specific to Alpaka - though I'm not sure if the Alpaka CMake rules help or hinder in this respect.

psychocoderHPC commented 3 months ago

As @fwyzard said, to ship a binary that can be executed even if not all CUDA libraries are available, you must provide the CUDA code path in a shared library. You dynamically load this library at runtime; if there is no CUDA, the loading will fail, and you can handle this at runtime and switch to a CPU path.

Take care: if you use a shared library under Windows and you have a singleton in your application, the singleton instance in the shared library is not equal to the instance in the main binary. That's strange behaviour under Windows, and I know only ugly workarounds for this problem.

ichinii commented 3 months ago

Thanks for your advice. At least on Linux, I am fairly convinced that a CUDA-compiled application does not load the CUDA shared libraries right away, but loads them on demand.

```bash
ldd try_cuda
    linux-vdso.so.1 (0x00007acdadc59000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007acdad800000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007acdada8c000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007acdad7d3000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007acdad5e7000)
    /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007acdadc5b000)
```

I am hoping that this is the same for windows.

fwyzard commented 3 months ago

How did you build try_cuda ?

By default nvcc will link statically libcudart_static.a, and in turn that one uses dlopen() to load libcuda.so.

ichinii commented 3 months ago

Sorry for the long post. Jump to the end if you just want to wrap up the original concern.

> How did you build `try_cuda`?

```c++
// main.cu
int main([[maybe_unused]] int argc, [[maybe_unused]] char* argv[]) {
    return 0;
}
```
```bash
nvcc main.cu -o try_cuda
```

```bash
ldd try_cuda
    linux-vdso.so.1 (0x000077d78206e000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x000077d781c00000)
    libm.so.6 => /usr/lib/libm.so.6 (0x000077d781ea2000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x000077d781bd3000)
    libc.so.6 => /usr/lib/libc.so.6 (0x000077d7819e7000)
    /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x000077d782070000)
```

> By default `nvcc` will link statically `libcudart_static.a`, and in turn that one uses `dlopen()` to load `libcuda.so`.

I think you are right that, per default, the CUDA runtime is linked statically. `nm` shows the cuda symbols:
```bash
nm try_cuda | grep cuda | head -n 5
0000000000057ad0 T cudaArrayGetInfo
0000000000057f00 T cudaArrayGetMemoryRequirements
0000000000057d10 T cudaArrayGetPlane
00000000000582e0 T cudaArrayGetSparseProperties
000000000004d3a0 T cudaChooseDevice
```
libcuda then gets loaded at runtime. My guess was also right that it only gets loaded as soon as CUDA code is invoked: if the application does not make use of CUDA code, then libcuda won't be loaded. I tested this by printing the loaded shared libraries at runtime using rtld-audit:

```c++
// audit.cpp -- rtld-audit library (see man 7 rtld-audit)
#include <iostream>
#include <link.h>

// the dynamic linker looks these functions up by name,
// so they need C linkage (no C++ name mangling)
extern "C" unsigned int la_version(unsigned int version) {
    if (version == 0) { return version; }
    return LAV_CURRENT;
}

extern "C" char *la_objsearch(const char *name, uintptr_t *cookie, unsigned int flag) {
    // print dynamically loaded libs
    std::cout << name << std::endl;
    return const_cast<char*>(name);
}
```
```bash
g++ -fPIC -shared -O3 -g -o auditlib.so audit.cpp
```
```c++
// main.cu
#include <iostream>

int main([[maybe_unused]] int argc, [[maybe_unused]] char* argv[]) {
    std::cout << "main()" << std::endl;

    // invoke a cuda function conditionally
    if (1 < argc) {
        int* a;
        cudaMalloc(&a, 4);
    }

    return 0;
}
```
```bash
nvcc src/main.cu -o try_cuda
```
```bash
# here we don't invoke any cuda functions
LD_AUDIT=auditlib.so ./try_cuda
libstdc++.so.6
/usr/lib/libstdc++.so.6
libm.so.6
/usr/lib/libm.so.6
libgcc_s.so.1
/usr/lib/libgcc_s.so.1
libc.so.6
/usr/lib/libc.so.6
main()
```

As seen, no libcuda is loaded.

```bash
# this time we invoke cuda functions
LD_AUDIT=auditlib.so ./try_cuda alpaka is awesome
libstdc++.so.6
/usr/lib/libstdc++.so.6
libm.so.6
/usr/lib/libm.so.6
libgcc_s.so.1
/usr/lib/libgcc_s.so.1
libc.so.6
/usr/lib/libc.so.6
main()
libcuda.so.1
/usr/lib/libcuda.so.1
libdl.so.2
/usr/lib/libdl.so.2
libpthread.so.0
/usr/lib/libpthread.so.0
librt.so.1
/usr/lib/librt.so.1
libcrypto.so.3
/usr/lib/libcrypto.so.3
```

As soon as we use `cudaMalloc`, libcuda will be loaded.

I think we went a little off the road here. Let me get back to my original concern: alpaka adds the CUDA libraries as shared library dependencies by default. When the application is executed, CUDA is then required to be installed on the system. IMO this is a restrictive default behaviour, and one that is difficult to dig into. We want to enable the alpaka CUDA backend, but allow people to run the OpenMP backend instead if they do not have an NVIDIA graphics card. I think this is a common use case. As long as `alpaka_DISABLE_VENDOR_RNG` is not turned on, this is not possible. IMO `alpaka_DISABLE_VENDOR_RNG` should be turned on by default, to protect other developers from digging through all of this ^^

I guess in the end it's a design decision whether `alpaka_DISABLE_VENDOR_RNG` should be ON or OFF by default.

Just came across this and wanted to share with you :)

fwyzard commented 3 months ago

Hi @ichinii, linking the cuRAND library or not should be orthogonal to using static or dynamic libraries.

Can you check if https://github.com/alpaka-group/alpaka/pull/2342 fixes the issue for you? With it, you should be able to use

fwyzard commented 3 months ago

> Alpaka adds cuda as shared libraries as a default behavior. When the application is executed, it is required, that cuda is installed on the system. Imo this is restrictive default behavior and difficult to dig.

IMHO linking the CUDA runtime library statically by default is an even worse choice. It has its own set of problems, like the compatibility between the CUDA runtime and CUDA driver libraries (check https://docs.nvidia.com/deploy/cuda-compatibility/ for a fun read), the extra size of the binaries, and the hidden dependency on libcuda.so.

> We want to enable the alpaka cuda backend, but allow people to run the openmp backend instead, if they do not have an nvidia graphics card.

What about the people who have an AMD graphics card?

> I think this is a common use case.

Sure, I agree.

Personally, I think that it is more robust to implement the logic to check if CUDA is available in the application, rather than relying on the static linking of libcudart.

fwyzard commented 3 months ago

> Personally, I think that it is more robust to implement the logic to check if CUDA is available in the application, rather than relying on the static linking of libcudart.

By the way, the fact that alpaka does not include an example of this approach is indeed something that we should fix 🤷🏻‍♂️ !