exanauts / CUDSS.jl

Installation of the Library #54

Open · i3s93 opened 1 month ago

i3s93 commented 1 month ago

I would like to use tools from this library in one of my projects, but I'm having some difficulties with the installation process on a Linux cluster.

I have extracted the cuDSS shared object files and set the library path to point to them, following the directions given here. After installing CUDSS.jl, I tried to execute the following test:

using CUDA, CUDA.CUSPARSE, CUDSS, LinearAlgebra, SparseArrays
A = CuSparseMatrixCSR(sprand(100, 100, 0.1))  # random sparse matrix, moved to the GPU
solver = CudssSolver(A, "G", 'F')             # "G": general structure, 'F': full view

On the third line, I receive the following error message:

ERROR: UndefVarError: `libcudss` not defined

I'm not sure what I'm doing wrong. I have also tried setting the environment variable JULIA_CUDSS_LIBRARY_PATH, which is supposed to point to libcudss, but something is not being picked up. I'm using CUDA.jl (v5.4.3) and CUDSS.jl (v0.3.1) on Julia v1.9, if that helps.

amontoison commented 1 month ago

@i3s93 You don't need to install anything related to the source code of cuDSS. We have an artifact system (CUDSS_jll.jl) that downloads and installs cuDSS automatically for users.

You just need

julia> ]
pkg> add CUDSS

It's explained in the README.md, but I should add a note that it also installs the shared library. You should be able to run any Julia example after that.
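
If you want to double-check that the artifact was actually downloaded, the standard JLL helper should report it (assuming nothing unusual about your setup):

julia> using CUDSS_jll

julia> CUDSS_jll.is_available()
true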

i3s93 commented 1 month ago

Thank you @amontoison for your rapid response. I actually started with the base installation in the README.md, but encountered the same error message. That is why I tried to manually set the path, but neither approach worked for me. Here is what I see on my end when I execute the code from my previous comment:

ERROR: UndefVarError: `libcudss` not defined
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:218 [inlined]
  [2] macro expansion
    @ ~/.julia/packages/CUDSS/2E89a/src/libcudss.jl:245 [inlined]
  [3] #31
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:35 [inlined]
  [4] retry_reclaim(f::CUDSS.var"#31#32"{Base.RefValue{Ptr{CUDSS.cudssMatrix}}, Int64, Int64, Int32, CuArray{Int32, 1, CUDA.DeviceMemory}, CuPtr{Nothing}, CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{Float64, 1, CUDA.DeviceMemory}, DataType, DataType, String, Char, Char}, retry_if::CUDSS.var"#retry_if#49")
    @ CUDA ~/.julia/packages/CUDA/Tl08O/src/memory.jl:434
  [5] check
    @ ~/.julia/packages/CUDSS/2E89a/src/error.jl:45 [inlined]
  [6] cudssMatrixCreateCsr
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:34 [inlined]
  [7] CudssMatrix(A::CuSparseMatrixCSR{Float64, Int32}, structure::String, view::Char; index::Char)
    @ CUDSS ~/.julia/packages/CUDSS/2E89a/src/helpers.jl:81
  [8] CudssMatrix
    @ ~/.julia/packages/CUDSS/2E89a/src/helpers.jl:78 [inlined]
  [9] _
    @ ~/.julia/packages/CUDSS/2E89a/src/interfaces.jl:40 [inlined]
 [10] CudssSolver(A::CuSparseMatrixCSR{Float64, Int32}, structure::String, view::Char)
    @ CUDSS ~/.julia/packages/CUDSS/2E89a/src/interfaces.jl:39
 [11] top-level scope
    @ REPL[3]:1
amontoison commented 1 month ago

Can you remove the environment variable JULIA_CUDSS_LIBRARY_PATH and try to recompile CUDSS.jl with:

# Force a fresh precompilation of the package cache
force_recompile(package_name::String) = Base.compilecache(Base.identify_package(package_name))
force_recompile("CUDSS")
using CUDSS
amontoison commented 1 month ago

If it's still not working, what is your NVIDIA GPU and operating system / architecture?

i3s93 commented 1 month ago

I tried your solution, but I'm still seeing the same problem. I'm running on an NVIDIA A100 GPU paired with an AMD EPYC 7763 processor. The operating system is SUSE Linux Enterprise Server 15 SP4.

amontoison commented 1 month ago

Did you install CUDSS.jl on a node with a GPU initially? I would also try forcing Julia to reinstall the artifacts with:

rm -rf ~/.julia/artifacts/*
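
After that, re-instantiating your environment should fetch the artifacts again:

julia> import Pkg

julia> Pkg.instantiate()  # re-downloads any missing artifacts for the active environment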
amontoison commented 1 month ago

Can you also display the output of:

julia> CUDSS_jll.host_platform
Linux x86_64 {cuda=none, cuda_local=false, cxxstring_abi=cxx11, julia_version=1.10.4, libc=glibc, libgfortran_version=5.0.0, libstdcxx_version=3.4.30}

On my laptop I don't have an NVIDIA GPU so the shared library of cuDSS is not installed.

Are the NVIDIA drivers installed on your computer?
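
You can check from within Julia:

julia> using CUDA

julia> CUDA.functional()  # false usually means no driver or GPU is visible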

i3s93 commented 1 month ago

Okay, I have removed the artifacts as you suggested. When I installed the package, I was on a node with the A100. Here is the output you requested:

julia> CUDSS_jll.host_platform
Linux x86_64 {cuda=12.2, cuda_local=true, cxxstring_abi=cxx11, julia_version=1.9.4, libc=glibc, libgfortran_version=5.0.0, libstdcxx_version=3.4.30}

I still see the same error message.

i3s93 commented 1 month ago

Just to follow up: I was able to install and run the package locally on a laptop with an NVIDIA GPU. So far, I have only seen this issue when installing the package on a remote cluster. I will reach out to the system administrators and see if something on their end is disrupting the installation.

carstenbauer commented 1 month ago

Are you using a module on the cluster to get Julia (i.e., module load ...)? If so, can you post the output of module show ...?

It seems that you're trying to use a local CUDA. Assuming that wasn't your intention or your own doing, it might be a global preference that is set when you load the Julia module.
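
You can inspect what ends up on the load path from within Julia:

julia> Base.load_path()  # expanded JULIA_LOAD_PATH; look for a site-wide environment
                         # carrying a LocalPreferences.toml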

Btw, which cluster is this?

i3s93 commented 1 month ago

@carstenbauer: This is on Perlmutter, if that helps. Here is the output of module list

Currently Loaded Modules:
  1) craype-x86-milan     3) craype-network-ofi                      5) PrgEnv-gnu/8.5.0   7) cray-libsci/23.12.5   9) craype/2.7.30    11) perftools-base/23.12.0  13) craype-accel-nvidia80  15) julia/1.9.4
  2) libfabric/1.15.2.0   4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta   6) cray-dsmml/0.2.2   8) cray-mpich/8.1.28    10) gcc-native/12.3  12) cpe/23.12               14) gpu/1.0                16) cudatoolkit/12.2 (g)

  Where:
   g:  built for GPU

I can run any of my Julia CUDA codes fine without the CUDA modules, so the cudatoolkit module is not necessary. I see the same error regardless of whether or not this module is loaded.

carstenbauer commented 1 month ago

@i3s93 I just tested this on Perlmutter.

If I use the julia module (module load julia) I can reproduce your error message.

However, if I unset JULIA_LOAD_PATH and unload the cudatoolkit module, your test above works without any issues in a clean Julia environment that just has CUDA and CUDSS in it.
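
Concretely, after unset JULIA_LOAD_PATH in the shell, something along these lines:

julia> ]
pkg> activate --temp
pkg> add CUDA CUDSS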

JBlaschke commented 1 month ago

The environment in the global JULIA_LOAD_PATH is used to specify the CUDA version (to stop Julia from installing a version of the CUDA runtime that is incompatible with the system) and the MPI configuration. I suspect the latter has no effect here.
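
Concretely, the pinned preferences amount to something like this in the global environment's LocalPreferences.toml (a rough sketch from memory, not the exact file; the MPI entries are omitted):

[CUDA_Runtime_jll]
version = "12.2"
local = true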

@i3s93 did unsetting JULIA_LOAD_PATH cause pkg> add CUDSS to install a newer version of CUDA?

carstenbauer commented 1 month ago

@i3s93 did unsetting JULIA_LOAD_PATH cause pkg> add CUDSS to install a newer version of CUDA?

@JBlaschke I assume the question was for me, because I was the one who did the (successful) test with JULIA_LOAD_PATH unset. And to answer it: yes, afterwards I get 12.5 (instead of 12.2):

julia> CUDA.versioninfo()
CUDA runtime 12.5, artifact installation
CUDA driver 12.0
NVIDIA driver 525.105.17

CUDA libraries:
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+525.105.17

Julia packages:
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

1 device:
  0: NVIDIA A100-PCIE-40GB (sm_80, 38.984 GiB / 40.000 GiB available)

For comparison, this is what I get if I don't unset JULIA_LOAD_PATH and don't unload the cudatoolkit module:

julia> CUDA.versioninfo()
CUDA runtime 12.2, local installation
CUDA driver 12.2
NVIDIA driver 525.105.17

CUDA libraries:
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 2023.2.0 (API 20.0.0)
- NVML: 12.0.0+525.105.17

Julia packages:
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0
- CUDA_Runtime_Discovery: 0.3.4

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: NVIDIA A100-PCIE-40GB (sm_80, 38.984 GiB / 40.000 GiB available)
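
Those two Preferences entries are what pin CUDA.jl to the local 12.2 toolkit. If I remember the CUDA.jl API correctly, something like this (followed by a Julia restart) switches back to the artifact runtime:

julia> using CUDA

julia> CUDA.set_runtime_version!(v"12.5")  # local_toolkit defaults to false, dropping the pin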
JBlaschke commented 1 month ago

Thanks @carstenbauer for checking. So libcudss doesn't appear to be in the cudatoolkit module. I'll see if it's installed anywhere.

One more thing: does the artifact even work on a compute node? For previous versions we would get segfaults.

JBlaschke commented 1 month ago

It looks like we don't have a version of cuDSS on Perlmutter yet. I might go and check the artifact install of CUDA. If that doesn't work, I'd need to develop a module.

amontoison commented 1 month ago

@JBlaschke Do you mean the artifact of cuDSS? The recent version 0.3.0 works fine without segmentation faults.

JBlaschke commented 1 month ago

@amontoison No, I meant running CUDA.jl using the artifact CUDA (instead of the one provided by the OS).

JBlaschke commented 1 month ago

On Perlmutter

i3s93 commented 1 month ago

@carstenbauer Thank you for taking the time to help resolve this issue! I can also confirm that unsetting JULIA_LOAD_PATH worked for me.

@JBlaschke Thank you for your help as well! My tests with cuDSS are small scale, so I am fine with unsetting the environment variable until a better solution becomes available.

@amontoison I greatly appreciate the timely feedback and your taking a look at this problem. Since this does not appear to be an issue with CUDSS.jl, I'm fine with closing it, unless the others would like to continue the discussion!

amontoison commented 1 month ago

I am wondering how relevant it would be to detect a local installation of cuDSS: https://github.com/exanauts/CUDSS.jl/issues/55

cuDSS is still in preview, so every minor release breaks the API, and supporting local installations would require them to always be the most recent version, which is probably hard to maintain.

JBlaschke commented 1 month ago

@amontoison In the past, CUDA.jl would not work at all on Perlmutter unless you used the local install. It might be the case that this is no longer necessary.

I haven't had a chance to test this. Will do so soon. If it is the case that running CUDA_jll is unstable on Perlmutter, then we have no choice but to also use a local CUDSS install...

amontoison commented 3 weeks ago

@carstenbauer @JBlaschke @i3s93 May I ask one of you to test my PR #57? It should help to detect a local install on Perlmutter.

Do you know why Tim checks whether we are precompiling in the __init__ function that I based my PR on?
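
The pattern I mean looks roughly like this (my sketch, not the actual code):

function __init__()
    # Bail out while generating the precompilation cache, before touching
    # the GPU or trying to load the shared library.
    ccall(:jl_generating_output, Cint, ()) == 1 && return
    # ... locate and load libcudss, create handles, etc. ...
end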
Is it to avoid an error when precompiling on a cluster node without GPUs?