dattalab / keypoint-moseq

https://keypoint-moseq.readthedocs.io

HPC issues "Could not load dynamic library libcudart" #155

Closed KarinHellevik closed 1 month ago

KarinHellevik commented 1 month ago

I installed keypoint-moseq using pip into my folder on the HPC (I could not get the conda installation method to work; it was unable to find `__cuda`) and I get this error:

2024-07-16 13:54:47.379415: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/local/apps/python37/lib
2024-07-16 13:54:47.418905: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/local/apps/python37/lib
2024-07-16 13:54:47.421869: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/local/apps/python37/lib
2024-07-16 13:54:51.957920: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:85] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2024-07-16 13:54:51.958518: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:454] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
/cm/local/apps/uge/var/spool.p6444/bamgpu03/job_scripts/7980240: line 10: 2253365 Aborted (core dumped) python kpmsmodelfitandreindex.py

calebweinreb commented 1 month ago

Please provide more detail. Did you install cuda and cudnn? How? What kind of GPU are you using?

KarinHellevik commented 1 month ago

From `pip list`: jaxlib 0.3.22+cuda11.cudnn82

According to our HPC documentation, the GPUs are "Tesla V100 GPUs with the NVLINK interconnect".
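For readers comparing a wheel against the modules on their cluster, here is a minimal sketch that splits the jaxlib version string quoted above into its parts. The helper name `parse_jaxlib_local_tag` is hypothetical, and the reading of `cudnn82` as "built against cuDNN 8.2" is an interpretation of the wheel tag, not something the sketch verifies.

```python
import re


def parse_jaxlib_local_tag(version: str) -> dict:
    """Split a version such as '0.3.22+cuda11.cudnn82' into base, cuda and cudnn tags.

    Hypothetical helper for illustration; the tags are returned as raw strings
    and are not interpreted further.
    """
    base, _, local = version.partition("+")
    tags = {"base": base, "cuda": None, "cudnn": None}
    for part in local.split("."):
        match = re.fullmatch(r"(cuda|cudnn)(\d+)", part)
        if match:
            tags[match.group(1)] = match.group(2)
    return tags


print(parse_jaxlib_local_tag("0.3.22+cuda11.cudnn82"))
```

The raw tags can then be compared by eye with the CUDA and cuDNN module versions that are loaded in the job.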

calebweinreb commented 1 month ago

So it sounds like you didn't install CUDA or cuDNN. You either have to install those globally or use the conda install method in the keypoint-moseq docs.

KarinHellevik commented 1 month ago

I was not able to use conda to install keypoint-moseq. I get this error:

`LibMambaUnsatisfiableError: Encountered problems while solving:`

`Could not solve for environment specs`
`The following package could not be installed`
`└─ jaxlib 0.3.22 cuda is not installable because it requires`
`└─ __cuda, which is missing on the system.`

And these are the CUDA toolkit and cuDNN modules available on the HPC system image

calebweinreb commented 1 month ago

Hmm, are you using conda or mamba for the conda install route?

And great, so in principle the pip install should work. Among the CUDA modules that are available, which did you actually load?
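One way to answer that from inside the job itself: the log in the first post reports `LD_LIBRARY_PATH: /cm/local/apps/python37/lib`, which suggests (but does not prove) that no CUDA module directories were on the library path when the script ran. The sketch below lists what the process can actually see. The function names are hypothetical, and the check only covers `LD_LIBRARY_PATH` and `PATH`; the dynamic loader also searches system default locations, so an empty result here is a hint, not a verdict.

```python
import os
import shutil
from pathlib import Path


def find_in_ld_library_path(prefix: str = "libcudart.so") -> list:
    """Return files in LD_LIBRARY_PATH directories whose names start with prefix."""
    hits = []
    for directory in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if directory and Path(directory).is_dir():
            hits.extend(str(p) for p in Path(directory).glob(prefix + "*"))
    return hits


def report() -> dict:
    """Collect the pieces of the environment that the error log complains about."""
    return {
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        "libcudart": find_in_ld_library_path("libcudart.so"),
        "libcudnn": find_in_ld_library_path("libcudnn.so"),
        "ptxas": shutil.which("ptxas"),  # None if ptxas is not on PATH
    }


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

Running this in the same job script, right before the failing Python call, shows whether the modules loaded in an interactive shell are also loaded in the batch environment.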

KarinHellevik commented 1 month ago

Using conda.

I'm loading cuDNN 8.1 and the CUDA 11.2 toolkit.

calebweinreb commented 1 month ago

Hmm, it's possible you need cuDNN 8.2 or higher (your jaxlib build is tagged cudnn82). Also, when do you get the above error? In general, please provide as much detail as possible when posting and commenting on issues. I'd also recommend that you google the issue and search for related issue posts on this repo.

This post could be helpful for the conda install: https://stackoverflow.com/questions/74836151/nothing-provides-cuda-needed-by-tensorflow-2-10-0-cuda112py310he87a039-0
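For context on the `__cuda` error: `__cuda` is a conda virtual package that conda derives from the NVIDIA driver it detects, so it can be missing on a node without a visible GPU driver, such as a login node. Conda documents a `CONDA_OVERRIDE_CUDA` environment variable for declaring the version by hand. The sketch below only checks what conda reports and does not install anything. It assumes that `conda info --json` exposes a `virtual_pkgs` list of `[name, version, build]` entries; the helper names are hypothetical, and the value `"11.2"` is an assumption chosen to match the toolkit module mentioned in this thread.

```python
import json
import os
import shutil
import subprocess


def cuda_virtual_pkg(info: dict):
    """Return the __cuda version from parsed `conda info --json` output, or None.

    Assumes a 'virtual_pkgs' key holding [name, version, build] entries.
    """
    for entry in info.get("virtual_pkgs", []):
        if entry and entry[0] == "__cuda":
            return entry[1] if len(entry) > 1 else ""
    return None


def conda_info(override_cuda=None) -> dict:
    """Run `conda info --json`, optionally with CONDA_OVERRIDE_CUDA set."""
    env = dict(os.environ)
    if override_cuda is not None:
        env["CONDA_OVERRIDE_CUDA"] = override_cuda
    result = subprocess.run(
        ["conda", "info", "--json"],
        env=env, check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__" and shutil.which("conda"):
    print("detected __cuda:", cuda_virtual_pkg(conda_info()))
    # "11.2" is an assumed value matching the CUDA module named in the thread.
    print("with override:", cuda_virtual_pkg(conda_info("11.2")))
```

If the first line prints `None` and the second prints a version, setting the same variable in the shell before running the documented conda install may let the solver proceed, although the GPU libraries still have to be usable on the compute node at run time.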