bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

Problems deploying pytorch 2.1.2-1.5.10 #1468

Closed jxtps closed 6 months ago

jxtps commented 6 months ago

I'm running into some problems deploying the latest version (as of this writing), 2.1.2-1.5.10, of https://github.com/bytedeco/javacpp-presets/tree/master/pytorch.

If I try to use libtorch (https://download.pytorch.org/libtorch/cu121/libtorch-shared-with-deps-2.1.2%2Bcu121.zip), then I get:

java: symbol lookup error: /mnt/lib/cache/org.bytedeco.pytorch-2.1.2-1.5.10-linux-x86_64.jar/org/bytedeco/pytorch/linux-x86_64/libjnitorch.so: undefined symbol: _ZNK3c106Device3strB5cxx11Ev

Not sure where _ZNK3c106Device3strB5cxx11Ev would be defined? Hmm... maybe I should try the cxx11 ABI version? (i.e. https://download.pytorch.org/libtorch/cu121/libtorch-cxx11-abi-shared-with-deps-2.1.2%2Bcu121.zip)
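
For readers hitting the same error: under the Itanium C++ name mangling used by GCC, the `B5cxx11` fragment is the `[abi:cxx11]` tag, and the symbol demangles to `c10::Device::str[abi:cxx11]() const`. That suggests libjnitorch.so expects a libtorch built with the cxx11 ABI. The sketch below is only a heuristic substring check, not a demangler, and the class and method names are hypothetical:

```java
// Heuristic only: looks for the Itanium "B5cxx11" ABI tag in a mangled name.
// It does not parse the symbol, so an identifier that happens to contain
// this text would also match.
final class AbiTagCheck {
    static boolean hasCxx11AbiTag(String mangledSymbol) {
        return mangledSymbol.startsWith("_Z") && mangledSymbol.contains("B5cxx11");
    }

    public static void main(String[] args) {
        // Symbol copied from the error message above.
        String symbol = "_ZNK3c106Device3strB5cxx11Ev";
        System.out.println(symbol + " carries cxx11 ABI tag: " + hasCxx11AbiTag(symbol));
    }
}
```

If the check is positive for an undefined symbol, the pre-cxx11 libtorch download is unlikely to export it.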

If I switch to using

    "org.bytedeco" % "pytorch-platform" % "2.1.2-1.5.10", // https://mvnrepository.com/artifact/org.bytedeco/pytorch-platform
    "org.bytedeco" % "pytorch-platform-gpu" % "2.1.2-1.5.10", // https://mvnrepository.com/artifact/org.bytedeco/pytorch-platform-gpu
    "org.bytedeco" % "cuda-platform-redist" % "12.3-8.9-1.5.10", // https://mvnrepository.com/artifact/org.bytedeco/cuda-platform-redist
    "org.bytedeco" % "mkl-platform-redist" % "2024.0-1.5.10", // https://mvnrepository.com/artifact/org.bytedeco/mkl

then I get:

java.lang.RuntimeException: nvrtc: error: failed to open libnvrtc-builtins.so.12.3.
  Make sure that libnvrtc-builtins.so.12.3 is installed correctly.
nvrtc compilation failed:
...

when I try to actually load a pytorch model.

Looking in /mnt/lib/cache/org.bytedeco.cuda-12.3-8.9-1.5.10-linux-x86_64-redist.jar/org/bytedeco/cuda/linux-x86_64 I can see:

drwxr-xr-x 2 ubuntu ubuntu      4096 Feb 13 19:37 ./
drwxr-xr-x 3 ubuntu ubuntu      4096 Feb 13 19:37 ../
lrwxrwxrwx 1 ubuntu ubuntu        15 Feb 13 19:37 libcublas.so -> libcublas.so.12
-rw-r--r-- 1 ubuntu ubuntu 106679344 Jan 26 13:29 libcublas.so.12
lrwxrwxrwx 1 ubuntu ubuntu        17 Feb 13 19:37 libcublasLt.so -> libcublasLt.so.12
-rw-r--r-- 1 ubuntu ubuntu 518358624 Jan 26 13:29 libcublasLt.so.12
lrwxrwxrwx 1 ubuntu ubuntu        15 Feb 13 19:37 libcudart.so -> libcudart.so.12
-rw-r--r-- 1 ubuntu ubuntu    703808 Jan 26 13:29 libcudart.so.12
lrwxrwxrwx 1 ubuntu ubuntu        13 Feb 13 19:37 libcudnn.so -> libcudnn.so.8
-rw-r--r-- 1 ubuntu ubuntu    142008 Jan 26 13:29 libcudnn.so.8
lrwxrwxrwx 1 ubuntu ubuntu        23 Feb 13 19:37 libcudnn_adv_infer.so -> libcudnn_adv_infer.so.8
-rw-r--r-- 1 ubuntu ubuntu 125098064 Jan 26 13:29 libcudnn_adv_infer.so.8
lrwxrwxrwx 1 ubuntu ubuntu        23 Feb 13 19:37 libcudnn_adv_train.so -> libcudnn_adv_train.so.8
-rw-r--r-- 1 ubuntu ubuntu 116106128 Jan 26 13:29 libcudnn_adv_train.so.8
lrwxrwxrwx 1 ubuntu ubuntu        23 Feb 13 19:37 libcudnn_cnn_infer.so -> libcudnn_cnn_infer.so.8
-rw-r--r-- 1 ubuntu ubuntu 570673312 Jan 26 13:29 libcudnn_cnn_infer.so.8
lrwxrwxrwx 1 ubuntu ubuntu        23 Feb 13 19:37 libcudnn_cnn_train.so -> libcudnn_cnn_train.so.8
-rw-r--r-- 1 ubuntu ubuntu 125431984 Jan 26 13:29 libcudnn_cnn_train.so.8
lrwxrwxrwx 1 ubuntu ubuntu        23 Feb 13 19:37 libcudnn_ops_infer.so -> libcudnn_ops_infer.so.8
-rw-r--r-- 1 ubuntu ubuntu  90829248 Jan 26 13:29 libcudnn_ops_infer.so.8
lrwxrwxrwx 1 ubuntu ubuntu        23 Feb 13 19:37 libcudnn_ops_train.so -> libcudnn_ops_train.so.8
-rw-r--r-- 1 ubuntu ubuntu  70926584 Jan 26 13:29 libcudnn_ops_train.so.8
lrwxrwxrwx 1 ubuntu ubuntu        14 Feb 13 19:37 libcufft.so -> libcufft.so.11
-rw-r--r-- 1 ubuntu ubuntu 177827520 Jan 26 13:30 libcufft.so.11
lrwxrwxrwx 1 ubuntu ubuntu       122 Feb 13 19:37 libcupti.so.12 -> /mnt/lib/cache/org.bytedeco.pytorch-2.1.2-1.5.10-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libcupti.so.12
lrwxrwxrwx 1 ubuntu ubuntu        15 Feb 13 19:37 libcurand.so -> libcurand.so.10
-rw-r--r-- 1 ubuntu ubuntu  96259504 Jan 26 13:30 libcurand.so.10
lrwxrwxrwx 1 ubuntu ubuntu        17 Feb 13 19:37 libcusolver.so -> libcusolver.so.11
-rw-r--r-- 1 ubuntu ubuntu 115640600 Jan 26 13:30 libcusolver.so.11
lrwxrwxrwx 1 ubuntu ubuntu        17 Feb 13 19:37 libcusparse.so -> libcusparse.so.12
-rw-r--r-- 1 ubuntu ubuntu 267184960 Jan 26 13:30 libcusparse.so.12
lrwxrwxrwx 1 ubuntu ubuntu        39 Feb 13 19:37 libgcc_s.so.1 -> /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
lrwxrwxrwx 1 ubuntu ubuntu       119 Feb 13 19:37 libgfortran.so.5 -> /mnt/lib/cache/org.bytedeco.openblas-0.3.26-1.5.10-linux-x86_64.jar/org/bytedeco/openblas/linux-x86_64/libgfortran.so.5
lrwxrwxrwx 1 ubuntu ubuntu        38 Feb 13 19:37 libgomp.so.1 -> /usr/lib/x86_64-linux-gnu/libgomp.so.1
lrwxrwxrwx 1 ubuntu ubuntu       119 Feb 13 19:37 libiomp5.so -> /mnt/lib/cache/org.bytedeco.pytorch-2.1.2-1.5.10-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libiomp5.so
lrwxrwxrwx 1 ubuntu ubuntu        12 Feb 13 19:37 libnccl.so -> libnccl.so.2
-rw-r--r-- 1 ubuntu ubuntu 209818616 Jan 26 13:30 libnccl.so.2
lrwxrwxrwx 1 ubuntu ubuntu        18 Feb 13 19:37 libnvJitLink.so -> libnvJitLink.so.12
-rw-r--r-- 1 ubuntu ubuntu  52190720 Jan 26 13:30 libnvJitLink.so.12
lrwxrwxrwx 1 ubuntu ubuntu        18 Feb 13 19:37 libnvToolsExt.so -> libnvToolsExt.so.1
-rw-r--r-- 1 ubuntu ubuntu     40136 Jan 26 13:39 libnvToolsExt.so.1
lrwxrwxrwx 1 ubuntu ubuntu        14 Feb 13 19:37 libnvrtc.so -> libnvrtc.so.12
-rw-r--r-- 1 ubuntu ubuntu  60792048 Jan 26 13:39 libnvrtc.so.12
lrwxrwxrwx 1 ubuntu ubuntu       119 Feb 13 19:37 libopenblas.so.0 -> /mnt/lib/cache/org.bytedeco.openblas-0.3.26-1.5.10-linux-x86_64.jar/org/bytedeco/openblas/linux-x86_64/libopenblas.so.0
lrwxrwxrwx 1 ubuntu ubuntu        42 Feb 13 19:37 libquadmath.so.0 -> /usr/lib/x86_64-linux-gnu/libquadmath.so.0
lrwxrwxrwx 1 ubuntu ubuntu       119 Feb 13 19:37 libtbb.so.2 -> /mnt/lib/cache/org.bytedeco.pytorch-2.1.2-1.5.10-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtbb.so.2

i.e. libnvrtc.so is there, but no libnvrtc-builtins.so.12.3.
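
The same check can be scripted instead of reading the listing by eye. This is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and the directory to inspect is passed as the first argument (for example the cache directory shown above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class NvrtcCacheCheck {
    // Returns the sorted names of directory entries that start with the given prefix.
    static List<String> findByPrefix(Path dir, String prefix) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(p -> p.getFileName().toString())
                          .filter(name -> name.startsWith(prefix))
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no path is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> found = findByPrefix(dir, "libnvrtc");
        System.out.println("nvrtc-related files: " + found);
        boolean hasBuiltins = found.stream()
                                   .anyMatch(n -> n.startsWith("libnvrtc-builtins.so"));
        System.out.println("libnvrtc-builtins present: " + hasBuiltins);
    }
}
```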

???

jxtps commented 6 months ago

Switching to the cxx11 ABI version solved the issue, but the recommended usage style with cuda-platform-redist should probably be fixed regardless?

HGuillemet commented 6 months ago

nvrtc-builtins is apparently missing from the list of libraries to preload for pytorch. I'll add it to PR #1466. In the meantime, you can manually load the class org.bytedeco.cuda.global.nvrtc from your code. But I'm not sure that will be enough, because nvrtc-builtins is used by libnvrtc.so and the CUDA libraries do not have their RPATH set to $ORIGIN. @saudet, do you have a way to patch the CUDA libraries and add the RPATH before building the native jar?
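
A minimal sketch of that workaround, run before loading any model. It assumes, as is usual for JavaCPP presets, that initializing the class triggers loading of its native libraries; it goes through `Class.forName` so that it compiles without the CUDA presets on the classpath. The helper name `tryLoad` is hypothetical, and as noted above this may not be sufficient on its own:

```java
final class NvrtcPreload {
    // Initializes the named class; for a JavaCPP presets class this is assumed
    // to run its static initializer, which loads the native libraries.
    static boolean tryLoad(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class missing from the classpath, or native loading failed.
            System.err.println("Could not load " + className + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the comment above.
        boolean loaded = tryLoad("org.bytedeco.cuda.global.nvrtc");
        System.out.println("nvrtc preloaded: " + loaded);
    }
}
```

With the presets on the compile classpath, the more direct form would be `org.bytedeco.javacpp.Loader.load(org.bytedeco.cuda.global.nvrtc.class)`.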

HGuillemet commented 6 months ago

Commit pushed to preload nvrtc-builtins. Forget what I said about RPATH: preload actually loads the library rather than only extracting it to the cache, so setting the RPATH shouldn't be needed.