Exception when running model on GPU: Exception in runBots:ai.djl.engine.EngineException: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build).

davpapp commented 2 years ago

Description

I'm running inference on a model. The model was trained using YoloV5 in PyTorch, and then exported. The inference works perfectly with CPU. However, when I try to run the inference on GPU explicitly with the following code, I get this error. I wonder if there's something wrong in my configuration?

this.criteria = Criteria.builder().setTypes(Image.class, DetectedObjects.class)
  .optModelPath(Paths.get(model_directory))
  .optProgress(new ProgressBar())
  .optDevice(Device.gpu()) // explicitly set to use GPU
  .optTranslator(this.translator)
  .build();

Expected Behavior

The model should run inference without crashing.

Error Message

Exception in runBots:ai.djl.engine.EngineException: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, Meta, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

CPU: registered at aten/src/ATen/RegisterCPU.cpp:16286 [kernel]
Meta: registered at aten/src/ATen/RegisterMeta.cpp:9460 [kernel]
BackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:609 [kernel]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:60 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_0.cpp:9750 [kernel]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:255 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1019 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

Steps to reproduce


Train model using YoloV5 train.py script.
Export model using YoloV5 export.py: python export.py --weights last.pt --include torchscript
Try to load model with Java DJL.
Observe that model loads correctly with CPU, but not with GPU.

What have you tried to solve it?

I checked if there was a GPU available for use with DJL. The following commands printed the following output:

Device d = Device.gpu(0);
System.out.println("Device: " + d + ", id:" + d.getDeviceId() + ", type" + d.getDeviceType());
MemoryUsage mem = CudaUtils.getGpuMemory(d);
System.out.println("max memory:" + mem.getMax());
System.out.println("Cuda version:" + CudaUtils.getCudaVersion());

Outputs:

Device: gpu(0), id:0, typegpu
max memory:8369733632
Cuda version:11060

Ran djl-bench on a generic model from the internet. The GPU worked there (and was a lot faster than the CPU).
I installed CUDA via the DNF package manager on Fedora by following these instructions: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html. One thing I noticed during install is that the installation installed OpenJDK 11, even though I already had OpenJDK 17.
Read through https://github.com/pytorch/pytorch/issues/71402, but this did not seem applicable to my issue.
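For readers comparing the reported CUDA version with DJL's native library flavors, here is a minimal sketch of how the integer 11060 printed above maps to "11.6". The CUDA runtime encodes its version as 1000 * major + 10 * minor. The class name CudaVersion and the flavor() helper are made up for illustration, and the "cu" + major + minor naming is only inferred from the flavor names that appear in this thread (cu113, cu116), not taken from DJL's source.

```java
// Sketch only: decodes the integer CUDA version reported by the CUDA runtime.
final class CudaVersion {

    // The runtime encodes the version as 1000 * major + 10 * minor (11060 -> 11.6).
    static int major(int encoded) {
        return encoded / 1000;
    }

    static int minor(int encoded) {
        return (encoded % 1000) / 10;
    }

    static String dotted(int encoded) {
        return major(encoded) + "." + minor(encoded);
    }

    // Assumption: flavor names follow the "cu" + major + minor pattern seen in
    // this thread (cu113, cu116). This is a guess, not DJL's actual logic.
    static String flavor(int encoded) {
        return "cu" + major(encoded) + minor(encoded);
    }

    public static void main(String[] args) {
        int encoded = 11060; // the value printed by CudaUtils.getCudaVersion() above
        System.out.println("CUDA " + dotted(encoded) + " -> flavor " + flavor(encoded));
    }
}
```

Under these assumptions the installed runtime corresponds to a cu116 flavor, which is worth checking against the native flavors that the DJL version in use actually publishes.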

Environment Info

OS: Linux Fedora 35
Java: OpenJDK 17

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

java.vm.specification.vendor: Oracle Corporation
java.specification.name: Java Platform API Specification
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
java.runtime.version: 17.0.2+8
user.name: dpapp
path.separator: :
os.version: 5.17.4-200.fc35.x86_64
java.runtime.name: OpenJDK Runtime Environment
file.encoding: UTF-8
java.vm.name: OpenJDK 64-Bit Server VM
java.vendor.version: 21.9
java.vendor.url.bug: https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=java-17-openjdk&version=35
java.io.tmpdir: /tmp
org.gradle.internal.http.socketTimeout: 120000
java.version: 17.0.2
user.dir: /home/dpapp/Documents/djl/integration
os.arch: amd64
java.vm.specification.name: Java Virtual Machine Specification
native.encoding: UTF-8
java.library.path: /usr/local/cuda-11.6/lib64:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
java.vm.info: mixed mode, sharing
java.vendor: Red Hat, Inc.
java.vm.version: 17.0.2+8
sun.io.unicode.encoding: UnicodeLittle
library.jansi.path: /home/dpapp/.gradle/native/jansi/1.18/linux64
java.class.version: 61.0
org.gradle.internal.publish.checksums.insecure: true

--------- Environment Variables ---------
PATH: /usr/local/cuda-11.6/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/usr/lib/jvm/java-17-openjdk-17.0.2.0.8-1.fc35.x86_64/bin
XAUTHORITY: /run/user/1000/gdm/Xauthority
HISTCONTROL: erasedups:ignoreboth
XMODIFIERS: @im=ibus
GDMSESSION: gnome-xorg
XDG_DATA_DIRS: /home/dpapp/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop
DBUS_SESSION_BUS_ADDRESS: unix:path=/run/user/1000/bus
XDG_CURRENT_DESKTOP: GNOME
MAIL: /var/spool/mail/dpapp
SSH_AGENT_PID: 72972
LD_LIBRARY_PATH: /usr/local/cuda-11.6/lib64
COLORTERM: truecolor
USERNAME: dpapp
SESSION_MANAGER: local/unix:@/tmp/.ICE-unix/73048,unix/unix:/tmp/.ICE-unix/73048
LOGNAME: dpapp
PWD: /home/dpapp/Documents/djl
HISTIGNORE: &:[ ]*:exit:ls:bg:fg:history:clear
LESSOPEN: ||/usr/bin/lesspipe.sh %s
SHELL: /bin/bash
PAGER: less
STEAM_FRAME_FORCE_CLOSE: 1
OLDPWD: /home/dpapp/Documents/djl
GNOME_TERMINAL_SCREEN: /org/gnome/Terminal/screen/679bc443_0c0e_46a1_a3bf_c58d189c4c5a
DEBUGINFOD_URLS: https://debuginfod.fedoraproject.org/ 
LESS: -R
LC_CTYPE: en_US.UTF-8
SYSTEMD_EXEC_PID: 73079
LS_COLORS: rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
XDG_SESSION_DESKTOP: gnome-xorg
SHLVL: 1
QT_IM_MODULE: ibus
HISTSIZE: 500000
JAVA_HOME: /usr/lib/jvm/java-17-openjdk-17.0.2.0.8-1.fc35.x86_64
TERM: xterm-256color
GNOME_TERMINAL_SERVICE: :1.170
LANG: en_US.UTF-8
MOZ_GMP_PATH: /usr/lib64/mozilla/plugins/gmp-gmpopenh264/system-installed
XDG_SESSION_TYPE: x11
DISPLAY: :1
which_declare: declare -f
XDG_SESSION_CLASS: user
GDM_LANG: en_US.UTF-8
OSH: /home/dpapp/.oh-my-bash
LSCOLORS: Gxfxcxdxdxegedabagacad
DESKTOP_SESSION: gnome-xorg
USER: dpapp
XDG_MENU_PREFIX: gnome-
VTE_VERSION: 6602
WINDOWPATH: 2
SSH_AUTH_SOCK: /run/user/1000/keyring/ssh
SDL_VIDEO_MINIMIZE_ON_FOCUS_LOSS: 0
EDITOR: /usr/bin/nano
HOSTNAME: fedora
XDG_RUNTIME_DIR: /run/user/1000
HOME: /home/dpapp

-------------- Directories --------------
temp directory: /tmp
DJL cache directory: /home/dpapp/.djl.ai
Engine cache directory: /home/dpapp/.djl.ai

------------------ CUDA -----------------
GPU Count: 1
CUDA: 116
ARCH: 86
GPU(0) memory used: 920256512 bytes

----------------- Engines ---------------
DJL version: 0.17.0
Default Engine: MXNet
[WARN ] - No matching cuda flavor for linux found: cu116mkl/sm_86.
[DEBUG] - Using cache dir: /home/dpapp/.djl.ai/mxnet/1.9.0-mkl-linux-x86_64
[INFO ] - Downloading libgfortran.so.3 ...
[INFO ] - Downloading libgomp.so.1 ...
[INFO ] - Downloading libquadmath.so.0 ...
[INFO ] - Downloading libopenblas.so.0 ...
[INFO ] - Downloading libmxnet.so ...
[DEBUG] - Loading mxnet library from: /home/dpapp/.djl.ai/mxnet/1.9.0-mkl-linux-x86_64/libmxnet.so
Default Device: cpu()
PyTorch: 2
MXNet: 0
XGBoost: 10
TensorFlow: 3

--------------- Hardware --------------
Available processors (cores): 16
Byte Order: LITTLE_ENDIAN
Free memory (bytes): 248695272
Maximum memory (bytes): 4139778048
Total memory available to JVM (bytes): 264241152
Heap committed: 264241152
Heap nonCommitted: 31522816
GCC: 
gcc (GCC) 11.3.1 20220421 (Red Hat 11.3.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

davpapp commented 2 years ago

I forgot to mention that I'm building this project via Maven. My Maven pom.xml contains the following DJL dependencies:

<dependencyManagement>
<dependencies>
    <dependency>
        <groupId>ai.djl</groupId>
        <artifactId>bom</artifactId>
        <version>0.14.0</version>
        <type>pom</type>
        <scope>import</scope>
    </dependency>
</dependencies>
</dependencyManagement>
...

<!-- PyTorch -->
<dependency>
    <groupId>ai.djl.pytorch</groupId>
    <artifactId>pytorch-engine</artifactId>
</dependency>
<dependency>
    <groupId>ai.djl.pytorch</groupId>
    <artifactId>pytorch-native-auto</artifactId>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>ai.djl.pytorch</groupId>
    <artifactId>pytorch-model-zoo</artifactId>
    <scope>runtime</scope>
</dependency>
...
<dependency>
    <groupId>ai.djl</groupId>
    <artifactId>api</artifactId>
</dependency>
<dependency>
    <groupId>ai.djl</groupId>
    <artifactId>model-zoo</artifactId>
</dependency>

davpapp commented 2 years ago

One more update: I tried to reinstall CUDA, and after doing so, ./gradlew debugEnv no longer shows CUDA as installed:

[DEBUG] - cudart library not found.
GPU Count: 0

This is weird, because nvcc --version suggests that I do have CUDA installed. What's going on here?

frankfliu commented 2 years ago

@davpapp

java.library.path: /usr/local/cuda-11.6/lib64:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib

DJL tries to load the libcudart.so file from the default java.library.path, so please make sure your LD_LIBRARY_PATH is configured properly.
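A quick way to check this from Java is to look for a libcudart.so file in each directory of java.library.path and LD_LIBRARY_PATH. The sketch below is not how DJL itself locates the library; the class name CudartLocator and its methods are hypothetical, and it only reports whether a matching file name exists in one of the listed directories.

```java
import java.io.File;
import java.util.Optional;

// Sketch only: looks for a CUDA runtime library in a path list.
final class CudartLocator {

    // True for libcudart.so and versioned names such as libcudart.so.11.0.
    static boolean isCudartName(String fileName) {
        return fileName.equals("libcudart.so") || fileName.startsWith("libcudart.so.");
    }

    // Returns the first directory in a path-separator-delimited list
    // that contains a CUDA runtime library, if any.
    static Optional<File> findCudartDir(String pathList) {
        if (pathList == null || pathList.isEmpty()) {
            return Optional.empty();
        }
        for (String entry : pathList.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            File dir = new File(entry);
            String[] names = dir.list(); // null if not a readable directory
            if (names == null) {
                continue;
            }
            for (String name : names) {
                if (isCudartName(name)) {
                    return Optional.of(dir);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println("java.library.path: "
                + findCudartDir(System.getProperty("java.library.path")));
        System.out.println("LD_LIBRARY_PATH:   "
                + findCudartDir(System.getenv("LD_LIBRARY_PATH")));
    }
}
```

If both lines print Optional.empty, the CUDA lib64 directory is probably missing from the environment of the process that launches the JVM.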

frankfliu commented 2 years ago

Here are a few things you can try:

Run inference on the GPU in Python to make sure the same model works on the GPU there.
Trace the model on the GPU and save it as a TorchScript (jit) model, then try to load it with DJL.

davpapp commented 2 years ago

Hey @frankfliu, I appreciate the suggestions. I was able to make some progress. I reinstalled CUDA and no longer have issues with DJL detecting the GPU.

I realized my Maven POM was incorrect, as I should be using the Maven artifact pytorch-native-cu113 instead of pytorch-native-auto if I want to use GPU.

So I've reconfigured my POM to look like this:

<dependencyManagement>
      <dependencies>
        <dependency>
              <groupId>ai.djl</groupId>
              <artifactId>bom</artifactId>
              <version>0.16.0</version>
              <type>pom</type>
              <scope>import</scope>
        </dependency>
      </dependencies>
</dependencyManagement>

<!-- DJL-->
<dependency>
      <groupId>ai.djl</groupId>
      <artifactId>api</artifactId>
</dependency>

<!-- PyTorch -->
<dependency>
      <groupId>ai.djl.pytorch</groupId>
      <artifactId>pytorch-engine</artifactId>
      <scope>runtime</scope>
</dependency>
<dependency>
      <groupId>ai.djl.pytorch</groupId>
      <artifactId>pytorch-native-cu113</artifactId>
      <classifier>linux-x86_64</classifier>
      <scope>runtime</scope>
</dependency>
<dependency>
      <groupId>ai.djl.pytorch</groupId>
      <artifactId>pytorch-jni</artifactId>
      <scope>runtime</scope>
</dependency>

However, when I run my application, I get the following runtime error:

Caused by: ai.djl.engine.EngineException: Failed to load PyTorch native library
    at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:77)
    at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40)
    at ai.djl.api@0.16.0/ai.djl.engine.Engine.getEngine(Engine.java:177)
    at ai.djl.api@0.16.0/ai.djl.engine.Engine.getInstance(Engine.java:132)
Caused by: java.lang.IllegalStateException: Cannot download jni files: https://publish.djl.ai/pytorch/1.10.0/jnilib/null/linux-x86_64/cu113/libdjl_torch.so
    at ai.djl.pytorch.jni.LibUtils.downloadJniLib(LibUtils.java:457)
    at ai.djl.pytorch.jni.LibUtils.findJniLibrary(LibUtils.java:223)
    at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:74)
    at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:50)
    ... 81 more
Caused by: java.io.FileNotFoundException: https://publish.djl.ai/pytorch/1.10.0/jnilib/null/linux-x86_64/cu113/libdjl_torch.so
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1993)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
    at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:224)
    at java.base/java.net.URL.openStream(URL.java:1161)
    at ai.djl.pytorch.jni.LibUtils.downloadJniLib(LibUtils.java:451)

So it seems like I can't download the necessary native library? I tried manually going to the URL (https://publish.djl.ai/pytorch/1.10.0/jnilib/null/linux-x86_64/cu113/libdjl_torch.so), and it looks invalid:

<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>publish/pytorch/1.10.0/jnilib/</Key>
<RequestId>88GQ686EY1KW8MC7</RequestId>
<HostId>
3axN5rC0cRE65wjLrF3kZbQ+H/ZueqElXOWPmtIPvvMEZf8gL4scJA83ba8DZPAaO+9O9Fi9uwc=
</HostId>
</Error>

Any tips as to what might be going on? I really appreciate all the help!

frankfliu commented 2 years ago

I don't know exactly what went wrong. The URL is invalid; it looks like DJL failed to read the version information from the pytorch-engine.properties file in the jar file.

I tested with a similar pom.xml file and was not able to reproduce this issue. See https://github.com/deepjavalibrary/djl-demo/tree/master/developement/fatjar; it works with CUDA 11.3.

Can you share your project?
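To illustrate how a missing property could produce the literal "null" segment in that URL, here is a small sketch. It is not DJL's actual code: the class JniUrlSketch, the key name djl_version, and the method signatures are all hypothetical, and only the URL layout is copied from the stack trace above. In Java, concatenating a null String yields the text "null", which is one plausible way the broken path could arise.

```java
import java.util.Arrays;
import java.util.Properties;

// Sketch only: shows how a missing property turns into "/null/" in a URL.
final class JniUrlSketch {

    // Hypothetical key name; the real keys in pytorch-engine.properties may differ.
    static final String VERSION_KEY = "djl_version";

    // Mirrors the URL layout seen in the stack trace, not DJL's implementation.
    static String jniUrl(Properties props, String pytorchVersion,
                         String classifier, String flavor) {
        String djlVersion = props.getProperty(VERSION_KEY); // null when the key is missing
        return "https://publish.djl.ai/pytorch/" + pytorchVersion + "/jnilib/"
                + djlVersion + "/" + classifier + "/" + flavor + "/libdjl_torch.so";
    }

    // True when any path segment is the literal text "null".
    static boolean hasNullSegment(String url) {
        return Arrays.asList(url.split("/")).contains("null");
    }

    public static void main(String[] args) {
        Properties props = new Properties(); // simulates an unreadable properties file
        String broken = jniUrl(props, "1.10.0", "linux-x86_64", "cu113");
        System.out.println(broken + " -> null segment: " + hasNullSegment(broken));

        props.setProperty(VERSION_KEY, "0.16.0");
        String filled = jniUrl(props, "1.10.0", "linux-x86_64", "cu113");
        System.out.println(filled + " -> null segment: " + hasNullSegment(filled));
    }
}
```

If this is what happened, checking that the pytorch-engine jar on the classpath is intact and matches the BOM version would be a reasonable next step.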

davpapp commented 2 years ago

Could it be an issue that I have CUDA 11.6 installed? My understanding is that minor CUDA versions should be compatible (so 11.3 should be compatible with my system's 11.6).

878647402qq commented 2 years ago

You need to download djl.ai\pytorch\1.11.0-cu113-win-x86_64.

frankfliu commented 1 year ago

Feel free to re-open the issue if you still have questions.

RidiculousDoge commented 1 year ago

I came across the same problem, but I am using the CPU. DJL indicates that I am using libtorch_cpu.so, but I still get the CUDA backend error when I try to load a model trained in PyTorch.

I wonder if this could be a problem with my PyTorch model. The model was trained on a GPU. After that, I exported it with the following:

scripted_model = torch.jit.script(model)
scripted_model.eval()
scripted_model = scripted_model.to("cpu")
with open(filepath,"wb") as f:
    torch.jit.save(scripted_model,f)

The model was saved successfully, but when I then loaded it in DJL I encountered this error.

I then trained the model from scratch on the CPU, and loading succeeded. It seems that a model trained on a GPU cannot be used in DJL's CPU mode, even if it has been moved to the CPU and saved with torch.jit.script.
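One possible explanation, offered only as a hypothesis, is that the scripted code still contains a hard-coded CUDA device (for example a tensor created with device "cuda:0"), which .to("cpu") does not rewrite. A TorchScript file is a zip archive, and as far as I know its serialized source lives under a code/ directory, so it can be inspected without PyTorch. The sketch below (class name TorchScriptDeviceScan is hypothetical) lists archive entries under code/ that mention a quoted "cuda" literal. It is a rough text search, not a parser, and it does not look at pickled constants.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Sketch only: text search for hard-coded CUDA devices in a TorchScript archive.
final class TorchScriptDeviceScan {

    // Assumption: a hard-coded device appears as a quoted literal such as "cuda:0".
    static boolean mentionsCuda(String source) {
        return source.contains("\"cuda") || source.contains("'cuda");
    }

    // Assumption: serialized source files sit under a code/ directory in the archive.
    static boolean isCodeEntry(String name) {
        return name.startsWith("code/") || name.contains("/code/");
    }

    // Returns the names of code entries whose text mentions a CUDA device.
    static List<String> scan(Path modelFile) {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(modelFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory() || !isCodeEntry(entry.getName())) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    if (mentionsCuda(text)) {
                        hits.add(entry.getName());
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java TorchScriptDeviceScan.java <model.pt>");
            return;
        }
        scan(Paths.get(args[0])).forEach(System.out::println);
    }
}
```

If the scan prints any entries, exporting the model again after moving it (and any tensors created inside forward) to the CPU, or loading and re-saving it in Python with a CPU map_location, may be worth trying.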
