I forgot to mention that I'm building this project via Maven. My Maven pom.xml contains the following DJL dependencies:
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>ai.djl</groupId>
      <artifactId>bom</artifactId>
      <version>0.14.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
...
<!-- PyTorch -->
<dependency>
  <groupId>ai.djl.pytorch</groupId>
  <artifactId>pytorch-engine</artifactId>
</dependency>
<dependency>
  <groupId>ai.djl.pytorch</groupId>
  <artifactId>pytorch-native-auto</artifactId>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>ai.djl.pytorch</groupId>
  <artifactId>pytorch-model-zoo</artifactId>
  <scope>runtime</scope>
</dependency>
...
<dependency>
  <groupId>ai.djl</groupId>
  <artifactId>api</artifactId>
</dependency>
<dependency>
  <groupId>ai.djl</groupId>
  <artifactId>model-zoo</artifactId>
</dependency>
One more update: I tried reinstalling CUDA, and after doing so, ./gradlew debug no longer shows CUDA as installed:
[DEBUG] - cudart library not found.
GPU Count: 0
This is weird, because nvcc --version suggests that I do have CUDA installed. What's going on here?
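For a quick cross-check of what DJL itself detects, a small program along these lines can help (a minimal sketch; the class name is illustrative, and it assumes Engine.getGpuCount() is available in this DJL version):

import ai.djl.engine.Engine;

public class GpuCheck {
    public static void main(String[] args) {
        // Ask the active DJL engine what it loaded and how many GPUs it sees.
        Engine engine = Engine.getInstance();
        System.out.println("Engine: " + engine.getEngineName());
        System.out.println("GPU count: " + engine.getGpuCount());
        // The path DJL searches for native libraries such as libcudart.so.
        System.out.println("java.library.path: " + System.getProperty("java.library.path"));
    }
}

If the GPU count is 0 here while nvcc sees CUDA, the runtime loader most likely cannot find libcudart.so on java.library.path or LD_LIBRARY_PATH.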
@davpapp
java.library.path: /usr/local/cuda-11.6/lib64:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
DJL tries to load the libcudart.so file from the default java.library.path; please make sure your LD_LIBRARY_PATH is configured properly.
Here are a few things you can try:
Hey @frankfliu, I appreciate the suggestions. I was able to make some progress. I reinstalled CUDA and no longer have issues with DJL detecting the GPU.
I realized my Maven POM was incorrect: I should be using the Maven artifact pytorch-native-cu113 instead of pytorch-native-auto if I want to use the GPU. So I've reconfigured my POM to look like this:
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>ai.djl</groupId>
      <artifactId>bom</artifactId>
      <version>0.16.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<!-- DJL -->
<dependency>
  <groupId>ai.djl</groupId>
  <artifactId>api</artifactId>
</dependency>
<!-- PyTorch -->
<dependency>
  <groupId>ai.djl.pytorch</groupId>
  <artifactId>pytorch-engine</artifactId>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>ai.djl.pytorch</groupId>
  <artifactId>pytorch-native-cu113</artifactId>
  <classifier>linux-x86_64</classifier>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>ai.djl.pytorch</groupId>
  <artifactId>pytorch-jni</artifactId>
  <scope>runtime</scope>
</dependency>
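As a sanity check, one can force the engine to initialize at startup so that native-library problems surface immediately (a minimal sketch; Engine.getEngine is the same call that appears in the stack trace below):

import ai.djl.engine.Engine;

// Eagerly initialize the PyTorch engine; this is the point where the
// native library is located (or the JNI download is attempted).
Engine engine = Engine.getEngine("PyTorch");
System.out.println("Loaded " + engine.getEngineName() + " " + engine.getVersion());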
However, when I run my application, I get the following runtime error:
Caused by: ai.djl.engine.EngineException: Failed to load PyTorch native library
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:77)
at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40)
at ai.djl.api@0.16.0/ai.djl.engine.Engine.getEngine(Engine.java:177)
at ai.djl.api@0.16.0/ai.djl.engine.Engine.getInstance(Engine.java:132)
Caused by: java.lang.IllegalStateException: Cannot download jni files: https://publish.djl.ai/pytorch/1.10.0/jnilib/null/linux-x86_64/cu113/libdjl_torch.so
at ai.djl.pytorch.jni.LibUtils.downloadJniLib(LibUtils.java:457)
at ai.djl.pytorch.jni.LibUtils.findJniLibrary(LibUtils.java:223)
at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:74)
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:50)
... 81 more
Caused by: java.io.FileNotFoundException: https://publish.djl.ai/pytorch/1.10.0/jnilib/null/linux-x86_64/cu113/libdjl_torch.so
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1993)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:224)
at java.base/java.net.URL.openStream(URL.java:1161)
at ai.djl.pytorch.jni.LibUtils.downloadJniLib(LibUtils.java:451)
So it seems like I can't download the necessary native library? I tried manually going to the URL (https://publish.djl.ai/pytorch/1.10.0/jnilib/null/linux-x86_64/cu113/libdjl_torch.so), and it looks invalid:
<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>publish/pytorch/1.10.0/jnilib/</Key>
<RequestId>88GQ686EY1KW8MC7</RequestId>
<HostId>
3axN5rC0cRE65wjLrF3kZbQ+H/ZueqElXOWPmtIPvvMEZf8gL4scJA83ba8DZPAaO+9O9Fi9uwc=
</HostId>
</Error>
Any tips as to what might be going on? I really appreciate all the help!
I don't know exactly what went wrong; the URL is invalid. It looks like DJL failed to read the version information from the pytorch-engine.properties file in the jar.
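One way to check whether that file survived packaging is to look it up on the classpath at runtime (a sketch, assuming the file sits at the jar root, as it does in the pytorch-engine artifact):

import java.io.IOException;
import java.io.InputStream;

// Run this from inside the packaged application.
try (InputStream is = ClassLoader.getSystemResourceAsStream("pytorch-engine.properties")) {
    System.out.println(is == null
            ? "pytorch-engine.properties is missing from the classpath"
            : new String(is.readAllBytes()));
} catch (IOException e) {
    e.printStackTrace();
}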
I tested with a similar pom.xml file, and I'm not able to reproduce this issue. See https://github.com/deepjavalibrary/djl-demo/tree/master/developement/fatjar, which works with CUDA 11.3.
Can you share your project?
Could it be an issue that I have CUDA 11.6 installed? My understanding is that minor CUDA versions should be compatible (so 11.3 should be compatible with my system's 11.6).
You need to download djl.ai\pytorch\1.11.0-cu113-win-x86_64.
Feel free to re-open the issue if you still have questions.
I've come across the same issue, but I am using the CPU. DJL indicates that it is using libtorch_cpu.so, yet I still get the CUDA backend error when I try to load a model trained in PyTorch.
I wonder if this could be a problem with my PyTorch model. The model was trained on a GPU. After training, I used the following:
import torch

# Script the GPU-trained model, switch to eval mode, move it to CPU, then save.
scripted_model = torch.jit.script(model)
scripted_model.eval()
scripted_model = scripted_model.to("cpu")
with open(filepath, "wb") as f:
    torch.jit.save(scripted_model, f)
The model saved successfully, but when I then loaded it in DJL I hit this error.
I tried training the model on the CPU from scratch, and then loading succeeded. It seems that a model trained on a GPU cannot be used in DJL's CPU mode, even if it has been moved to the CPU and properly saved using torch.jit.script.
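For what it's worth, DJL's PyTorch engine also has a mapLocation load option that asks TorchScript to remap saved tensors onto the load device, which may be worth trying for a GPU-saved model (a sketch; the model path and NDList in/out types are illustrative):

import java.nio.file.Paths;
import ai.djl.ndarray.NDList;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

Criteria<NDList, NDList> criteria = Criteria.builder()
        .setTypes(NDList.class, NDList.class)
        .optModelPath(Paths.get("scripted_model.pt")) // illustrative path
        .optEngine("PyTorch")
        .optOption("mapLocation", "true") // remap GPU-saved tensors onto the current device
        .build();
try (ZooModel<NDList, NDList> model = criteria.loadModel()) {
    // The TorchScript file should now load on CPU even though it was scripted on a GPU box.
}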
Description
I'm running inference on a model. The model was trained using YoloV5 in PyTorch and then exported. Inference works perfectly on the CPU. However, when I try to run inference on the GPU explicitly with the following code, I get this error. I wonder if there's something wrong with my configuration?
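(The original code is not shown here; as an illustrative sketch, explicit GPU selection in DJL typically looks like this, with a hypothetical model path and NDList in/out types:)

import java.nio.file.Paths;
import ai.djl.Device;
import ai.djl.ndarray.NDList;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

Criteria<NDList, NDList> criteria = Criteria.builder()
        .setTypes(NDList.class, NDList.class)
        .optModelPath(Paths.get("last.torchscript.pt")) // illustrative path to the exported model
        .optEngine("PyTorch")
        .optDevice(Device.gpu()) // request the first GPU explicitly
        .build();
try (ZooModel<NDList, NDList> model = criteria.loadModel()) {
    // Predictions made with model.newPredictor() then fail with the error below.
}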
Expected Behavior
The model should run inference without crashing.
Error Message
Exception in runBots:ai.djl.engine.EngineException: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, Meta, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
CPU: registered at aten/src/ATen/RegisterCPU.cpp:16286 [kernel]
Meta: registered at aten/src/ATen/RegisterMeta.cpp:9460 [kernel]
BackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:609 [kernel]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:60 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9848 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_0.cpp:9750 [kernel]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:255 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1019 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Steps to reproduce
(Paste the commands you ran that produced the error.)
train.py
python export.py --weights last.pt --include torchscript
What have you tried to solve it?
Outputs:
Environment Info
OS: Linux Fedora 35
Java: OpenJDK 17
Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below: