deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Can't load a model on GPU #1512

Closed · et-blanc closed this issue 2 years ago

et-blanc commented 2 years ago

Description

Hi,

I'm trying to run inference on my GPU (NVIDIA GeForce RTX 3090) using DJL 0.15.0. However, I can't load any model onto it.

I'm new to working with Java and DJL, so any help is very much appreciated.

Thank you.

Error Message

Exception in thread "main" ai.djl.engine.EngineException: 
Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. 
This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). 
If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions.
'aten::empty_strided' is only available for these backends: [CPU, Meta, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, 
AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, 
AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

CPU: registered at aten/src/ATen/RegisterCPU.cpp:18433 [kernel]
Meta: registered at aten/src/ATen/RegisterMeta.cpp:12703 [kernel]
BackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:665 [kernel]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: fallthrough registered at ../aten/src/ATen/ConjugateFallback.cpp:22 [kernel]
Negative: fallthrough registered at ../aten/src/ATen/native/NegateFallback.cpp:22 [kernel]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10491 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:11425 [kernel]
UNKNOWN_TENSOR_TYPE_ID: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:466 [backend fallback]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

        at ai.djl.pytorch.jni.PyTorchLibrary.moduleLoad(Native Method)
        at ai.djl.pytorch.jni.JniUtils.loadModule(JniUtils.java:1360)
        at ai.djl.pytorch.engine.PtModel.load(PtModel.java:89)
        at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:156)
        at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:166)

How to Reproduce?

Steps to reproduce

// Download a traced ResNet-18 TorchScript model
DownloadUtils.download("https://djl-ai.s3.amazonaws.com/mlrepo/model/cv/image_classification/ai/djl/pytorch/resnet/0.0.1/traced_resnet18.pt.gz", "build/pytorch_models/resnet18/resnet18.pt", new ProgressBar());

Translator<Image, Classifications> translator = ImageClassificationTranslator.builder()
      .addTransform(new Resize(256))
      .addTransform(new CenterCrop(224, 224))
      .addTransform(new ToTensor())
      .addTransform(new Normalize(
          new float[] {0.485f, 0.456f, 0.406f},
          new float[] {0.229f, 0.224f, 0.225f}))
      .optApplySoftmax(true)
      .build();

// Build criteria that explicitly target the GPU
Criteria<Image, Classifications> criteria = Criteria.builder()
      .setTypes(Image.class, Classifications.class)
      .optModelPath(Paths.get("build/pytorch_models/resnet18"))
      .optOption("mapLocation", "true")
      .optTranslator(translator)
      .optDevice(Device.gpu())
      .optProgress(new ProgressBar()).build();

ZooModel<Image, Classifications> model = criteria.loadModel();
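
For context, once loadModel() succeeds, inference would proceed roughly as in the sketch below; the image URL is only an illustrative placeholder.

// Rough sketch of the intended inference step (not part of the failure).
Image img = ImageFactory.getInstance()
        .fromUrl("https://resources.djl.ai/images/kitten.jpg"); // placeholder image
try (Predictor<Image, Classifications> predictor = model.newPredictor()) {
    Classifications result = predictor.predict(img);
    System.out.println(result.best());
}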

What have you tried to solve it?

It seems that DJL doesn't detect my GPU:

System.out.println(CudaUtils.getGpuCount()); // 0
System.out.println(CudaUtils.hasCuda()); // false
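
As a stopgap while this is being debugged, the device could be chosen conditionally so the model still loads on CPU when no GPU is detected. A minimal sketch (the variable name cpuFallback is just illustrative):

// Minimal sketch: fall back to CPU when DJL does not detect a GPU.
Device device = CudaUtils.hasCuda() ? Device.gpu() : Device.cpu();
Criteria<Image, Classifications> cpuFallback = Criteria.builder()
      .setTypes(Image.class, Classifications.class)
      .optModelPath(Paths.get("build/pytorch_models/resnet18"))
      .optTranslator(translator)
      .optDevice(device) // GPU only if one is actually visible
      .build();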
zachgk commented 2 years ago

Have you installed NVIDIA CUDA and cuDNN? If so, which version of CUDA? Can you also share which DJL Gradle/Maven dependencies you are using?

et-blanc commented 2 years ago

Yes, I have installed NVIDIA CUDA and cuDNN. My CUDA version is 11.4. You can see my dependencies below:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>bom</artifactId>
            <version>0.15.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>

    <!-- https://mvnrepository.com/artifact/net.java.dev.jna/jna -->
    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna</artifactId>
        <version>5.9.0</version>
    </dependency>

    <dependency>
        <groupId>ai.djl</groupId>
        <artifactId>api</artifactId>
    </dependency>

    <dependency>
        <groupId>ai.djl</groupId>
        <artifactId>basicdataset</artifactId>
    </dependency>

    <dependency>
        <groupId>ai.djl</groupId>
        <artifactId>model-zoo</artifactId>
    </dependency>

    <dependency>
        <groupId>ai.djl.sentencepiece</groupId>
        <artifactId>sentencepiece</artifactId>
    </dependency>

    <dependency>
        <groupId>ai.djl.pytorch</groupId>
        <artifactId>pytorch-engine</artifactId>
    </dependency>

    <dependency>
        <groupId>ai.djl.pytorch</groupId>
        <artifactId>pytorch-model-zoo</artifactId>
    </dependency>

    <!-- https://mvnrepository.com/artifact/ai.djl.pytorch/pytorch-jni -->
    <dependency>
        <groupId>ai.djl.pytorch</groupId>
        <artifactId>pytorch-jni</artifactId>
        <version>1.10.0-0.15.0</version>
    </dependency>

    <dependency>
        <groupId>ai.djl.pytorch</groupId>
        <artifactId>pytorch-native-cu113</artifactId>
        <classifier>linux-x86_64</classifier>
        <version>1.10.0</version>
        <scope>runtime</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/ai.djl.huggingface/tokenizers -->
    <dependency>
        <groupId>ai.djl.huggingface</groupId>
        <artifactId>tokenizers</artifactId>
        <version>0.15.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/args4j/args4j -->
    <dependency>
        <groupId>args4j</groupId>
        <artifactId>args4j</artifactId>
        <version>2.33</version>
    </dependency>

</dependencies>
frankfliu commented 2 years ago

@et-blanc Can you try CUDA 11.3?

CudaUtils tries to load the libcudart.so file; it seems it's not found in LD_LIBRARY_PATH, or your CUDA driver and CUDA runtime point to different folders.

Can you check which version the following command returns:

nvcc --version

One more thing you can try: install the Python version of PyTorch and see if it can pick up the GPU.
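
A quick way to compare what the JVM sees against what DJL detects is a sketch along these lines (using the same CudaUtils calls shown earlier in this thread):

// Diagnostic sketch: print the runtime library path next to DJL's CUDA probe.
System.out.println("LD_LIBRARY_PATH = " + System.getenv("LD_LIBRARY_PATH"));
System.out.println("hasCuda         = " + CudaUtils.hasCuda());
System.out.println("gpuCount        = " + CudaUtils.getGpuCount());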

et-blanc commented 2 years ago

The Python version of PyTorch can pick up the GPU, and the command nvcc --version returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

It seems that the error comes from my version of CUDA being too recent. Is it possible to use DJL with CUDA 11.5 or 11.6? If not, will it be supported soon?

frankfliu commented 2 years ago

@et-blanc DJL PyTorch 1.10.0 should work with CUDA 11.*

Can you check a few things:

  1. run nvidia-smi -l
  2. can you make sure libcudart.so exists and is set properly in LD_LIBRARY_PATH
  3. can you run the following command in the DJL repository? It will print debug information about your system environment:
    cd djl
    ./gradlew debugEngine -Dai.djl.default_engine=PyTorch
et-blanc commented 2 years ago

The command nvidia-smi -l returns:

[screenshot: nvidia-smi output, showing CUDA Version 11.5]

The file libcudart.so exists and is located at /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so. The command echo $LD_LIBRARY_PATH returns:

/usr/local/cuda/lib64:
/usr/local/cuda-11.0/lib64:

Finally, the command ./gradlew debugEngine -Dai.djl.default_engine=PyTorch returns:

[screenshot: debugEngine output]
frankfliu commented 2 years ago

Your libcudart.so comes from cuda-11.0, but nvidia-smi shows your CUDA is 11.5; something is wrong in your system.

The command should be ./gradlew debugEnv; sorry for providing the wrong command.

Also, your djl checkout seems to be an old version; we have already upgraded Gradle to 7.2. Please pull the latest code and try the command again.

frankfliu commented 2 years ago

Feel free to reopen this issue if you are still facing the problem.