DJL bench on GPU fails using PyTorch Engine

Jimmy-Newtron commented 5 months ago

Description

I want to run a benchmark of a model on GPU, but it fails due to an error in the PyTorch engine.

Expected Behavior

Successful benchmark

Error Message

Caused by: ai.djl.engine.EngineException: default_program(22): error: extra text after expected end of number
      aten_mul[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v * -3.402823466385289e+38.f;
                                                                                                       ^

default_program(25): error: extra text after expected end of number
    aten_add[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v_1 / 5.656854152679443f + v_2 * -3.402823466385289e+38.f;
                                                                                                                                  ^

2 errors detected in the compilation of "default_program".

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_mul_div_add(float* tattention_scores_2, float* tv_, float* aten_add, float* aten_mul) {
{
if (blockIdx.x<2ll ? 1 : 0) {
    float v = __ldg(tv_ + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
    aten_mul[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v * -3.402823466385289e+38.f;
  }  float v_1 = __ldg(tattention_scores_2 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
  float v_2 = __ldg(tv_ + ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)) % 32ll + 32ll * (((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)) / 12288ll));
  aten_add[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v_1 / 5.656854152679443f + v_2 * -3.402823466385289e+38.f;
}
}

        at ai.djl.pytorch.jni.PyTorchLibrary.moduleRunMethod(Native Method) ~[pytorch-engine-0.26.0.jar:?]
        at ai.djl.pytorch.jni.IValueUtils.forward(IValueUtils.java:57) ~[pytorch-engine-0.26.0.jar:?]
        at ai.djl.pytorch.engine.PtSymbolBlock.forwardInternal(PtSymbolBlock.java:155) ~[pytorch-engine-0.26.0.jar:?]
        at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:79) ~[api-0.26.0.jar:?]
        at ai.djl.nn.Block.forward(Block.java:127) ~[api-0.26.0.jar:?]
        at ai.djl.inference.Predictor.predictInternal(Predictor.java:143) ~[api-0.26.0.jar:?]
        at ai.djl.inference.Predictor.batchPredict(Predictor.java:170) ~[api-0.26.0.jar:?]
        ... 5 more
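
For readers triaging similar NVRTC failures: the compiler rejects the literal `-3.402823466385289e+38.f` because, in C and CUDA C++, only a type suffix may follow the exponent digits, so the `.` before the `f` is the "extra text" in the message. The well-formed spelling would end in `e+38f`. The Python sketch below is not part of DJL or PyTorch; the function name `find_malformed_float_literals` and the regular expression are illustrative assumptions. It scans a dumped kernel source for that pattern.

```python
import re

# Matches a float literal whose exponent is followed by a stray ".f",
# e.g. "3.402823466385289e+38.f". A well-formed literal would end in "e+38f".
MALFORMED_LITERAL = re.compile(r"\d+(?:\.\d+)?[eE][+-]?\d+\.[fF]\b")


def find_malformed_float_literals(kernel_source: str) -> list[tuple[int, str]]:
    """Return (line_number, literal) pairs for every suspicious literal."""
    hits = []
    for line_number, line in enumerate(kernel_source.splitlines(), start=1):
        for match in MALFORMED_LITERAL.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


# Two statements adapted from the kernel in the error message above
# (the index expressions are shortened to `i` for readability).
SAMPLE_KERNEL = """\
aten_mul[i] = v * -3.402823466385289e+38.f;
aten_add[i] = v_1 / 5.656854152679443f + v_2 * -3.402823466385289e+38.f;
"""

if __name__ == "__main__":
    for line_number, literal in find_malformed_float_literals(SAMPLE_KERNEL):
        print(f"line {line_number}: {literal}")
```

Note that `5.656854152679443f` on the second line is not flagged: it has no exponent, so the trailing `f` is a valid suffix.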

How to Reproduce?

djl-bench -e PyTorch -w 10 -c 1000 -s "(32,32)l,(32,32)l" -g 1 -p ./models/model/nlp/text_embedding/ai/djl/huggingface/pytorch/elastic/multilingual-e5-small-optimized/0.0.1/multilingual-e5-small-optimized

Execution logs

[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libc10_cuda.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcudnn.so.8.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libnvfuser_codegen.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libc10.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libtorch_cpu.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcaffe2_nvrtc.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcudnn_adv_infer.so.8.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcudnn_cnn_train.so.8.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcudnn_ops_infer.so.8.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libnvrtc-builtins-6c5639ce.so.12.1.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libnvrtc-b51b459d.so.12.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libtorch.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libtorch_cuda_linalg.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcublas-37d11411.so.12.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libtorch_cuda.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcudnn_adv_train.so.8.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcublasLt-f97bfc2c.so.12.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libnvToolsExt-847d78f2.so.1.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcudnn_ops_train.so.8.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcudnn_cnn_infer.so.8.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libgomp-52f2fd74.so.1.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.1.1/cu121/linux-x86_64/native/lib/libcudart-9335f6a2.so.12.gz ...
[INFO ] - Downloading jni https://publish.djl.ai/pytorch/2.1.1/jnilib/0.26.0/linux-x86_64/cu121/libdjl_torch.so to cache ...
[INFO ] - PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization
[INFO ] - Number of inter-op threads is 8
[INFO ] - Number of intra-op threads is 8
[INFO ] - Load PyTorch (2.1.1) in 0.014 ms.
[INFO ] - Running Benchmark on: gpu(0).
Loading:     100% |████████████████████████████████████████|
[INFO ] - Model sentence-camembert-base loaded in: 773.458 ms.
[INFO ] - Warmup with 10 iteration ...
[ERROR] - Unexpected error
ai.djl.translate.TranslateException: ai.djl.engine.EngineException: default_program(22): error: extra text after expected end of number
      aten_mul[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v * -3.402823466385289e+38.f;

Jimmy-Newtron commented 5 months ago

$ sudo dpkg -i djl-bench_0.26.0-1_all.deb
....

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

$ nvidia-smi 
Thu Jan 25 15:01:50 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070 ...    On  | 00000000:01:00.0  On |                  N/A |
| N/A   58C    P8               8W /  90W |     46MiB /  8192MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3732      G   /usr/lib/xorg/Xorg                           45MiB |
+---------------------------------------------------------------------------------------+

$ uname -m && cat /etc/*release
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

siddvenk commented 5 months ago

This looks like it might be an issue with PyTorch tracing itself; it seems similar to https://github.com/pytorch/pytorch/issues/114035.

Also, can you confirm the model you are attempting to use?

In your report, you mention elastic/multilingual-e5-small-optimized
In the execution logs, I see sentence-camembert-base

Both of those models, as well as the one mentioned in the PyTorch issue, are variants of BERT. There might be an issue with tracing those models.

I will try to reproduce the issue once you confirm which model you are facing issues with.

Jimmy-Newtron commented 5 months ago

@siddvenk you spotted it correctly: I have been testing multiple BERT-variant models to compare them.

Here is the list of failing models:

elastic/e5 (priority high)
Lajavaness/base
Lajavaness/large
sbert/all-MiniLM-L6-v2 (important)
sbert/paraphrase-multilingual-mpnet-base-v2
infgrad/stella-base-en-v2
BAAI/bge-large-en-v1.5

All of them are failing

siddvenk commented 5 months ago

Solution: use PyTorch 2.0.1, like this:

PYTORCH_VERSION=2.0.1 djl-bench -e PyTorch -w 10 -c 1000 -s "(32,32)l,(32,32)l" -g 1 -p /home/ubuntu/models/model/nlp/text_embedding/ai/djl/huggingface/pytorch/elastic/multilingual-e5-small-optimized/0.0.1/multilingual-e5-small-optimized.zip

Unfortunately, this seems to be an issue with PyTorch 2.1.x, which is the default PyTorch version for DJL 0.26.0. See this related PyTorch issue: https://github.com/pytorch/pytorch/issues/107503. TorchScript is in maintenance mode, so this issue will likely never be fixed. Until there is support for serializing compiled models so that we can load torch.compiled models, you might have to stick with PyTorch 2.0.1.
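
The override in the command above can also be scripted, for example to rerun the benchmark over several model archives. This is a minimal Python sketch, assuming `djl-bench` is on the `PATH`. The helper names `bench_env`, `bench_command`, and `run_bench` are hypothetical; the flags and the `PYTORCH_VERSION` variable are taken from the command above.

```python
import os
import subprocess


def bench_env(pytorch_version: str) -> dict[str, str]:
    """Copy the current environment and pin the PyTorch version DJL should load."""
    env = dict(os.environ)
    env["PYTORCH_VERSION"] = pytorch_version
    return env


def bench_command(model_path: str) -> list[str]:
    """Build the djl-bench argument list used in this thread."""
    return [
        "djl-bench",
        "-e", "PyTorch",
        "-w", "10",
        "-c", "1000",
        "-s", "(32,32)l,(32,32)l",
        "-g", "1",
        "-p", model_path,
    ]


def run_bench(model_path: str, pytorch_version: str = "2.0.1") -> int:
    """Run one benchmark and return the exit code of djl-bench."""
    completed = subprocess.run(
        bench_command(model_path),
        env=bench_env(pytorch_version),
        check=False,  # let the caller decide how to handle a failed run
    )
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems with the `-s "(32,32)l,(32,32)l"` value.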

I can reproduce your issue:

(.hfdjlvenv) ubuntu@xxxxxxxx:~$ djl-bench -e PyTorch -w 10 -c 1000 -s "(32,32)l,(32,32)l" -g 1 -p /home/ubuntu/models/model/nlp/text_embedding/ai/djl/huggingface/pytorch/elastic/multilingual-e5-small-optimized/0.0.1/multilingual-e5-small-optimized.zip
[INFO ] - DJL will collect telemetry to help us better understand our users’ needs, diagnose issues, and deliver additional features. If you would like to learn more or opt-out please go to: https://docs.djl.ai/docs/telemetry.html for more information.
[INFO ] - PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization
[INFO ] - Number of inter-op threads is 24
[INFO ] - Number of intra-op threads is 24
[INFO ] - Load PyTorch (2.1.1) in 0.033 ms.
[INFO ] - Running Benchmark on: gpu(0).
Downloading: 100% |████████████████████████████████████████|
Loading:     100% |████████████████████████████████████████|
[INFO ] - Model multilingual-e5-small-optimized loaded in: 5199.294 ms.
[INFO ] - Warmup with 10 iteration ...
[ERROR] - Unexpected error
ai.djl.translate.TranslateException: ai.djl.engine.EngineException: default_program(22): error: extra text after expected end of number
      aten_mul[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v * -3.402823466385289e+38.f;
                                                                                                       ^

default_program(25): error: extra text after expected end of number
    aten_add[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v_1 / 5.656854152679443f + v_2 * -3.402823466385289e+38.f;
                                                                                                                                  ^

2 errors detected in the compilation of "default_program".

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

The good news is that PyTorch 2.0.1 seems to work just fine.

(.hfdjlvenv) ubuntu@xxxxx:~$ PYTORCH_VERSION=2.0.1 djl-bench -e PyTorch -w 10 -c 1000 -s "(32,32)l,(32,32)l" -g 1 -p /home/ubuntu/models/model/nlp/text_embedding/ai/djl/huggingface/pytorch/elastic/multilingual-e5-small-optimized/0.0.1/multilingual-e5-small-optimized.zip
[INFO ] - DJL will collect telemetry to help us better understand our users’ needs, diagnose issues, and deliver additional features. If you would like to learn more or opt-out please go to: https://docs.djl.ai/docs/telemetry.html for more information.
[WARN ] - Override PyTorch version: 2.0.1.
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libc10_cuda.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libcublas-3b81d170.so.11.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libnvfuser_codegen.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libc10.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libnvrtc-builtins-2dc4bf68.so.11.8.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libtorch_cpu.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libcaffe2_nvrtc.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libtorch.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libtorch_cuda_linalg.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libnvrtc-672ee683.so.11.2.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libtorch_cuda.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libnvToolsExt-847d78f2.so.1.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libgomp-52f2fd74.so.1.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libcublasLt-b6d14a74.so.11.gz ...
[INFO ] - Downloading https://publish.djl.ai/pytorch/2.0.1/cu118/linux-x86_64/native/lib/libcudart-d0da41ae.so.11.0.gz ...
[INFO ] - Downloading jni https://publish.djl.ai/pytorch/2.0.1/jnilib/0.26.0/linux-x86_64/cu118/libdjl_torch.so to cache ...
[INFO ] - PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization
[INFO ] - Number of inter-op threads is 24
[INFO ] - Number of intra-op threads is 24
[INFO ] - Load PyTorch (2.0.1) in 0.031 ms.
[INFO ] - Running Benchmark on: gpu(0).
Loading:     100% |████████████████████████████████████████|
[INFO ] - Model multilingual-e5-small-optimized loaded in: 531.493 ms.
[INFO ] - Warmup with 10 iteration ...
[INFO ] - Warmup latency, min: 6.199 ms, max: 2030.892 ms
Iteration:   100% |████████████████████████████████████████|
[INFO ] - Inference result: [0.012903332, 0.637843, 0.35279134 ...]
[INFO ] - Throughput: 164.85, completed 1000 iteration in 6066 ms.
[INFO ] - Model loading time: 531.493 ms.
[INFO ] - total P50: 6.031 ms, P90: 6.062 ms, P99: 6.121 ms
[INFO ] - inference P50: 3.668 ms, P90: 3.710 ms, P99: 3.788 ms
[INFO ] - preprocess P50: 0.040 ms, P90: 0.049 ms, P99: 0.069 ms
[INFO ] - postprocess P50: 2.316 ms, P90: 2.340 ms, P99: 2.370 ms
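
When comparing several models, it can help to pull the summary numbers out of logs like the one above. The following Python sketch assumes the log lines keep exactly the format shown in this run; the function name `parse_bench_log` and the result layout are illustrative choices, not a djl-bench feature.

```python
import re

# Patterns follow the summary lines printed at the end of the run above.
THROUGHPUT = re.compile(
    r"Throughput: ([\d.]+), completed (\d+) iteration in (\d+) ms"
)
PERCENTILES = re.compile(
    r"- (\w+) P50: ([\d.]+) ms, P90: ([\d.]+) ms, P99: ([\d.]+) ms"
)


def parse_bench_log(log_text: str) -> dict:
    """Extract throughput and per-stage latency percentiles from a djl-bench log."""
    result = {"latency_ms": {}}
    for line in log_text.splitlines():
        match = THROUGHPUT.search(line)
        if match:
            result["throughput"] = float(match.group(1))
            result["iterations"] = int(match.group(2))
            result["elapsed_ms"] = int(match.group(3))
            continue
        match = PERCENTILES.search(line)
        if match:
            result["latency_ms"][match.group(1)] = {
                "p50": float(match.group(2)),
                "p90": float(match.group(3)),
                "p99": float(match.group(4)),
            }
    return result


# Summary lines copied from the successful PyTorch 2.0.1 run above.
SAMPLE_LOG = """\
[INFO ] - Throughput: 164.85, completed 1000 iteration in 6066 ms.
[INFO ] - Model loading time: 531.493 ms.
[INFO ] - total P50: 6.031 ms, P90: 6.062 ms, P99: 6.121 ms
[INFO ] - inference P50: 3.668 ms, P90: 3.710 ms, P99: 3.788 ms
[INFO ] - preprocess P50: 0.040 ms, P90: 0.049 ms, P99: 0.069 ms
[INFO ] - postprocess P50: 2.316 ms, P90: 2.340 ms, P99: 2.370 ms
"""

if __name__ == "__main__":
    print(parse_bench_log(SAMPLE_LOG))
```

The reported throughput is consistent with the other figures on the same line: 1000 iterations in 6066 ms is about 164.85 iterations per second.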

Jimmy-Newtron commented 5 months ago

Thanks for the investigation. I see that PyTorch is working on a 2.2 release, and I wonder if they will fix the issue as part of it. I hope that in the coming months we will see a working DJL version that supports the PyTorch engine with CUDA 12.1+.

david-sitsky commented 2 months ago

@siddvenk - note that this workaround no longer works, since as of 0.27.0 DJL no longer supports PyTorch 2.0.1: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md.
