collabora / WhisperLive

A nearly-live implementation of OpenAI's Whisper.

Tensor backend core dumped #208

Closed: muaydin closed this issue 5 months ago

muaydin commented 6 months ago

Here is my nvidia-smi result:

(screenshot of nvidia-smi output)

python -c "import torch; import tensorrt; import tensorrt_llm" works fine.
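
For reference, a minimal sanity check (nothing WhisperLive-specific, just the same imports) that prints the versions involved and confirms the GPU is visible to torch:

# minimal sketch: print versions and CUDA visibility to rule out an obvious install mismatch
import torch
import tensorrt
import tensorrt_llm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("tensorrt:", tensorrt.__version__)
print("tensorrt_llm:", tensorrt_llm.__version__)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))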

When a client connects, the server gets a core dump related to the libcudnn_cnn_infer library. Here is the relevant part of the log:

60      0x7ff18fd2dac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff18fd2dac3]
61      0x7ff18fdbebf4 clone + 68
Could not load library libcudnn_cnn_infer.so.8. Error: /lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8: undefined symbol: _ZN5cudnn14cublasSaxpy_v2EP13cublasContextiPKfS3_iPfi, version libcudnn_ops_infer.so.8
[e41f6f59a514:02294] *** Process received signal ***
[e41f6f59a514:02294] Signal: Aborted (6)
[e41f6f59a514:02294] Signal code:  (-6)
[e41f6f59a514:02294] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff18fcdb520]
[e41f6f59a514:02294] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7ff18fd2f9fc]
[e41f6f59a514:02294] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7ff18fcdb476]
[e41f6f59a514:02294] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7ff18fcc17f3]

What could be the reason?
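
One way to narrow this down outside the server is to load the two cuDNN sub-libraries directly with plain ctypes; if they come from mismatched installs, the same undefined-symbol error shows up immediately:

# minimal sketch: load the cuDNN ops library first (cnn_infer needs its symbols),
# then the cnn library; a mismatched pair reproduces the undefined-symbol error
import ctypes

for name in ("libcudnn_ops_infer.so.8", "libcudnn_cnn_infer.so.8"):
    try:
        ctypes.CDLL(name, mode=ctypes.RTLD_GLOBAL)
        print("loaded", name)
    except OSError as err:
        print("failed to load", name, "->", err)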

My Ubuntu version is as follows.

Running cat /etc/os-release on the Azure VM gives:
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"

Your Docker image's Ubuntu version (the image runs on this 20.04 host) is:

root@e41f6f59a514:/home/WhisperLive# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"

Can this be related to Ubuntu 22.04?

muaydin commented 6 months ago

Fixed it with:

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
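
For anyone hitting the same thing: the export appears to work because it puts the distro's matched cuDNN set ahead of another copy on the loader path. A quick way to spot duplicate libcudnn copies on LD_LIBRARY_PATH (a minimal sketch, standard library only):

# minimal sketch: list every libcudnn* visible on LD_LIBRARY_PATH so
# duplicate or mismatched cuDNN installs shadowing each other stand out
import glob
import os

for d in os.environ.get("LD_LIBRARY_PATH", "").split(":"):
    if d:
        for path in sorted(glob.glob(os.path.join(d, "libcudnn*"))):
            print(path)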

But now I am getting "TensorRT-LLM not supported":

python3 run_server.py --port 9090 --backend tensorrt --trt_model_path "/root/TensorRT-LLM-examples/whisper/whisper_small"
[05/06/2024-09:07:24] TensorRT-LLM not supported: [TensorRT-LLM][ERROR] CUDA runtime error in cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr, cubTempStorageSize, logProbs, (T*) nullptr, idVals, (int*) nullptr, vocabSize * batchSize, batchSize, beginOffsetBuf, offsetBuf + 1, 0, sizeof(T) * 8, stream): no kernel image is available for execution on the device (/root/TensorRT-LLM/cpp/tensorrt_llm/kernels/samplingTopPKernels.cu:322)
1       0x7f4b9c74b825 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7f4b9c837858 void tensorrt_llm::kernels::invokeBatchTopPSampling<__half>(void*, unsigned long&, unsigned long&, int**, int*, tensorrt_llm::kernels::FinishedState const*, tensorrt_llm::kernels::FinishedState*, float*, float*, __half const*, int const*, int*, int*, curandStateXORWOW*, int, unsigned long, int const*, float, float const*, CUstream_st*, bool const*) + 2200

no kernel image is available for execution on the device (/root/TensorRT-LLM/cpp/tensorrt_llm/kernels/samplingTopPKernels.cu:322)
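
"no kernel image is available for execution on the device" usually means the binaries were not built for this GPU's compute capability (a Tesla T4 is SM 7.5). A minimal check of what the runtime reports, so it can be compared against the CUDA_ARCHS used for the build (torch is just the most convenient probe here):

# minimal sketch: print the GPU's compute capability to compare against the
# CUDA_ARCHS value the TensorRT-LLM build was compiled for
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")  # a Tesla T4 reports sm_75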

muaydin commented 6 months ago

If I try to build the TensorRT-LLM container manually, I eventually get:

python3 run_server.py --port 9090 --backend tensorrt --trt_model_path "/app/tensorrt_llm/examples/whisper/whisper_small"
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[05/08/2024-09:19:41] TensorRT-LLM not supported: Trying to create tensor with negative dimension -1: [-1, 1500, 768]

GPU: Tesla T4. I built TensorRT-LLM with make -C docker release_build CUDA_ARCHS="75".

Note: it also throws the exception [05/08/2024-09:11:46] TensorRT-LLM not supported: ModelConfig.__init__() missing 2 required positional arguments: 'max_batch_size' and 'max_beam_width'. I fixed it by adding:

decoder_model_config = ModelConfig(
            max_batch_size=self.decoder_config['max_batch_size'],
            max_beam_width=self.decoder_config['max_beam_width'],
...
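
For context, newer TensorRT-LLM releases made max_batch_size and max_beam_width required ModelConfig arguments, which older call sites don't pass. Below is only a hedged sketch of a version-tolerant wrapper, not the fix that eventually landed in the repo; the import path and the decoder_config keys are assumptions taken from the snippet above:

# hedged sketch: pass the newly required arguments only when the installed
# tensorrt_llm ModelConfig actually declares them (the API changed between releases)
import inspect

from tensorrt_llm.runtime import ModelConfig  # import path may differ by version


def build_decoder_model_config(decoder_config, **kwargs):
    params = inspect.signature(ModelConfig).parameters
    if "max_batch_size" in params:
        kwargs.setdefault("max_batch_size", decoder_config["max_batch_size"])
    if "max_beam_width" in params:
        kwargs.setdefault("max_beam_width", decoder_config["max_beam_width"])
    return ModelConfig(**kwargs)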
makaveli10 commented 6 months ago

Thanks for reporting and tracking the issue; we are looking into this on our end as well.

peldszus commented 6 months ago

I also ran into these issues.

If you stick to TensorRT-LLM 0.7.1, you get neither the model config error (I applied the same fix as you) nor the negative dimension error (I didn't have time to look deeper into that).

I have a working build in #221; feel free to give it a try.

makaveli10 commented 5 months ago

Closed by #227