I followed the instructions from the README and the docker image built fine.
However, when I ran it, the WhisperFusion process failed (which makes the webapp not work).
The problem is unfortunately hidden because, by default, all of the logs in build-models.sh are sent to /dev/null. After removing that redirection, I get this:
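For reference, a minimal sketch of the kind of change involved (the actual commands in build-models.sh will differ; the build command and log path here are assumptions): replacing a `> /dev/null 2>&1` redirect with a pipe through tee keeps the output visible on the console and saves a copy for debugging:

```shell
# Hypothetical example: a build step in build-models.sh might discard output like
#   python3 build.py ... > /dev/null 2>&1
# Piping through tee instead prints the logs and keeps a copy on disk.
simulated_build() { echo "simulated TRT-LLM build output"; }
simulated_build 2>&1 | tee /tmp/build-models.log
```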
whisperfusion-1 | [06/06/2024-00:16:18] [TRT-LLM] [I] plugin_arg is None, setting it as float16 automatically.
whisperfusion-1 | [06/06/2024-00:16:18] [TRT-LLM] [I] plugin_arg is None, setting it as float16 automatically.
whisperfusion-1 | [06/06/2024-00:16:18] [TRT-LLM] [I] plugin_arg is None, setting it as float16 automatically.
whisperfusion-1 | [06/06/2024-00:16:18] [TRT] [W] Unable to determine GPU memory usage: named symbol not found
whisperfusion-1 | [06/06/2024-00:16:18] [TRT] [W] Unable to determine GPU memory usage: named symbol not found
whisperfusion-1 | [06/06/2024-00:16:18] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 558, GPU 0 (MiB)
whisperfusion-1 | [06/06/2024-00:16:18] [TRT] [E] 6: CUDA initialization failure with error: 500. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
whisperfusion-1 | Traceback (most recent call last):
whisperfusion-1 | File "/root/TensorRT-LLM-examples/whisper/build.py", line 384, in <module>
whisperfusion-1 | run_build(args)
whisperfusion-1 | File "/root/TensorRT-LLM-examples/whisper/build.py", line 378, in run_build
whisperfusion-1 | build_encoder(model, args)
whisperfusion-1 | File "/root/TensorRT-LLM-examples/whisper/build.py", line 188, in build_encoder
whisperfusion-1 | builder = Builder()
whisperfusion-1 | File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 82, in __init__
whisperfusion-1 | self._trt_builder = trt.Builder(logger.trt_logger)
whisperfusion-1 | TypeError: pybind11::init(): factory function returned nullptr
which seems to be a problem of the build not finding my GPU. However, if I run
docker run -it --gpus=all --rm whisperfusion:latest nvidia-smi
I get the expected result
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03 Driver Version: 555.85 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 On | Off |
| 30% 42C P2 108W / 450W | 5983MiB / 24564MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Next, I tried to check whether PyTorch can see the GPU. A first command returned the expected 1, but when I ran another command I got this error:
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
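As an aside, a small diagnostic along these lines can surface the underlying CUDA warning instead of a bare False. This is a hypothetical helper, not part of WhisperFusion; it assumes only that torch may or may not be importable:

```python
# Hypothetical diagnostic sketch: probe CUDA availability and report the
# underlying warning (e.g. "Error 500: named symbol not found") instead of
# a bare False. Assumes nothing beyond an optional torch install.
import warnings

def probe_cuda():
    try:
        import torch
    except ImportError:
        return "torch not installed"
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        available = torch.cuda.is_available()
    if available:
        return f"ok: {torch.cuda.device_count()} device(s)"
    # Surface whatever warning CUDA initialization emitted, if any.
    msgs = [str(w.message) for w in caught]
    return "unavailable: " + ("; ".join(msgs) or "no diagnostic emitted")

print(probe_cuda())
```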
I googled around for that error and found this issue:
https://github.com/NVIDIA/nvidia-container-toolkit/issues/520
which describes a recent (last week) problem in nvidia-container-toolkit about a missing DLL that is needed for CUDA to work properly under WSL.
The issue suggests upgrading nvidia-container-toolkit, but it also says that if you're using Docker Desktop, as I am, a fix is not yet available and the only workaround is to downgrade the NVIDIA driver to version 552.xx or earlier. I'm probably just going to wait until a fix is available, but I thought I'd pass this along because others might be running into the same problem.
@stefanom We are working on a fix and upgrading the TensorRT-LLM version as well. That said, we have it working on WSL2; expect a fix sometime this week or early next week.