collabora / WhisperFusion

WhisperFusion builds upon the capabilities of WhisperLive and WhisperSpeech to provide seamless conversations with an AI.

WhisperFusion currently doesn't work in WSL2 with Docker Desktop (CUDA init issue in PyTorch) #52

Closed: stefanom closed this 2 weeks ago

stefanom commented 1 month ago

I followed the instructions from the README and the docker image built fine.

However, when I run it, the WhisperFusion process fails (which leaves the webapp non-functional).

The problem is unfortunately hidden because, by default, all of the build logs in build-models.sh are redirected to /dev/null. Removing that redirect, I get this:

whisperfusion-1  | [06/06/2024-00:16:18] [TRT-LLM] [I] plugin_arg is None, setting it as float16 automatically.
whisperfusion-1  | [06/06/2024-00:16:18] [TRT-LLM] [I] plugin_arg is None, setting it as float16 automatically.
whisperfusion-1  | [06/06/2024-00:16:18] [TRT-LLM] [I] plugin_arg is None, setting it as float16 automatically.
whisperfusion-1  | [06/06/2024-00:16:18] [TRT] [W] Unable to determine GPU memory usage: named symbol not found
whisperfusion-1  | [06/06/2024-00:16:18] [TRT] [W] Unable to determine GPU memory usage: named symbol not found
whisperfusion-1  | [06/06/2024-00:16:18] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 558, GPU 0 (MiB)
whisperfusion-1  | [06/06/2024-00:16:18] [TRT] [E] 6: CUDA initialization failure with error: 500. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
whisperfusion-1  | Traceback (most recent call last):
whisperfusion-1  |   File "/root/TensorRT-LLM-examples/whisper/build.py", line 384, in <module>
whisperfusion-1  |     run_build(args)
whisperfusion-1  |   File "/root/TensorRT-LLM-examples/whisper/build.py", line 378, in run_build
whisperfusion-1  |     build_encoder(model, args)
whisperfusion-1  |   File "/root/TensorRT-LLM-examples/whisper/build.py", line 188, in build_encoder
whisperfusion-1  |     builder = Builder()
whisperfusion-1  |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 82, in __init__
whisperfusion-1  |     self._trt_builder = trt.Builder(logger.trt_logger)
whisperfusion-1  | TypeError: pybind11::init(): factory function returned nullptr
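
(To surface those logs I stripped the redirects from build-models.sh. A rough, hypothetical sketch, assuming the script silences each build step with > /dev/null 2>&1 and that the script path may differ:)

# Strip the /dev/null redirects so build output reaches the container logs,
# then rebuild the image.
sed -i 's| > /dev/null 2>&1||g' build-models.sh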

That traceback suggests the GPU isn't being found. However, if I run

docker run -it --gpus=all --rm whisperfusion:latest nvidia-smi

I get the expected result:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                  Off |
| 30%   42C    P2            108W /  450W |    5983MiB /  24564MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Since nvidia-smi talks to the driver through NVML rather than the CUDA runtime, it can succeed even when CUDA initialization is broken, so next I tried to see if PyTorch can see the GPU with this command

docker run -it --gpus=all --rm whisperfusion:latest python -c 'import torch; print(torch.cuda.device_count())'

I get the expected 1, but when I run this other command

docker run -it --gpus=all --rm whisperfusion:latest python -c 'import torch; print(torch.cuda.is_available())'

I get this error:

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
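
(Side note: recent PyTorch versions can answer device_count() through NVML without fully initializing the CUDA runtime, while is_available() performs the full initialization, which is what fails here. If it helps anyone debugging, forcing initialization should surface the underlying exception directly instead of a warning plus False:)

docker run -it --gpus=all --rm whisperfusion:latest python -c 'import torch; torch.cuda.init()'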

I googled around for that error and found this issue:

https://github.com/NVIDIA/nvidia-container-toolkit/issues/520

which is a recent (last week) problem in the nvidia-container-toolkit about a missing DLL that's needed for CUDA to work properly under WSL.

The issue suggests upgrading the nvidia-container-toolkit, but it also says that if you're using Docker Desktop, as I am, a fix isn't available yet and the only workaround is to downgrade the NVIDIA driver to version 552.xx or earlier. I'm probably just going to wait until a fix is available, but I thought I'd pass this along because others might be running into the same problem.
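
(For anyone hitting this who runs Docker Engine directly inside their WSL2 distro rather than Docker Desktop, the fix suggested in that issue is the usual toolkit upgrade. Roughly, assuming the NVIDIA apt repository is already configured:)

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Re-register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker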

makaveli10 commented 3 weeks ago

@stefanom We are working on a fix and upgrading the TensorRT-LLM version as well. That said, we have it working on WSL2; expect a fix sometime this week or early next week.

makaveli10 commented 3 weeks ago

https://github.com/collabora/WhisperFusion/issues/51

makaveli10 commented 2 weeks ago

Closed by https://github.com/collabora/WhisperFusion/pull/53