collabora / WhisperFusion

WhisperFusion builds upon the capabilities of WhisperLive and WhisperSpeech to provide seamless conversations with an AI.

libth_common.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev with Latest docker image #40

Open OliverWalter opened 4 months ago

OliverWalter commented 4 months ago

I tried to run the latest (as of today) docker image:

docker run --gpus all --shm-size 64G -p 8001:80 ghcr.io/collabora/whisperfusion:latest

I'm getting the error OSError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev. See below for details.

I'm using the following version:

docker pull ghcr.io/collabora/whisperfusion:latest
latest: Pulling from collabora/whisperfusion
Digest: sha256:dc6029a768c15a7588008f415840eea5939fae7b7d079496b5f96242ae83ea48
Status: Image is up to date for ghcr.io/collabora/whisperfusion:latest
ghcr.io/collabora/whisperfusion:latest
s6-rc: info: service legacy-services successfully started
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 58, in _init
    torch.classes.load_library(ft_decoder_lib)
  File "/usr/local/lib/python3.10/dist-packages/torch/_classes.py", line 51, in load_library
    torch.ops.load_library(path)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 933, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/WhisperFusion/main.py", line 11, in <module>
    from whisper_live.trt_server import TranscriptionServer
  File "/root/WhisperFusion/whisper_live/trt_server.py", line 17, in <module>
    from whisper_live.trt_transcriber import WhisperTRTLLM
  File "/root/WhisperFusion/whisper_live/trt_transcriber.py", line 16, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 64, in <module>
    _init(log_level="error")
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 61, in _init
    raise ImportError(str(e) + msg)
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.
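
For anyone debugging this: the missing symbol _ZN3c1017RegisterOperatorsD1Ev demangles to c10::RegisterOperators::~RegisterOperators(), i.e. a PyTorch (c10) symbol, so the error points at an ABI mismatch between the PyTorch installed in the image and the PyTorch that the TensorRT-LLM wheel was built against. A minimal way to confirm this from inside the container (a sketch; it assumes binutils is available for c++filt and overrides the image's default entrypoint):

```bash
# start a shell in the image instead of the normal s6 entrypoint
docker run --gpus all -it --entrypoint /bin/bash ghcr.io/collabora/whisperfusion:latest

# demangle the missing symbol to confirm it belongs to PyTorch's c10 core
echo "_ZN3c1017RegisterOperatorsD1Ev" | c++filt

# compare the installed PyTorch build with the TensorRT-LLM wheel's version
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
pip show tensorrt_llm | head -n 2
```
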
kalradivyanshu commented 4 months ago

I am facing the same issues, any solutions for this?

zoq commented 4 months ago

What GPU do you have?

OliverWalter commented 4 months ago

I tested on an Nvidia A100/80GB with compute capability 8.0.

kalradivyanshu commented 4 months ago

Nvidia L4/24GB, on GCP

kalradivyanshu commented 4 months ago

[image attached]
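
For reference, the compute capability of the GPU in a given environment can be read through the PyTorch install that is already in the image (a sketch; run inside a container started with --gpus all):

```bash
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
# e.g. an A100 reports (8, 0) and an L4 reports (8, 9); the 3090 and 4090 are (8, 6) and (8, 9)
```
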

zoq commented 4 months ago

Thanks, we are able to reproduce the issue. We have to update how we build TensorRT so that it supports more than just the 3090 and 4090; we will push a solution over the next two days.
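
Until the updated image lands, one possible stopgap is to rebuild the TensorRT-LLM wheel inside the container for the architecture you actually have. The sketch below uses TensorRT-LLM's build_wheel.py script; the architecture list (80-real for A100, 89-real for L4/4090) and the checked-out revision are assumptions that should be matched to the TensorRT-LLM version WhisperFusion pins:

```bash
# fetch TensorRT-LLM sources (check out the tag matching the version in the image)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# rebuild the wheel with the desired compute capabilities baked in
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;89-real"

# replace the broken install with the freshly built wheel
pip install --force-reinstall ./build/tensorrt_llm-*.whl
```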