NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

"Import Error: /libs/libth_common.so: Undefined Symbol" While Building #808

Open eurus-ch opened 8 months ago

eurus-ch commented 8 months ago

Hi,

While trying to run this build command:

python build.py --model_dir $model_dir$ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --max_batch_size 4 \
                --max_input_len 128 \
                --max_output_len 128

we ran into the following fatal error about an undefined symbol:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 56, in _init
    torch.classes.load_library(ft_decoder_lib)
  File "/usr/local/lib/python3.10/dist-packages/torch/_classes.py", line 51, in load_library
    torch.ops.load_library(path)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 841, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "TensorRT-LLM/examples/llama/build.py", line 33, in <module>
    from weight import (get_scaling_factors, load_from_awq_llama, load_from_binary,
  File "TensorRT-LLM/examples/llama/weight.py", line 24, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 61, in <module>
    _init(log_level="error")
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 59, in _init
    raise ImportError(str(e) + msg)
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.

Before that, we had built the wheel with:

python3 ./scripts/build_wheel.py --clean  --trt_root /usr/local/tensorrt

And the software versions are

tensorboard               2.9.0
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
tensorrt                  9.2.0.post12.dev5
tensorrt-llm              0.7.1
torch-tensorrt            0.0.0
pytorch-quantization      2.1.2
torch                     2.1.0a0+32f93b1
torchdata                 0.7.0a0
torchtext                 0.16.0a0
torchvision               0.16.0a0

Have you got any clue how to solve this? Many thanks!
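
As a way to narrow this down, here is a minimal diagnostic sketch (not a fix): import torch first so its libraries are loaded into the process, then try to dlopen libth_common.so directly with ctypes, using the path from the traceback above. If the same undefined-symbol error appears, the wheel was almost certainly built against a different torch build than the one installed.

# Diagnostic sketch: check whether libth_common.so resolves against the
# currently installed torch. The library path is taken from the traceback above.
import ctypes
import torch  # loads libtorch/libc10 into the process first

lib_path = "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so"

print("torch version:", torch.__version__)
print("torch built with cxx11 ABI:", torch.compiled_with_cxx11_abi())

try:
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    print("libth_common.so loaded cleanly")
except OSError as e:
    # An 'undefined symbol' here means the wheel was built against a
    # different torch build than the one installed in this environment.
    print("load failed:", e)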

Shixiaowei02 commented 8 months ago

Please ensure that you build and run TensorRT-LLM in the same environment. Alternatively, you can try building TensorRT-LLM in a Docker container by executing this command:

make -C docker release_build

Thank you!
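
One quick parity check, assuming the installed tensorrt_llm wheel records its dependency pins in package metadata (it may not, in which case the list below comes back empty): compare what the wheel declares about torch with the torch version actually installed in the environment you run in.

# Sketch: compare torch-related requirements recorded by the installed
# tensorrt_llm wheel with the torch actually present in this environment.
from importlib.metadata import requires, version

declared = [r for r in (requires("tensorrt_llm") or []) if r.lower().startswith("torch")]
print("tensorrt_llm", version("tensorrt_llm"), "declares:", declared)
print("installed torch:", version("torch"))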

eurus-ch commented 8 months ago

Using tensorrt-llm 0.6.1 instead, the error changes into this:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMemoryInfo_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/TensorRT-LLM/./examples/llama/build.py", line 906, in <module>
    build(0, args)
  File "/TensorRT-LLM/./examples/llama/build.py", line 850, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/TensorRT-LLM/./examples/llama/build.py", line 609, in build_rank_engine
    profiler.print_memory_usage(f'Rank {rank} Engine build starts')
  File "/TensorRT-LLM/tensorrt_llm/profiler.py", line 197, in print_memory_usage
    alloc_device_mem, _, _ = device_memory_info(device=device)
  File "/TensorRT-LLM/tensorrt_llm/profiler.py", line 148, in device_memory_info
    mem_info = _device_get_memory_info_fn(handle)
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 2438, in nvmlDeviceGetMemoryInfo
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

Thank you, but I'm already developing inside a Docker container, and building another Docker image from within it seems impractical, so...
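
A note on the NVML error: the traceback shows the installed pynvml asking the driver library for nvmlDeviceGetMemoryInfo_v2, which the libnvidia-ml.so.1 visible inside the container does not export. A small diagnostic sketch to confirm that from inside the container:

# Sketch: check whether the NVML library visible in this container exports
# the v2 memory-info entry point that the traceback above asks for.
import ctypes
import pynvml

pynvml.nvmlInit()
print("NVML driver version:", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()

nvml = ctypes.CDLL("libnvidia-ml.so.1")
for sym in ("nvmlDeviceGetMemoryInfo", "nvmlDeviceGetMemoryInfo_v2"):
    try:
        getattr(nvml, sym)  # symbol lookup only, no call
        print(sym, "is exported")
    except AttributeError:
        print(sym, "is MISSING: the libnvidia-ml.so.1 mounted into the "
                   "container does not match a recent enough driver")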

woskii commented 8 months ago

ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.

I solved this error by manually installing PyTorch 2.1.0, with a command like this:

pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
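
After pinning torch like this, a quick sanity check in the same environment is simply to retry the import that failed before; for example:

# Sanity check after reinstalling torch 2.1.0: the import below is exactly
# what failed earlier, so it either succeeds or reproduces the symbol error.
import torch
print("torch:", torch.__version__)

import tensorrt_llm
print("tensorrt_llm:", tensorrt_llm.__version__)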

ekagra-ranjan commented 7 months ago

    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

I too faced this issue. This was the fix: https://github.com/NVIDIA/k8s-device-plugin/issues/331#issuecomment-1859143566

dongteng commented 7 months ago

ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.

I solved this error by manually installing PyTorch 2.1.0, with a command like this:

pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

Thanks

AbhisKmr commented 5 months ago

I'm still facing the same issue.

ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.

I solved this error by manually installing PyTorch 2.1.0, with a command like this:

pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

I'm still getting the same error. My env configs are:

attrs  23.2.0
av  10.0.0
bcrypt  4.1.2
braceexpand  0.1.7
certifi  2020.6.20
cffi  1.16.0
chardet  4.0.0
charset-normalizer  3.3.2
coloredlogs  15.0.1
cryptography  42.0.5
ctranslate2  3.24.0
dbus-python  1.2.16
distro  1.9.0
distro-info  1.0+deb11u1
docker  7.0.0
docker-compose  1.29.2
dockerpty  0.4.1
docopt  0.6.2
einops  0.7.0
encodec  0.1.1
fastcore  1.5.29
faster-whisper  0.9.0
fastprogress  1.0.3
ffmpeg-python  0.2.0
filelock  3.13.3
flatbuffers  24.3.25
fsspec  2024.3.1
future  1.0.0
httplib2  0.18.1
huggingface-hub  0.17.3
humanfriendly  10.0
HyperPyYAML  1.2.2
idna  2.10
Jinja2  3.1.3
joblib  1.3.2
jsonschema  3.2.0
kaldialign  0.9.1
llvmlite  0.42.0
MarkupSafe  2.1.5
more-itertools  10.2.0
mpmath  1.3.0
networkx  3.2.1
numba  0.59.1
numpy  1.26.4
nvidia-cublas-cu12  12.1.3.1
nvidia-cuda-cupti-cu12  12.1.105
nvidia-cuda-nvrtc-cu12  12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12  8.9.2.26
nvidia-cufft-cu12  11.0.2.54
nvidia-curand-cu12  10.3.2.106
nvidia-cusolver-cu12  11.4.5.107
nvidia-cusparse-cu12  12.1.0.106
nvidia-nccl-cu12  2.19.3
nvidia-nvjitlink-cu12  12.4.99
nvidia-nvtx-cu12  12.1.105
nvidia-pyindex  1.0.9
onnxruntime  1.16.0
openai-whisper  20231117
packaging  24.0
paramiko  3.4.0
pillow  10.2.0
pip  20.3.4
protobuf  5.26.1
pycparser  2.22
pycurl  7.43.0.6
PyGObject  3.38.0
PyNaCl  1.5.0
pyrsistent  0.20.0
PySimpleSOAP  1.16.2
python-apt  2.2.1
python-debian  0.1.39
python-debianbts  3.1.0
python-dotenv  0.21.1
python-snappy  0.5.3
PyYAML  5.4.1
regex  2023.12.25
reportbug  7.10.3+deb11u1
requests  2.31.0
ruamel.yaml  0.18.6
ruamel.yaml.clib  0.2.8
scipy  1.12.0
sentencepiece  0.2.0
setuptools  52.0.0
six  1.16.0
soundfile  0.12.1
speechbrain  0.5.16
sympy  1.12
tensorrt  8.6.1.post1
tensorrt-bindings  8.6.1
tensorrt-libs  8.6.1
texttable  1.7.0
tiktoken  0.3.3
tokenizers  0.14.1
torch  2.1.0+cu121
torchaudio  2.1.0+cu121
torchvision  0.16.0+cu121
tqdm  4.66.2
triton  2.1.0
typing-extensions  4.10.0
unattended-upgrades  0.1
urllib3  1.26.5
vocos  0.1.0
websocket-client  0.59.0
websockets  12.0
wheel  0.34.2
WhisperSpeech  0.8

Development hardware: Google Cloud

Error message:

FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 58, in _init
    torch.classes.load_library(ft_decoder_lib)
  File "/usr/local/lib/python3.10/dist-packages/torch/_classes.py", line 51, in load_library
    torch.ops.load_library(path)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 933, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/WhisperFusion/main.py", line 11, in <module>
    from whisper_live.trt_server import TranscriptionServer
  File "/root/WhisperFusion/whisper_live/trt_server.py", line 17, in <module>
    from whisper_live.trt_transcriber import WhisperTRTLLM
  File "/root/WhisperFusion/whisper_live/trt_transcriber.py", line 16, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 64, in <module>
    _init(log_level="error")
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 61, in _init
    raise ImportError(str(e) + msg)
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.
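
The missing symbol in this last trace, _ZN3c1017RegisterOperatorsD1Ev, demangles to a c10::RegisterOperators destructor, which should come from PyTorch's own libraries; note also that this environment has tensorrt 8.6.1 rather than the 9.x used in the original report, so a mismatch between the prebuilt libth_common.so and the local torch/TensorRT stack is still the likely suspect. A hedged diagnostic sketch to see whether the installed torch exports that symbol at all:

# Sketch: find out whether (and where) the torch install in this environment
# exports the symbol that libth_common.so cannot resolve.
import ctypes
import glob
import os
import torch

SYMBOL = "_ZN3c1017RegisterOperatorsD1Ev"  # c10::RegisterOperators destructor
torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")

found = False
for so in sorted(glob.glob(os.path.join(torch_lib_dir, "*.so*"))):
    try:
        lib = ctypes.CDLL(so, mode=ctypes.RTLD_GLOBAL)
        getattr(lib, SYMBOL)  # raises AttributeError if not exported
    except (OSError, AttributeError):
        continue
    print("exported by:", so)
    found = True

if not found:
    print("symbol not found: the installed torch build differs from the one "
          "this libth_common.so was compiled against")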