NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

bloom 560M can not build #658

Open Lenan22 opened 11 months ago

Lenan22 commented 11 months ago

python build.py --model_dir ./bloom/560M/ --dtype float16 --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/

Open MPI's OFI driver detected multiple equidistant NICs from the current process, but had insufficient information to ensure MPI processes fairly pick a NIC for use. This may negatively impact performance. A more modern PMIx server is necessary to resolve this issue.

[12/14/2023-14:28:39] [TRT-LLM] [I] Serially build TensorRT engines.
[12/14/2023-14:28:39] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 166, GPU 649 (MiB)
[12/14/2023-14:28:41] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +482, GPU +80, now: CPU 784, GPU 729 (MiB)
[12/14/2023-14:28:41] [TRT-LLM] [W] Invalid timing cache, using freshly created one
Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.10/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMemoryInfo_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/bn/xiaxin/wulenan/workspace/TensorRT-LLM/examples/bloom/build.py", line 556, in <module>
    build(0, args)
  File "/mnt/bn/xiaxin/wulenan/workspace/TensorRT-LLM/examples/bloom/build.py", line 524, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/mnt/bn/xiaxin/wulenan/workspace/TensorRT-LLM/examples/bloom/build.py", line 337, in build_rank_engine
    profiler.print_memory_usage(f'Rank {rank} Engine build starts')
  File "/home/tiger/.local/lib/python3.10/site-packages/tensorrt_llm/profiler.py", line 197, in print_memory_usage
    alloc_device_mem, _, _ = device_memory_info(device=device)
  File "/home/tiger/.local/lib/python3.10/site-packages/tensorrt_llm/profiler.py", line 148, in device_memory_info
    mem_info = _device_get_memory_info_fn(handle)
  File "/home/tiger/.local/lib/python3.10/site-packages/pynvml/nvml.py", line 2438, in nvmlDeviceGetMemoryInfo
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
  File "/home/tiger/.local/lib/python3.10/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
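
For anyone debugging this, a minimal check (not part of the original report) to confirm whether the local libnvidia-ml.so.1 actually exports the symbol that pynvml 11.5.0 tries to look up:

import ctypes

# Load the driver's NVML library and probe for both entry points.
# Accessing a missing symbol on a CDLL raises AttributeError, so hasattr
# returns False when the driver does not export it.
lib = ctypes.CDLL("libnvidia-ml.so.1")
print("nvmlDeviceGetMemoryInfo   :", hasattr(lib, "nvmlDeviceGetMemoryInfo"))
print("nvmlDeviceGetMemoryInfo_v2:", hasattr(lib, "nvmlDeviceGetMemoryInfo_v2"))

If the _v2 line prints False, the failure above is coming from the driver, not from TensorRT-LLM itself.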

jdemouth-nvidia commented 10 months ago

What system are you running on? Which OS?

yz-tang commented 9 months ago

@jdemouth-nvidia I encountered the same problem. I am using the image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3. I'm running on x86 Ubuntu.

nullxjx commented 9 months ago

@jdemouth-nvidia I encountered the same problem too. I am using the image nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3.

My system info:

# cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

My GPU info:

Wed Jan 24 15:08:06 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:00:08.0 Off |                    0 |
|  0%   48C    P0    61W / 150W |   3926MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          On   | 00000000:00:09.0 Off |                    0 |
|  0%   47C    P0    62W / 150W |   3254MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

nullxjx commented 9 months ago

nvmlDeviceGetMemoryInfo_v2

@yz-tang I fixed this by downgrading pynvml from 11.5.0 to 11.4.0. My tensorrt_llm version is 0.8.0.dev2024011601. Using pynvml 11.4.0 may produce the warning 'Found pynvml==11.4.0. Please use pynvml>=11.5.0 to get accurate memory usage'; just ignore it.
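
A quick sketch of what the downgrade relies on, assuming pynvml==11.4.0 is installed (pip install "pynvml==11.4.0"):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# pynvml 11.4.0 resolves the original nvmlDeviceGetMemoryInfo entry point,
# which older 470.x drivers still export, so this call succeeds where the
# 11.5.0 lookup of nvmlDeviceGetMemoryInfo_v2 raised NVMLError_FunctionNotFound.
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("total:", mem.total, "free:", mem.free, "used:", mem.used)
pynvml.nvmlShutdown()

If this prints memory numbers, the TensorRT-LLM profiler's device_memory_info call should work as well.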

yjjiang11 commented 9 months ago

I got the same error while using NV driver 470.199.02. NV driver 535.54.03 works.
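
Based only on the driver versions reported in this thread (470.x failing, 535.54.03 working), a hedged pre-flight check before building could look like this sketch:

import pynvml

pynvml.nvmlInit()
version = pynvml.nvmlSystemGetDriverVersion()
if isinstance(version, bytes):  # older pynvml returns bytes, newer returns str
    version = version.decode()
print("NVIDIA driver:", version)
# 535.x is reported working above and 470.x failing; the exact minimum driver
# that exports nvmlDeviceGetMemoryInfo_v2 is not confirmed in this thread.
if int(version.split(".")[0]) < 535:
    print("Older driver detected; consider upgrading the driver or pinning pynvml==11.4.0.")
pynvml.nvmlShutdown()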