Open · Lenan22 opened 11 months ago
What system are you running on? Which OS?
@jdemouth-nvidia I encountered the same problem. I am using the image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3, running on x86 Ubuntu.
@jdemouth-nvidia I encountered the same problem too. I am using the image nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3.
My system info:
```
# cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
```
My GPU info:

```
Wed Jan 24 15:08:06 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:00:08.0 Off |                    0 |
|  0%   48C    P0    61W / 150W |   3926MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          On   | 00000000:00:09.0 Off |                    0 |
|  0%   47C    P0    62W / 150W |   3254MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
`nvmlDeviceGetMemoryInfo_v2`
@yz-tang I fixed this by downgrading pynvml from 11.5.0 to 11.4.0. My tensorrt_llm version is 0.8.0.dev2024011601. With pynvml 11.4.0 you may see the warning 'Found pynvml==11.4.0. Please use pynvml>=11.5.0 to get accurate memory usage'; just ignore it.
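For anyone else hitting this, here is a minimal sanity check after pinning the package (e.g. `pip install pynvml==11.4.0`). It is only a sketch: GPU index 0 and the printed messages are illustrative, not part of TensorRT-LLM.

```python
# Sketch: confirm that the installed pynvml can query device memory on this driver.
# pynvml>=11.5.0 resolves the nvmlDeviceGetMemoryInfo_v2 symbol, which 470.x
# drivers do not export; pynvml 11.4.0 uses the older nvmlDeviceGetMemoryInfo.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust for your setup
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"pynvml {pynvml.__version__}: used {info.used} / total {info.total} bytes")
except pynvml.NVMLError as err:
    # With pynvml 11.5.0 on a 470.x driver this is NVMLError_FunctionNotFound.
    print(f"NVML memory query failed: {err}")
finally:
    pynvml.nvmlShutdown()
```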
I got the same error with NVIDIA driver 470.199.02; driver 535.54.03 works.
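If you are unsure which driver version the container actually sees, a quick check via NVML (standard pynvml calls, nothing TensorRT-LLM specific):

```python
# Sketch: print the driver version NVML reports from inside the container.
import pynvml

pynvml.nvmlInit()
try:
    version = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(version, bytes):  # older pynvml returns bytes, newer returns str
        version = version.decode()
    print(f"NVIDIA driver version: {version}")
finally:
    pynvml.nvmlShutdown()
```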
```
python build.py --model_dir ./bloom/560M/ --dtype float16 --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```
```
Open MPI's OFI driver detected multiple equidistant NICs from the current process, but had insufficient information to ensure MPI processes fairly pick a NIC for use. This may negatively impact performance. A more modern PMIx server is necessary to resolve this issue.

[12/14/2023-14:28:39] [TRT-LLM] [I] Serially build TensorRT engines.
[12/14/2023-14:28:39] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 166, GPU 649 (MiB)
[12/14/2023-14:28:41] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +482, GPU +80, now: CPU 784, GPU 729 (MiB)
[12/14/2023-14:28:41] [TRT-LLM] [W] Invalid timing cache, using freshly created one
Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.10/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMemoryInfo_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/bn/xiaxin/wulenan/workspace/TensorRT-LLM/examples/bloom/build.py", line 556, in <module>
    build(0, args)
  File "/mnt/bn/xiaxin/wulenan/workspace/TensorRT-LLM/examples/bloom/build.py", line 524, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/mnt/bn/xiaxin/wulenan/workspace/TensorRT-LLM/examples/bloom/build.py", line 337, in build_rank_engine
    profiler.print_memory_usage(f'Rank {rank} Engine build starts')
  File "/home/tiger/.local/lib/python3.10/site-packages/tensorrt_llm/profiler.py", line 197, in print_memory_usage
    alloc_device_mem, _, _ = device_memory_info(device=device)
  File "/home/tiger/.local/lib/python3.10/site-packages/tensorrt_llm/profiler.py", line 148, in device_memory_info
    mem_info = _device_get_memory_info_fn(handle)
  File "/home/tiger/.local/lib/python3.10/site-packages/pynvml/nvml.py", line 2438, in nvmlDeviceGetMemoryInfo
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
  File "/home/tiger/.local/lib/python3.10/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
```
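A quick way to confirm that the driver-side library is the problem, independent of pynvml: probe the symbols directly with ctypes. The library name is the one from the traceback; the printed wording is mine.

```python
# Sketch: check whether libnvidia-ml.so.1 exports the _v2 symbol that
# pynvml>=11.5.0 resolves at query time. getattr on a CDLL raises
# AttributeError for symbols the loaded driver library does not define.
import ctypes

lib = ctypes.CDLL("libnvidia-ml.so.1")
for sym in ("nvmlDeviceGetMemoryInfo", "nvmlDeviceGetMemoryInfo_v2"):
    try:
        getattr(lib, sym)
        print(f"{sym}: present")
    except AttributeError:
        print(f"{sym}: missing from this driver's NVML")
```

If the `_v2` symbol is missing, either downgrade pynvml to 11.4.0 as noted above or move to a newer driver (e.g. the 535.x branch mentioned in this thread).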