Open jaiswackhv opened 3 months ago
Docker container where the MLPerf test is running: nvcr.io/nvidia/mlperf/mlperf-inference mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public
Hi, have you solved this problem? I've come across a similar problem.
Could you repost your issue on https://github.com/NVIDIA/TensorRT-LLM/issues, please?
This is probably caused by a mismatch between the environment (hardware and software) in which the engine files were generated and the one in which they are executed. If you encounter this, try deleting the engine files and regenerating them in the current environment to verify whether that is the case.
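To make the "same environment" check above less of a guess, here is a minimal sketch that records the GPU model and driver version next to the engine files at build time and compares them before a run. It only wraps the standard `nvidia-smi --query-gpu` query. The file name `build_env.json` and all function names are hypothetical; TensorRT-LLM does not write such a file itself, so you would have to call `save_fingerprint` yourself right after generating the engines. It also does not cover every possible difference (for example the TensorRT version inside the container), so treat a match as a hint, not proof.

```python
import json
import shutil
import subprocess
from pathlib import Path

FINGERPRINT_FILE = "build_env.json"  # hypothetical name, not a TensorRT-LLM convention


def current_fingerprint():
    """Return one 'name, driver_version' string per GPU, or None without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted(line.strip() for line in out.splitlines() if line.strip())


def save_fingerprint(engine_dir, fingerprint):
    """Store the build-time fingerprint next to the engine files."""
    Path(engine_dir, FINGERPRINT_FILE).write_text(json.dumps(fingerprint))


def load_fingerprint(engine_dir):
    """Read the stored fingerprint, or None if none was recorded."""
    path = Path(engine_dir, FINGERPRINT_FILE)
    return json.loads(path.read_text()) if path.exists() else None


def engine_is_stale(build_fp, run_fp):
    """True if no fingerprint was recorded or it differs from the current one."""
    return build_fp is None or build_fp != run_fp
```

If `engine_is_stale(load_fingerprint(engine_dir), current_fingerprint())` returns `True`, delete the engine files and regenerate them in the current environment, as suggested above.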
Whenever I run MLPerf inference for Llama2-70b in a Docker container, I get the error below. I deleted the container image and ran it again, but I still get the same error. The host server runs RHEL 9.2 with 8 x H100 80GB GPUs and high-performance WekaFS file storage mounted with NVIDIA GDS.
[TensorRT-LLM][ERROR] 1: [runner.cpp::executeMyelinGraph::682] Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.
These RPMs are installed on the host server:

cm-nvidia-container-toolkit-1.14.2-100070_cm10.0_6ea8822f81.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-550.90.07-2.el9.x86_64
nvidia-driver-NVML-550.90.07-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-550.90.07-1.el9.x86_64
nvidia-driver-libs-550.90.07-1.el9.x86_64
nvidia-persistenced-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
dnf-plugin-nvidia-2.2-1.el9.noarch
kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64
nvidia-kmod-common-550.90.07-1.el9.noarch
nvidia-driver-550.90.07-1.el9.x86_64
nvidia-modprobe-550.90.07-2.el9.x86_64
nvidia-settings-550.90.07-2.el9.x86_64
nvidia-xconfig-550.90.07-2.el9.x86_64
nvidia-driver-devel-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-devel-550.90.07-2.el9.x86_64
nvidia-fabric-manager-550.90.07-1.x86_64
nvidia-gds-12-5-12.5.1-1.x86_64
nvidia-gds-12.5.1-1.x86_64
nvidia-fs-dkms-2.22.3-1.x86_64
nvidia-fs-2.22.3-1.x86_64

[root@hxxxx ~]# rpm -qa |grep -i cuda
cuda-dcgm-libs-3.3.6.1-100101_cm10.0_463140abaf.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
cuda-toolkit-config-common-12.5.82-1.noarch
cuda-toolkit-12-config-common-12.5.82-1.noarch
cuda-toolkit-12-5-config-common-12.5.82-1.noarch
RHEL9.2 kernel: 5.14.0-284.30.1.el9_2.x86_64