NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error #4080

Open · jaiswackhv opened this issue 1 month ago

jaiswackhv commented 1 month ago

Whenever I run MLPerf inference for Llama2-70b in a Docker container, I get the error below. I deleted the container image and ran it again, but the error persists. The host server is running RHEL 9.2 with 8 x H100 80GB GPUs and high-performance wekafs file storage mounted with NVIDIA GDS.

```
[TensorRT-LLM][ERROR] 1: [runner.cpp::executeMyelinGraph::682] Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.
```

These RPMs are installed on the host server:

```
cm-nvidia-container-toolkit-1.14.2-100070_cm10.0_6ea8822f81.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-550.90.07-2.el9.x86_64
nvidia-driver-NVML-550.90.07-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-550.90.07-1.el9.x86_64
nvidia-driver-libs-550.90.07-1.el9.x86_64
nvidia-persistenced-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
dnf-plugin-nvidia-2.2-1.el9.noarch
kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64
nvidia-kmod-common-550.90.07-1.el9.noarch
nvidia-driver-550.90.07-1.el9.x86_64
nvidia-modprobe-550.90.07-2.el9.x86_64
nvidia-settings-550.90.07-2.el9.x86_64
nvidia-xconfig-550.90.07-2.el9.x86_64
nvidia-driver-devel-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-devel-550.90.07-2.el9.x86_64
nvidia-fabric-manager-550.90.07-1.x86_64
nvidia-gds-12-5-12.5.1-1.x86_64
nvidia-gds-12.5.1-1.x86_64
nvidia-fs-dkms-2.22.3-1.x86_64
nvidia-fs-2.22.3-1.x86_64
```

```
[root@hxxxx ~]# rpm -qa |grep -i cuda
cuda-dcgm-libs-3.3.6.1-100101_cm10.0_463140abaf.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
cuda-toolkit-config-common-12.5.82-1.noarch
cuda-toolkit-12-config-common-12.5.82-1.noarch
cuda-toolkit-12-5-config-common-12.5.82-1.noarch
```

RHEL9.2 kernel: 5.14.0-284.30.1.el9_2.x86_64
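Not part of the thread, but one routine sanity check for a "Platform (Cuda) error" is confirming that the host driver meets the minimum the container's CUDA runtime requires. A minimal sketch, assuming the driver version reported in the RPM list above and a minimum of 535.54.03 for the container's CUDA 12.2 (per NVIDIA's CUDA/driver compatibility tables); on a live host the `nvidia-smi` query shown in the comment is the usual source:

```shell
#!/bin/sh
# Hedged diagnostic sketch: compare the host driver version against the
# minimum Linux driver required by the CUDA runtime inside the container.
min_required="535.54.03"   # assumed minimum for CUDA 12.2 on Linux
host_driver="550.90.07"    # taken from the RPM list above; on a live host:
# host_driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)

# sort -V orders version strings; if the minimum sorts first, the host is new enough.
if [ "$(printf '%s\n%s\n' "$min_required" "$host_driver" | sort -V | head -n1)" = "$min_required" ]; then
  echo "driver OK ($host_driver >= $min_required)"
else
  echo "driver too old ($host_driver < $min_required)"
fi
```

If this check passes (as it should with 550.90.07), the mismatch is more likely elsewhere, e.g. in the container toolkit or GDS stack, rather than a plain driver/runtime version gap.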

jaiswackhv commented 1 month ago

Docker container where the mlperf test is running: nvcr.io/nvidia/mlperf/mlperf-inference mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public

lishicheng1996 commented 1 week ago

> Docker container where the mlperf test is running: nvcr.io/nvidia/mlperf/mlperf-inference mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public

Hi, have you solved this problem? I've run into a similar one.

moraxu commented 2 days ago

Could you repost your issue on https://github.com/NVIDIA/TensorRT-LLM/issues, please?