NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error) #4080

Open jaiswackhv opened 3 months ago

jaiswackhv commented 3 months ago

Whenever I run MLPerf Inference for Llama2-70b in a Docker container, I get the error below. I deleted the container image and ran it again, but the error persists. The host server runs RHEL 9.2 with 8 x H100 80GB GPUs and high-performance WekaFS file storage mounted with NVIDIA GDS.

[TensorRT-LLM][ERROR] 1: [runner.cpp::executeMyelinGraph::682] Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.

These RPMs are installed on the host server:

cm-nvidia-container-toolkit-1.14.2-100070_cm10.0_6ea8822f81.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-550.90.07-2.el9.x86_64
nvidia-driver-NVML-550.90.07-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-550.90.07-1.el9.x86_64
nvidia-driver-libs-550.90.07-1.el9.x86_64
nvidia-persistenced-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
dnf-plugin-nvidia-2.2-1.el9.noarch
kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64
nvidia-kmod-common-550.90.07-1.el9.noarch
nvidia-driver-550.90.07-1.el9.x86_64
nvidia-modprobe-550.90.07-2.el9.x86_64
nvidia-settings-550.90.07-2.el9.x86_64
nvidia-xconfig-550.90.07-2.el9.x86_64
nvidia-driver-devel-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-devel-550.90.07-2.el9.x86_64
nvidia-fabric-manager-550.90.07-1.x86_64
nvidia-gds-12-5-12.5.1-1.x86_64
nvidia-gds-12.5.1-1.x86_64
nvidia-fs-dkms-2.22.3-1.x86_64
nvidia-fs-2.22.3-1.x86_64

[root@hxxxx ~]# rpm -qa | grep -i cuda
cuda-dcgm-libs-3.3.6.1-100101_cm10.0_463140abaf.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
cuda-toolkit-config-common-12.5.82-1.noarch
cuda-toolkit-12-config-common-12.5.82-1.noarch
cuda-toolkit-12-5-config-common-12.5.82-1.noarch

RHEL9.2 kernel: 5.14.0-284.30.1.el9_2.x86_64

jaiswackhv commented 3 months ago

Docker container where mlperf test is running. nvcr.io/nvidia/mlperf/mlperf-inference mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public

lishicheng1996 commented 2 months ago

Docker container where mlperf test is running. nvcr.io/nvidia/mlperf/mlperf-inference mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public

Hi, have you solved this problem? I have come across a similar problem. (screenshot attached)

moraxu commented 2 months ago

Could you repost your issue on https://github.com/NVIDIA/TensorRT-LLM/issues, please?

cloudhan commented 2 days ago

This is probably caused by the engine file being generated in a different environment (hardware and software) from the one it is executed in. Anyone hitting this should try deleting the engine files and regenerating them in the current environment to verify whether that is the case.
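The build-vs-run mismatch described above can be made explicit with a small check. This is only an illustrative sketch, not part of TensorRT or the MLPerf harness: the idea is to record an environment fingerprint (GPU model, driver, CUDA, and TensorRT versions) next to each serialized engine plan at build time, and compare it before attempting to load the plan. The function and field names here are hypothetical.

```python
# Hypothetical helper (not a TensorRT API): capture the environment fields
# that must match between engine-build time and engine-run time.
def env_fingerprint(gpu_name, driver_version, cuda_version, trt_version):
    """Collect the fields an engine plan implicitly depends on."""
    return {
        "gpu": gpu_name,
        "driver": driver_version,
        "cuda": cuda_version,
        "tensorrt": trt_version,
    }


def engine_mismatches(build_env, run_env):
    """Return the fields that differ; an empty list suggests the plan
    may be reused, a non-empty list means delete and regenerate it."""
    return [k for k in build_env if build_env[k] != run_env[k]]


if __name__ == "__main__":
    # Example values loosely based on this thread: host driver 550.90.07
    # with CUDA 12.5 RPMs, while the container image is built for CUDA 12.2.
    built_in = env_fingerprint("H100 80GB", "550.90.07", "12.5", "10.x")
    running_in = env_fingerprint("H100 80GB", "550.90.07", "12.2", "10.x")
    print(engine_mismatches(built_in, running_in))
```

In practice the fingerprint would be filled from `nvidia-smi` and the container's toolkit versions; a non-empty result is the cue to delete the stale engine files and rebuild them inside the current container, as suggested above.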