NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Deploy DeBERTa to Triton Inference Server #4202

Open nbroad1881 opened 3 weeks ago

nbroad1881 commented 3 weeks ago

I followed the steps in the DeBERTa guide to create the modified ONNX file with the plugin. When I try to use that model with Triton Inference Server, it fails with:

Internal: onnx runtime error 9: Could not find an implementation for DisentangledAttention_TRT(1) node with name 'onnx_graphsurgeon_node_0'
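For context, the guide's modification boils down to replacing the disentangled-attention subgraph with a single plugin node via onnx_graphsurgeon. Here is a minimal sketch of that pattern on a toy graph (tensor shapes and the `factor`/`span` attribute values are illustrative, not the guide's exact ones; the `onnx_graphsurgeon_node_0` in the error is the auto-generated name of a node inserted this way):

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

# Stand-ins for the three attention tensors that feed the disentangled
# attention computation in the real DeBERTa graph (shapes illustrative).
c2c = gs.Variable("c2c_attention", dtype=np.float32, shape=(12, 512, 512))
c2p = gs.Variable("c2p_attention", dtype=np.float32, shape=(12, 512, 512))
p2c = gs.Variable("p2c_attention", dtype=np.float32, shape=(12, 512, 512))
out = gs.Variable("attention_out", dtype=np.float32, shape=(12, 512, 512))

# The TRT-specific plugin node; only TensorRT ships a kernel for this op.
node = gs.Node(
    op="DisentangledAttention_TRT",
    inputs=[c2c, c2p, p2c],
    outputs=[out],
    attrs={"factor": 0.125, "span": 512},  # scale factor and relative-distance span
)

graph = gs.Graph(nodes=[node], inputs=[c2c, c2p, p2c], outputs=[out], opset=13)
onnx.save(gs.export_onnx(graph), "disentangled_attention_toy.onnx")
```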

Is there a way to get this to work in Triton? I'm using Triton 24.09.

I can confirm the ONNX model runs fine with the onnxruntime package in a Python script, and it also works in Triton if I don't use the plugin.

A slightly separate problem that might deserve its own issue: the model's fp16 output is garbage, even with the layernorms kept in fp32.
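For reference, this is roughly how I'm pinning the layernorms to fp32 during engine build (a sketch with the TensorRT Python API; selecting layers by name is an assumption about how they're named after parsing, and dynamic-shape profiles are omitted):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
# Register TRT plugins so the parser can resolve DisentangledAttention_TRT.
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

def build_fp16_engine(onnx_path: str) -> trt.IHostMemory:
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    # Without OBEY_PRECISION_CONSTRAINTS, per-layer precision is only a hint.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if "LayerNorm" in layer.name:  # name matching is model-specific
            layer.precision = trt.float32
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.float32)

    return builder.build_serialized_network(network, config)
```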

yuanyao-nv commented 3 weeks ago

The onnxruntime error indicates that your ONNX model contains a TRT-specific plugin node, which onnxruntime doesn't recognize. To serve that model, the plugin node has to be executed by TensorRT one way or another: either build a TensorRT engine offline and serve it with Triton's tensorrt_plan backend, or hand the node to the TensorRT execution provider inside onnxruntime.
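A minimal sketch of the second option in a plain Python script (the model filename and plugin-library path are assumptions; trt_extra_plugin_lib_paths is the ORT TensorRT EP option for loading extra plugin libraries):

```python
import onnxruntime as ort

# Route the graph through the TensorRT EP so the DisentangledAttention_TRT
# node is executed by TensorRT; the CUDA EP is the fallback for other nodes.
session = ort.InferenceSession(
    "deberta_plugin.onnx",  # hypothetical filename
    providers=[
        ("TensorrtExecutionProvider", {
            # Path is an assumption; the plugin ships in libnvinfer_plugin.
            "trt_extra_plugin_lib_paths": "/usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so",
        }),
        "CUDAExecutionProvider",
    ],
)
```

Triton's onnxruntime backend can enable the same provider by setting the "tensorrt" GPU execution accelerator in the model's config.pbtxt.

As for the accuracy issue, please provide more details, such as the specific configs you used to run the model. Thanks!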