NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

UnsupportedOperatorError: ONNX export failed on an operator with unrecognized namespace flash_attn::_flash_attn_forward. If you are trying to export a custom operator, make sure you registered it with the right domain and version. #2369

Open scuizhibin opened 2 days ago

scuizhibin commented 2 days ago

System info: GPU: RTX 3090, GPU driver: 550.107.02, OS: Ubuntu 22.04

Using the multimodal examples, I ran: python build_visual_engine.py --model_type phi-3-vision --model_path tmp/hf_models/${MODEL_NAME}

Error: UnsupportedOperatorError: ONNX export failed on an operator with unrecognized namespace flash_attn::_flash_attn_forward. If you are trying to export a custom operator, make sure you registered it with the right domain and version.
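For context, this class of error is not specific to Phi-3: torch.onnx raises UnsupportedOperatorError whenever the traced graph contains a custom-namespace op that has no registered ONNX symbolic, which is exactly what flash_attn::_flash_attn_forward is. A minimal self-contained sketch of the failure mode (the demo_ns namespace and my_forward op are hypothetical stand-ins for the flash-attn kernel):

```python
import torch

# Register a custom op in its own namespace with a runnable kernel but no
# ONNX symbolic, mirroring how flash_attn::_flash_attn_forward looks to the
# exporter. "demo_ns" and "my_forward" are hypothetical names.
lib = torch.library.Library("demo_ns", "DEF")
lib.define("my_forward(Tensor x) -> Tensor")

@torch.library.impl(lib, "my_forward", "CompositeExplicitAutograd")
def my_forward(x):
    return x * 2

class Model(torch.nn.Module):
    def forward(self, x):
        # The custom op is recorded as an opaque node during tracing.
        return torch.ops.demo_ns.my_forward(x)

# Raises UnsupportedOperatorError: the exporter has no rule for translating
# demo_ns::my_forward into ONNX ops.
torch.onnx.export(Model(), torch.randn(2, 3), "model.onnx")
```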

symphonylyh commented 1 day ago

Hi @scuizhibin, yes, there are currently some compatibility issues between Hugging Face Transformers and torch.onnx, due to recent changes in HF 4.45. The solution might be to manually switch the attention implementation to eager; see https://github.com/huggingface/transformers/blob/v4.45.1/src/transformers/modeling_utils.py#L3105-L3106
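A minimal sketch of that workaround, assuming the model is loaded through transformers' from_pretrained (the model path below is illustrative, and dtype/device details will depend on your setup):

```python
import torch
from transformers import AutoModelForCausalLM

# Request the eager attention implementation so the forward pass stays on
# standard ATen ops that torch.onnx can export, instead of dispatching to
# the flash_attn custom kernel. Model path is illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    attn_implementation="eager",
)
```

If build_visual_engine.py constructs the model internally, the same effect may be achievable by editing its from_pretrained call to pass attn_implementation="eager", per the linked modeling_utils.py logic that selects the attention implementation from the config.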

We're investigating whether there is a workaround we can apply on the TRT-LLM side.