NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

cluster key option not working? #1807


tonylek commented 1 week ago

Hi, I tried the --cluster_key option with trtllm-build. I converted the model on an A100-80GB-SXM, passing the L4 cluster key, then tried to deploy the engine on an L4, and tritonserver failed at startup.

When I instead deploy on an A100-40GB-SXM after converting with that GPU's cluster key, it does start up, but I get: [TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.

The cluster_key I converted with does appear in my config.json.
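For reference, a minimal sketch of how I checked that (the engine directory path is illustrative, adjust to your setup):

```bash
# Sketch: inspect which cluster_key is recorded in the engine's
# config.json. The ./engine_dir path is illustrative.
grep -o '"cluster_key"[^,}]*' ./engine_dir/config.json
```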

yuxianq commented 1 week ago

--cluster_key is used together with --auto_parallel N; it provides cluster info to determine auto parallel's sharding strategy. In other words, --cluster_key does not let you cross-build an engine for a different type of GPU, which is unsupported in TRT-LLM. Since you are not using --auto_parallel N, --cluster_key has no effect on the build process. I will update the help message to avoid confusion when users do not use --auto_parallel N.
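For reference, a sketch of the intended pairing; the checkpoint/output paths and world size here are illustrative, and the key string should be one of the values trtllm-build accepts:

```bash
# Sketch of the intended usage: --cluster_key only informs the sharding
# strategy chosen by --auto_parallel; it does not retarget the engine to
# another GPU type. Paths and the world size (2) are illustrative.
trtllm-build --checkpoint_dir ./ckpt \
             --output_dir ./engine \
             --auto_parallel 2 \
             --cluster_key A100-SXM-80GB
```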

tonylek commented 1 week ago

Any chance it will be supported in the future?

yuxianq commented 1 week ago

The cross-build feature is not planned. Please build and deploy on the same type of GPU.
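A minimal sketch of that recommendation, with illustrative paths: run the build on the target GPU itself (here, the L4 host) and serve the engine from the same GPU model:

```bash
# Sketch: build on the GPU model that will serve the engine, so the plan
# file never crosses device types. Paths are illustrative.
trtllm-build --checkpoint_dir ./ckpt_l4 --output_dir ./engine_l4
```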

nv-guomingz commented 3 days ago

@tonylek could we close this ticket now?