NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Deployment of Pruned Models #1600

Open qianjyM opened 6 months ago

qianjyM commented 6 months ago

Hi there,

I'd like to ask how to deploy a pruned model with TensorRT-LLM. Since pruning leaves the QKV dimensions different in each layer, the model is stored with torch.save rather than save_pretrained, so I'm not sure how to load it into TensorRT-LLM. Could you please give me some tips or advice? For concreteness, a toy version of the situation looks like the sketch below (module names and sizes are made up for illustration): each layer keeps its own pruned QKV width, so the tensor shapes can no longer be derived from a single hidden_size in a config file.
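```python
import torch
import torch.nn as nn

# Toy stand-in for the pruned model: every layer keeps its own pruned
# QKV projection width, so the tensor shapes in the state dict can no
# longer be derived from one hidden_size. All names and sizes here are
# made up for illustration.
class PrunedAttention(nn.Module):
    def __init__(self, hidden_size: int, qkv_dim: int):
        super().__init__()
        self.qkv = nn.Linear(hidden_size, 3 * qkv_dim)  # pruned width
        self.out = nn.Linear(qkv_dim, hidden_size)

class PrunedModel(nn.Module):
    def __init__(self, hidden_size: int, qkv_dims: list[int]):
        super().__init__()
        # One (possibly different) QKV width per layer after pruning.
        self.layers = nn.ModuleList(
            PrunedAttention(hidden_size, d) for d in qkv_dims
        )

model = PrunedModel(hidden_size=1024, qkv_dims=[896, 768, 640, 512])

# The irregular per-layer shapes are why the checkpoint is written with
# torch.save instead of save_pretrained.
torch.save(model.state_dict(), "pruned_model.pt")
```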

Thanks!

byshiue commented 6 months ago

Using a different dimension in each layer is not supported. If you want to run your model, you could implement a new model based on an existing one and set a different shape for each layer. That change may also affect other parts, such as the checkpoint converter.
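As a rough illustration of that suggestion (all class and field names below are hypothetical, not the actual TensorRT-LLM API), the model definition would read a per-layer list of QKV widths instead of one global value, and the checkpoint converter would have to match each pruned weight against its own layer's shape; a real port would adapt one of the model definitions under tensorrt_llm/models in the same way.

```python
import torch
import torch.nn as nn

# Hypothetical sketch, not the real TensorRT-LLM API: the per-layer QKV
# widths come from a list in the config instead of one hidden_size.
class PerLayerConfig:
    def __init__(self, hidden_size: int, qkv_dims: list[int]):
        self.hidden_size = hidden_size
        self.qkv_dims = qkv_dims  # one (possibly different) width per layer

class PerLayerAttention(nn.Module):
    def __init__(self, hidden_size: int, qkv_dim: int):
        super().__init__()
        self.qkv = nn.Linear(hidden_size, 3 * qkv_dim)  # pruned width
        self.out = nn.Linear(qkv_dim, hidden_size)

class PerLayerModel(nn.Module):
    def __init__(self, config: PerLayerConfig):
        super().__init__()
        self.layers = nn.ModuleList(
            PerLayerAttention(config.hidden_size, d)
            for d in config.qkv_dims
        )

# Toy "checkpoint converter": because the shapes differ per layer, each
# pruned weight must be matched against its own layer's dimensions
# rather than reshaped from one global value. The checkpoint here is
# the one written in the sketch above, with matching qkv_dims.
config = PerLayerConfig(hidden_size=1024, qkv_dims=[896, 768, 640, 512])
model = PerLayerModel(config)
state = torch.load("pruned_model.pt")
model.load_state_dict(state)
```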

nv-guomingz commented 1 week ago

Hi @qianjyM, do you still have any further issues or questions? If not, we'll close this issue soon.