NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Enhanced Efficiency in TRT-LLM through Caching of Engines #976

Open Lokiiiiii opened 9 months ago

Lokiiiiii commented 9 months ago

Introduction

I'm proposing a caching strategy for TRT-LLM to streamline the process of re-compiling engines after fine-tuning. This strategy aims to significantly reduce build times and improve overall efficiency. I invite the community to validate and provide feedback on the following approach. I will contribute a PR based on the feedback I get.

Requirements for Improved Caching

- Develop a robust mechanism for hashing the model before the Engine build. This hash determines whether a pre-compiled Engine can be reused.
- Revise the build process so that it first checks for, and refits, cached Engines before falling back to a full build.
- Introduce a schema for the cache that distinguishes Engines compiled with different TRT versions and build configurations (a sketch of one possible layout follows below).
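To make the cache-schema requirement concrete, here is a rough sketch of one possible layout, keyed by TRT version, a hash of the serialized network, and a hash of the build configuration. The helper names and the assumption that the BuilderConfig arguments are JSON-serializable are placeholders for discussion, not a concrete API proposal; the network hash itself comes from the hashing step described under "Proposed Solution".

```python
# Rough sketch of a possible cache layout (names are placeholders):
#   <cache_root>/<TRT version>/<Hash(Network)>/<Hash(BuilderConfig)>/engine.plan
# `builder_config_args` is assumed to be a JSON-serializable dict of the
# arguments used to create the BuilderConfig; `network_hash` is produced by
# a separate network-hashing step.
import hashlib
import json
import os

import tensorrt as trt


def config_hash(builder_config_args: dict) -> str:
    """Hash the build configuration so differently-configured engines never collide."""
    blob = json.dumps(builder_config_args, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def engine_cache_path(cache_root: str, network_hash: str, builder_config_args: dict) -> str:
    """Return the on-disk location for an engine with this version/network/config combination."""
    return os.path.join(
        cache_root,
        trt.__version__,              # engines are not portable across TRT versions
        network_hash,                 # reflects architecture, dtypes, and plugins
        config_hash(builder_config_args),
        "engine.plan",
    )
```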

Proposed Solution

  1. Hashing the TRT network:
    • Serialize the TRT network before the Engine build. Given TRT's limited support for serializing a network directly, leverage the TRT-to-ONNX converter already used in our build scripts. ONNX serialization is consistent enough to produce a unique hash of the serialized model that reflects the architecture, dtypes, and plugins.
  2. Build script modifications:
    • Update the build scripts to call a refit_if_cached() function between the network-construction and engine-building phases (see the sketch after this list).
  3. Versioning and hashing:
    • Add a helper function inside TRT-LLM that automatically extracts the TRT version. Serialize and hash the Builder class's arguments so the cache can be laid out logically, e.g. TRT_version/Hash(Network)/Hash(BuilderConfig).
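As a starting point for discussion, here is a minimal sketch of what the refit_if_cached() step from point 2 could look like, using the plain TensorRT Refitter API (set_named_weights requires a reasonably recent TRT). The cache_path and the named_weights dict (tensor name mapped to a numpy array) are assumptions for illustration; the actual integration point inside tensorrt_llm.builder would need to be worked out in the PR.

```python
# Minimal sketch of refit_if_cached(): try to reuse a cached engine by
# refitting its weights; return None to fall back to a full build.
# Assumes the cached engine was built with the REFIT flag set
# (config.set_flag(trt.BuilderFlag.REFIT)) and that `named_weights`
# maps weight names to numpy arrays.
import os

import tensorrt as trt


def refit_if_cached(cache_path: str, named_weights: dict, logger: trt.ILogger):
    if not os.path.exists(cache_path):
        return None

    with open(cache_path, "rb") as f:
        plan = f.read()

    engine = trt.Runtime(logger).deserialize_cuda_engine(plan)
    if engine is None:
        return None  # e.g. the plan was produced by an incompatible TRT build

    refitter = trt.Refitter(engine, logger)
    for name, array in named_weights.items():
        refitter.set_named_weights(name, trt.Weights(array))

    if not refitter.refit_cuda_engine():
        return None  # missing or mismatched weights: rebuild from scratch instead
    return engine
```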

Anticipated Benefits

  1. Efficiency in Tensor Parallel Inference: By fully building only the first shard and refitting the engine for subsequent ones, serial builds can approach the speed of parallel builds, making it practical to build on less powerful machines (see the sketch after this list).
  2. Time-Saving in Re-compilations: After fine-tuning, this approach can skip the build process entirely by refitting the cached Engine, yielding a substantial time reduction (up to 40% faster in cases like llama2-70B) compared to rebuilding from scratch.
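To illustrate benefit 1, a rough sketch of building only the first tensor-parallel shard and refitting clones of it for the remaining ranks. build_rank0_engine and shard_weights_for_rank are hypothetical callables standing in for the existing per-rank build and weight-sharding logic, not real TRT-LLM APIs.

```python
# Sketch of benefit (1): do a full build only for TP rank 0, then clone and
# refit that engine for the remaining ranks instead of rebuilding each one.
# `build_rank0_engine` is a placeholder callable returning a refittable
# ICudaEngine; `shard_weights_for_rank(rank)` returns a dict of
# weight name -> numpy array for that rank.
import tensorrt as trt


def build_all_ranks(tp_size, build_rank0_engine, shard_weights_for_rank, logger):
    rank0_engine = build_rank0_engine()          # full (slow) build, REFIT flag set
    engines = [rank0_engine]

    runtime = trt.Runtime(logger)
    plan = rank0_engine.serialize()
    for rank in range(1, tp_size):
        engine = runtime.deserialize_cuda_engine(plan)   # cheap clone of rank 0
        refitter = trt.Refitter(engine, logger)
        for name, array in shard_weights_for_rank(rank).items():
            refitter.set_named_weights(name, trt.Weights(array))
        if not refitter.refit_cuda_engine():
            raise RuntimeError(f"refit failed for rank {rank}")
        engines.append(engine)
    return engines
```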

Points for Community Discussion

  1. Dependency on ONNX: The current TRT → ONNX → HASH sequence introduces a dependency on ONNX, which might not support newly implemented TRT layers. Is there a way to bypass ONNX and serialize a TRT network directly to a string? Does TRT provide any serialization helpers? When iterating over layer inputs/outputs, does TRT guarantee a consistent order? (A layer-walking sketch follows after this list.)
  2. Robustness of Cache Structure: How robust and consistent is the proposed cache structure in theory? Are there additional inputs, options, or environment variables consumed by the TRT Builder class that should be hashed for consistency? Could the version of TRT-LLM or other dependencies influence the Engine build outcome even when Hash(TRT Network) and Hash(tensorrt_llm.builder.BuilderConfig) remain constant?
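For reference, this is roughly what a direct, ONNX-free hash could look like if layer and tensor iteration order turns out to be deterministic. Whether that ordering is actually guaranteed is exactly the open question in point 1, so treat this as an assumption to validate rather than a working solution.

```python
# Sketch of discussion point (1): hash a TRT INetworkDefinition directly by
# walking its layers in index order, with no ONNX round-trip. The hash covers
# layer names, types, precisions, and the name/shape/dtype of every input and
# output tensor. Determinism of this traversal is an unverified assumption.
import hashlib

import tensorrt as trt


def hash_trt_network(network: trt.INetworkDefinition) -> str:
    h = hashlib.sha256()
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        h.update(f"{layer.name}|{layer.type}|{layer.precision}".encode())
        for j in range(layer.num_inputs):
            t = layer.get_input(j)
            if t is not None:  # optional inputs may be None
                h.update(f"in:{t.name}:{t.shape}:{t.dtype}".encode())
        for j in range(layer.num_outputs):
            t = layer.get_output(j)
            h.update(f"out:{t.name}:{t.shape}:{t.dtype}".encode())
    return h.hexdigest()[:16]
```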

I look forward to the community's insights and suggestions to refine and enhance this proposal.

jayachandrakalakutagar commented 9 months ago

Where can I get this code?

litaotju commented 7 months ago

Hi @Lokiiiiii

Thanks for the comments and the proposal. TRT-LLM evolves quickly, adding features and optimization techniques in almost every release, and the current focus is on providing the best performance. We may consider this feature and its compatibility implications in the future. In the meantime, you could try the model cache to see if it meets your needs. TRT-LLM is also investing in quicker compilation.

Thanks.

Tao