NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to refit model parameters in the Cpp runtime (Executor) or activate In-flight batching in the Python runtime (builder.Engine)? #2453

Open · TranSirius opened this issue 4 days ago

TranSirius commented 4 days ago

I am using TensorRT-LLM version 0.12.0 with the following setup:

  1. Build a TensorRT-LLM engine using the engine = build(model, build_config, return_build_config) function in builder.py, then save the engine to disk via engine.save().
  2. Instantiate a ModelRunnerCpp class via: decoder = ModelRunnerCpp.from_dir(engine_dir, rank=model_parallel_rank). (A sketch of this flow follows the list.)
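
For reference, a minimal sketch of this flow; model, build_config, engine_dir, and model_parallel_rank are assumed to be defined elsewhere, and the two-argument build() call here omits the third argument mentioned above:

    from tensorrt_llm import build
    from tensorrt_llm.runtime import ModelRunnerCpp

    # Step 1: build the engine in memory and persist it to disk.
    engine = build(model, build_config)
    engine.save(engine_dir)

    # Step 2: load the serialized engine into the C++ runtime wrapper.
    decoder = ModelRunnerCpp.from_dir(engine_dir, rank=model_parallel_rank)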

Now, the from_dir() function will create a C++ runtime via:

    executor = trtllm.Executor(
        engine_dir, 
        trtllm.ModelType.DECODER_ONLY,
        trtllm_config)

However, the Executor is not compatible with the refitter functionality from TensorRT. As far as I can tell, online refitting is only supported by the Python runtime, while in-flight batching (IFB) is only supported by the C++ runtime. I am writing this issue to confirm whether it is possible to enable both online refitting and IFB with the current version of TRT-LLM.

For further clarification, online refitting refers to updating the model parameters in the TRT-LLM engine by passing a weights dict in memory directly, without serializing the whole model and offloading the engine to disk. (Just in case I am using the wrong wording.)
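
In other words, the question is whether something like the following flow is possible. The refit method sketched here is NOT a real TensorRT-LLM API; it is purely an illustration of the ask:

    # Purely hypothetical sketch: Executor.refit() does NOT exist in TRT-LLM.
    executor = trtllm.Executor(
        engine_dir,
        trtllm.ModelType.DECODER_ONLY,
        trtllm_config)

    # Desired: push an updated weights dict straight into the loaded engine,
    # with in-flight batching still active and no disk round trip.
    executor.refit(updated_weights_dict)  # hypothetical method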

Many thanks for your help!

nekorobov commented 3 days ago

Hi @hello-11,

could you try updating the weights after building the engine, but before running the C++ runtime:

    trtllm-refit --checkpoint_dir <checkpoint_dir> --engine_dir <refittable_engine> --output_dir <output_new_engine>

Please refer to the docs for more info: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/sample_weight_stripping#engine-refitter

I'll close the issue, but feel free to reopen if you have more questions.

TranSirius commented 3 days ago

Hi @nekorobov,

Thanks for your quick response! I have found the trtllm-refit command; however, it does not work for my use case.

In my use case, the TRT-LLM engine serves as the actor for RLHF, so I need to update the parameters in the engine frequently. If I refit the engine with trtllm-refit before initializing the Executor, I first have to save the checkpoint to disk and then refit the engine from that on-disk checkpoint. Once all of this finishes, I have to re-initialize the Executor by calling its from_dir method.

In this scenario, inference takes about 50 seconds and one training step about 10 seconds, but saving the updated checkpoint, refitting via trtllm-refit, and reloading the Executor takes about 3 minutes, making it by far the most significant bottleneck in the whole pipeline (sketched below).
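
Schematically, each RLHF iteration currently looks like the sketch below; train_one_step, save_trtllm_checkpoint, batch_input_ids, and the directory variables are placeholders, and the timings are the rough numbers from above:

    import subprocess
    from tensorrt_llm.runtime import ModelRunnerCpp

    for step in range(num_rlhf_steps):
        outputs = decoder.generate(batch_input_ids)  # ~50 s inference (actor rollouts)
        train_one_step(policy_model, outputs)        # ~10 s training (placeholder)

        # The ~3 min bottleneck: every step below goes through the disk.
        save_trtllm_checkpoint(policy_model, ckpt_dir)  # placeholder serializer
        subprocess.run(["trtllm-refit",
                        "--checkpoint_dir", ckpt_dir,
                        "--engine_dir", engine_dir,
                        "--output_dir", refit_dir], check=True)
        decoder = ModelRunnerCpp.from_dir(refit_dir,
                                          rank=model_parallel_rank)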

It would be nice if I could perform the refitting operation directly in memory, without a round trip through the disk. I am not sure whether this is possible with the C++ runtime. I would much appreciate your help!

TranSirius commented 3 days ago

🤔 It seems that I do not have permission to reopen this issue. I am not sure whether it is still being tracked.

nekorobov commented 18 hours ago

Hi @TranSirius ,

Refitting in the C++ runtime directly in device memory, without saving to disk, is currently not supported. You can try to add the logic manually, but I am not 100% sure about the feasibility and applicability for your use case. You can check how the ONNX parser refit works in https://github.com/onnx/onnx-tensorrt/blob/main/ModelRefitter.cpp; the particularly interesting functions are getRefittableWeights and setNamedWeights. Refer to the developer guide for documentation: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#refitting-engine-c
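
For reference, the TensorRT Python API exposes counterparts of the two C++ functions named above (Refitter.get_all_weights and Refitter.set_named_weights). Below is a minimal sketch of an in-memory refit on a deserialized engine, assuming the engine was built refittable and that new_weights is a dict mapping weight names to NumPy arrays (both assumptions); note that the Executor holds its own engine copy, so this alone does not update a running Executor:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)

    # Deserialize an engine that was built with refitting enabled
    # (engine_path is an assumption, e.g. the rank0 engine file).
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    refitter = trt.Refitter(engine, logger)

    # Python counterpart of getRefittableWeights: all refittable weight names.
    for name in refitter.get_all_weights():
        if name in new_weights:
            # Python counterpart of setNamedWeights; the arrays must stay
            # alive until refit_cuda_engine() is called (no copy is made).
            refitter.set_named_weights(name, trt.Weights(new_weights[name]))

    # Apply the updates; returns False if required weights are still missing.
    assert refitter.refit_cuda_engine()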