Hi @hello-11,
could you try to update weights after building the engine, but before running the C++ runtime:
trtllm-refit --checkpoint_dir <checkpoint_dir> --engine_dir <refittable_engine> --output_dir <output_new_engine>
Please refer to the docs for more info: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/sample_weight_stripping#engine-refitter
I'll close the issue, but feel free to reopen if you have more questions.
Hi @nekorobov,
Thanks for your quick response! I have found the `trtllm-refit` command. However, it does not work in my use case.
In my use case, I use the TRT-LLM engine as the actor for RLHF, so I need to update the parameters in the engine frequently. If I refit the engine with `trtllm-refit` before initializing the Executor, I first have to save a checkpoint to disk and then refit the engine from that on-disk checkpoint. Once all of that finishes, I need to re-initialize the Executor by calling its `from_dir` method.
In this scenario, it takes about 50 seconds to finish the inference process and about 10 seconds to train the model for one step. However, it takes about 3 minutes to save the updated checkpoint --> refit via `trtllm-refit` --> load the Executor again, which makes this the most significant bottleneck in the whole process.
It would be nice if I could perform the refit directly in memory, without having to go through disk. I am not sure whether that is possible with the C++ runtime. I would really appreciate your help!
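To make the cost concrete, the current per-update loop looks roughly like the sketch below. `save_trtllm_checkpoint` is just a placeholder for whatever converts the trained actor into a TRT-LLM checkpoint on disk; the CLI flags and `from_dir` call are the ones mentioned above.

```python
import subprocess
from tensorrt_llm.runtime import ModelRunnerCpp

def refit_via_disk(actor_model, checkpoint_dir, engine_dir, refitted_dir, rank):
    # 1. Serialize the updated actor weights as a TRT-LLM checkpoint on disk
    #    (placeholder helper, pipeline-specific).
    save_trtllm_checkpoint(actor_model, checkpoint_dir)

    # 2. Refit the engine on disk with the trtllm-refit CLI
    #    (this plus the reload below is the ~3 minute step described above).
    subprocess.run(
        ["trtllm-refit",
         "--checkpoint_dir", checkpoint_dir,
         "--engine_dir", engine_dir,
         "--output_dir", refitted_dir],
        check=True,
    )

    # 3. Re-create the C++ runtime from the refitted engine.
    return ModelRunnerCpp.from_dir(refitted_dir, rank=rank)
```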
🤔 It seems that I do not have permission to reopen this issue. I am not sure whether it is still being tracked.
Hi @TranSirius,
refitting in the C++ runtime directly in device memory without saving on disk is currently not supported. You can try to add logic manually, but I am not 100% sure about the feasibility and applicability for your use-case. You can check how the onnx parser refit works in https://github.com/onnx/onnx-tensorrt/blob/main/ModelRefitter.cpp
Particularly interesting functions are getRefittableWeights and setNamedWeights. Refer to the developer guide for documentation: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#refitting-engine-c
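If you do go down the manual route, the generic TensorRT refit flow through the Python bindings looks roughly like the sketch below. The engine path and the `new_weights` dict are placeholders, the engine has to be built as refittable, and I have not verified how this interacts with a TRT-LLM engine served by the Executor.

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Deserialize an engine that was built as refittable (placeholder path).
with open("/path/to/refittable_engine/rank0.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

refitter = trt.Refitter(engine, logger)

# Placeholder: maps refittable weight names to numpy arrays with the updated
# values, e.g. produced by the training side.
new_weights = {}

for name in refitter.get_all_weights():  # names of all refittable weights
    if name in new_weights:
        refitter.set_named_weights(
            name, trt.Weights(np.ascontiguousarray(new_weights[name]))
        )

# Refit only succeeds once every weight it still requires has been supplied.
assert not refitter.get_missing_weights()
assert refitter.refit_cuda_engine()
```

The C++ counterparts are IRefitter::getAllWeights, IRefitter::setNamedWeights and IRefitter::refitCudaEngine.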
I am using TensorRT-LLM version 0.12.0 with the following setup:

1. Build the engine with `engine = build(model, build_config, return_build_config)` (the `build()` function in the builder.py file), and save the engine to disk via `engine.save()`.
2. Load the engine with `decoder = ModelRunnerCpp.from_dir(engine_dir, rank=model_parallel_rank)`.
Now, the `from_dir()` function creates a C++ runtime via the Executor. However, the Executor is not compatible with the refitter functionality from TensorRT. As far as I can tell, online refitting is only supported by the Python runtime, while in-flight batching is only supported by the C++ runtime. I am writing this issue to confirm whether it is possible to enable both online refitting and IFB with the current version of TRT-LLM.
For further clarification, online refitting refers to updating the model parameters in the TRT-LLM engine by passing a weight dict in memory directly, without serializing the whole model and writing the engine out to disk. (Just in case I am using the wrong wording.)
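To illustrate the kind of interface I am hoping for, here is a purely hypothetical sketch; `refit()` does not exist on ModelRunnerCpp in 0.12.0 (that is exactly what I am asking about), `actor_model` stands for the PyTorch policy being trained, and `decoder` is the runner created above with `from_dir()`.

```python
# Hypothetical API, for illustration only -- nothing like this exists in TRT-LLM 0.12.0.
updated_weights = {name: p.detach() for name, p in actor_model.named_parameters()}

# Desired: push the new weights into the already-loaded engine in device memory,
# without writing a checkpoint to disk and without re-creating the Executor.
decoder.refit(updated_weights)  # hypothetical method
```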
Thanks a lot for your help!