NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment
Apache License 2.0

GPTGenerateTRTLLM.trt_llm_exporter.refit failed due to empty weights in the refit engine during PPO actor training #264

Open renweizhukov opened 1 month ago

renweizhukov commented 1 month ago

Describe the bug

We followed Accelerated-RLHF.md to accelerate the PPO training by using TensorRT-LLM. After launching the reward model and critic server, we launched the initial policy and the PPO actor training. We encountered an error at the beginning of the 2nd step of the PPO actor training:

RuntimeError: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Failed to refit transformer.layers.0.input_layernorm.bias (/opt/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:229)
1       0x7fcb9b2f2ee4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7fcb9b309254 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xd8254) [0x7fcb9b309254]
3       0x7fcb9b429e88 tensorrt_llm::runtime::GptSession::refitEngine(std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, nvinfer1::Weights>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, nvinfer1::Weights> > >) + 376
4       0x7fcb9ed8893f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xc093f) [0x7fcb9ed8893f]
5       0x7fcb9ed2b9ce /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x639ce) [0x7fcb9ed2b9ce]
6       0x7fcb9ed1419b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x4c19b) [0x7fcb9ed1419b]
7       0x55f7b066610e python(+0x15a10e) [0x55f7b066610e]
8       0x55f7b065ca7b _PyObject_MakeTpCall + 603
9       0x55f7b0674acb python(+0x168acb) [0x55f7b0674acb]
10      0x55f7b0654cfa _PyEval_EvalFrameDefault + 24906
11      0x55f7b06747f1 python(+0x1687f1) [0x55f7b06747f1]
12      0x55f7b065053c _PyEval_EvalFrameDefault + 6540
13      0x55f7b06669fc _PyFunction_Vectorcall + 124
14      0x55f7b064f45c _PyEval_EvalFrameDefault + 2220
15      0x55f7b06747f1 python(+0x1687f1) [0x55f7b06747f1]
16      0x55f7b0654cfa _PyEval_EvalFrameDefault + 24906
17      0x55f7b06669fc _PyFunction_Vectorcall + 124
18      0x55f7b06515d7 _PyEval_EvalFrameDefault + 10791
19      0x55f7b06669fc _PyFunction_Vectorcall + 124
20      0x55f7b064f45c _PyEval_EvalFrameDefault + 2220
21      0x55f7b06669fc _PyFunction_Vectorcall + 124
22      0x55f7b064f45c _PyEval_EvalFrameDefault + 2220
23      0x55f7b06669fc _PyFunction_Vectorcall + 124
24      0x55f7b064f26d _PyEval_EvalFrameDefault + 1725
25      0x55f7b06669fc _PyFunction_Vectorcall + 124
26      0x55f7b065053c _PyEval_EvalFrameDefault + 6540
27      0x55f7b06747f1 python(+0x1687f1) [0x55f7b06747f1]
28      0x55f7b065053c _PyEval_EvalFrameDefault + 6540
29      0x55f7b06669fc _PyFunction_Vectorcall + 124
30      0x55f7b064f26d _PyEval_EvalFrameDefault + 1725
31      0x55f7b06669fc _PyFunction_Vectorcall + 124
32      0x55f7b064f26d _PyEval_EvalFrameDefault + 1725
33      0x55f7b06669fc _PyFunction_Vectorcall + 124
34      0x55f7b065053c _PyEval_EvalFrameDefault + 6540
35      0x55f7b06669fc _PyFunction_Vectorcall + 124
36      0x55f7b065053c _PyEval_EvalFrameDefault + 6540
37      0x55f7b06669fc _PyFunction_Vectorcall + 124
38      0x55f7b064f26d _PyEval_EvalFrameDefault + 1725
39      0x55f7b064b9c6 python(+0x13f9c6) [0x55f7b064b9c6]
40      0x55f7b0741256 PyEval_EvalCode + 134
41      0x55f7b076c108 python(+0x260108) [0x55f7b076c108]
42      0x55f7b07659cb python(+0x2599cb) [0x55f7b07659cb]
43      0x55f7b076be55 python(+0x25fe55) [0x55f7b076be55]
15      0x5592a5a207f1 python(+0x1687f1) [0x5592a5a207f1]
16      0x5592a5a00cfa _PyEval_EvalFrameDefault + 24906
17      0x5592a5a129fc _PyFunction_Vectorcall + 124
18      0x5592a59fd5d7 _PyEval_EvalFrameDefault + 10791
19      0x5592a5a129fc _PyFunction_Vectorcall + 124
20      0x5592a59fb45c _PyEval_EvalFrameDefault + 2220
21      0x5592a5a129fc _PyFunction_Vectorcall + 124
22      0x5592a59fb45c _PyEval_EvalFrameDefault + 2220
23      0x5592a5a129fc _PyFunction_Vectorcall + 124
24      0x5592a59fb26d _PyEval_EvalFrameDefault + 1725
25      0x5592a5a129fc _PyFunction_Vectorcall + 124
26      0x5592a59fc53c _PyEval_EvalFrameDefault + 6540
27      0x5592a5a207f1 python(+0x1687f1) [0x5592a5a207f1]
28      0x5592a59fc53c _PyEval_EvalFrameDefault + 6540
29      0x5592a5a129fc _PyFunction_Vectorcall + 124
30      0x5592a59fb26d _PyEval_EvalFrameDefault + 1725
31      0x5592a5a129fc _PyFunction_Vectorcall + 124
32      0x5592a59fb26d _PyEval_EvalFrameDefault + 1725
33      0x5592a5a129fc _PyFunction_Vectorcall + 124
34      0x5592a59fc53c _PyEval_EvalFrameDefault + 6540
35      0x5592a5a129fc _PyFunction_Vectorcall + 124
36      0x5592a59fc53c _PyEval_EvalFrameDefault + 6540
37      0x5592a5a129fc _PyFunction_Vectorcall + 124
38      0x5592a59fb26d _PyEval_EvalFrameDefault + 1725
39      0x5592a59f79c6 python(+0x13f9c6) [0x5592a59f79c6]
40      0x5592a5aed256 PyEval_EvalCode + 134
41      0x5592a5b18108 python(+0x260108) [0x5592a5b18108]
42      0x5592a5b119cb python(+0x2599cb) [0x5592a5b119cb]
43      0x5592a5b17e55 python(+0x25fe55) [0x5592a5b17e55]
44      0x5592a5b17338 _PyRun_SimpleFileObject + 424
45      0x5592a5b16f83 _PyRun_AnyFileObject + 67
46      0x5592a5b09a5e Py_RunMain + 702
47      0x5592a5ae002d Py_BytesMain + 45
48      0x7fabb9b06d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fabb9b06d90]
49      0x7fabb9b06e40 __libc_start_main + 128
50      0x5592a5adff25 _start + 37

According to the comment on setNamedWeights in TensorRT v9.3.0, setNamedWeights may fail for two reasons: the weight name is null or does not correspond to any refittable weights, or the number of weights is inconsistent with the original specification.

To debug this error, we retrieved the existing weights for the given name inside TensorRT-LLM (see commit aff4e0f5) and found that they are empty, i.e., values == nullptr and count == 0. Note that the 1st step completed successfully, which implies that the engine was compiled and was able to generate responses.
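
The refit itself happens in TensorRT-LLM's C++ runtime (see the GptSession::refitEngine frame in the stack above), but the same kind of inspection can be sketched with TensorRT's Python Refitter API. The snippet below is only a minimal illustration under assumptions (the engine path, the placeholder weights, and their shape/dtype are made up for this example) and is not the code path that NeMo-Aligner actually runs:

    # Minimal sketch (not NeMo-Aligner code): list which weights of a refittable
    # TensorRT engine can be refit and exercise set_named_weights, the Python
    # counterpart of the setNamedWeights call that fails above.
    import numpy as np
    import tensorrt as trt

    ENGINE_PATH = "rank0.engine"  # placeholder path to a refittable engine
    FAILING_NAME = "transformer.layers.0.input_layernorm.bias"

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(ENGINE_PATH, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    refitter = trt.Refitter(engine, logger)

    # Every weight name the engine considers refittable.
    all_names = refitter.get_all_weights()
    print(len(all_names), "refittable weights")
    print("failing name known to engine:", FAILING_NAME in all_names)

    # set_named_weights returns False on rejection (unknown name or weights
    # inconsistent with what the engine expects); the placeholder array below
    # has to match the real count/dtype of the named weight to succeed.
    dummy = np.zeros(2048, dtype=np.float16)  # placeholder shape/dtype
    ok = refitter.set_named_weights(FAILING_NAME, trt.Weights(dummy))
    print("set_named_weights succeeded:", ok)

    # Names that still have to be supplied before refit_cuda_engine() can run.
    print("missing weights:", refitter.get_missing_weights())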

Steps/Code to reproduce bug

We can reproduce this bug on an AWS p4de instance with 8 A100 GPUs.

  1. Build a Docker image of NeMo-Aligner v0.3.0.trtllm. Since we ran into several unexpected issues when using the Docker image built from the NeMo-Aligner v0.3.0.trtllm Dockerfile, we applied our fixes in NeMo (see forked branch), added more logging in TensorRT-LLM (see forked branch), and built a Docker image from this Dockerfile: https://github.com/renweizhukov/NeMo-Aligner/blob/v0.3.0.trtllm-fix/Dockerfile

    $ git clone https://github.com/renweizhukov/NeMo-Aligner.git
    $ cd NeMo-Aligner
    $ git checkout v0.3.0.trtllm-fix
    $ docker build -t nemo-aligner-trt-llm .
  2. Launch the Docker container.

    $ export IMAGE_TRTLLM="nemo-aligner-trt-llm"
    $ docker run -itd --gpus all --net=host --ipc=host --privileged --shm-size=512g \
    --ulimit memlock=-1 --ulimit stack=67108864 --name=nemoaligner_ppo_trtllm \
    -v /workplace/:/workplace $IMAGE_TRTLLM
  3. Run the critic server inside the container.

    $ docker exec -it nemoaligner_ppo_trtllm bash 
    
    # In the docker:
    $ CHECKPOINT_NEMO_FILE="[reward-model-checkpoint-path]"
    $ GPFS="/workplace/NeMo-Aligner"
    $ RESULTS_DIR="[results-dir]"
    $ TP_SIZE=4
    $ PP_SIZE=1
    $ CRITIC_PORT=5567
    # We use the first 4 GPUs for the critic server.
    $ export CUDA_VISIBLE_DEVICES=0,1,2,3
    
    $ export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
    && export HYDRA_FULL_ERROR=1 \
    && python -u ${GPFS}/examples/nlp/gpt/serve_ppo_critic.py \
     exp_manager.create_wandb_logger=False \
     exp_manager.explicit_log_dir=${RESULTS_DIR} \
     exp_manager.wandb_logger_kwargs.name=critic_training \
     exp_manager.wandb_logger_kwargs.project=nemo_aligner_ppo \
     ++model.activations_checkpoint_granularity=null \
     ++model.mcore_gpt=True \
     ++model.megatron_amp_O2=True \
     ++model.offload_adam_states=True \
     ++model.pipeline_model_parallel_size=${PP_SIZE} \
     ++model.tensor_model_parallel_size=${TP_SIZE} \
     ++pretrained_checkpoint.restore_from_path=${CHECKPOINT_NEMO_FILE} \
     trainer.devices=4 \
     trainer.num_nodes=1 \
     trainer.ppo.port=${CRITIC_PORT} \
     trainer.ppo.inference_micro_batch_size=4 \
     ++trainer.ppo.combine_rm_and_critic_server=True
  4. Run the PPO actor training inside the container.

    $ docker exec -it nemoaligner_ppo_trtllm bash
    
    # In the docker:
    $ cd /workplace/NeMo-Aligner/examples/nlp/gpt 
    $ GPFS="/workplace/NeMo-Aligner"
    $ TRAIN_DATA_PATH="[train-data-path]"
    $ VALID_DATA_PATH="[valid-data-path]"
    $ PRETRAINED_ACTOR_NEMO_FILE="[GPT-2B-nemo-checkpoint-path]"
    $ RESULTS_DIR="[results-dir]"
    $ TP_SIZE=4
    $ PP_SIZE=1
    $ MAX_EPOCHS=1
    $ MAX_STEPS=15
    $ CRITIC_IP="0.0.0.0"
    $ CRITIC_PORT=5567
    
    # We use the remaining 4 GPUs for PPO actor training.
    $ export CUDA_VISIBLE_DEVICES=4,5,6,7
    
    $ export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
    && export HYDRA_FULL_ERROR=1 \
    && mpirun -n 4 --allow-run-as-root python -u ${GPFS}/examples/nlp/gpt/train_gpt_ppo_actor.py \
     exp_manager.create_wandb_logger=False \
     exp_manager.explicit_log_dir=${RESULTS_DIR} \
     exp_manager.wandb_logger_kwargs.name=ppo_actor_training \
     exp_manager.wandb_logger_kwargs.project=nemo_aligner_ppo \
     +exp_manager.mlflow_logger_kwargs.experiment_name=${USER}-ppo-train-local \
     pretrained_checkpoint.restore_from_path=${PRETRAINED_ACTOR_NEMO_FILE} \
     trainer.devices=4 \
     trainer.num_nodes=1 \
     trainer.ppo.flask_server.enable=True \
     trainer.ppo.initial_policy_kl_penalty=0.02 \
     trainer.ppo.max_epochs=${MAX_EPOCHS} \
     trainer.ppo.max_steps=${MAX_STEPS} \
     trainer.ppo.trt_llm.enable=True \
     trainer.ppo.trt_llm.reshard=True \
     trainer.ppo.val_check_interval=3 \
     ++trainer.ppo.normalize_advantages=True \
     ++model.activations_checkpoint_granularity=selective \
     ++model.activations_checkpoint_method=uniform \
     ++model.data.data_impl=jsonl \
     "++model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${VALID_DATA_PATH}]}" \
     ++model.global_batch_size=16 \
     ++model.mcore_gpt=True \
     ++model.megatron_amp_O2=True \
     ++model.micro_batch_size=1 \
     ++model.optim.lr=9e-7 \
     ++model.optim.sched.min_lr=9e-8 \
     ++model.pipeline_model_parallel_size=${PP_SIZE} \
     ++model.ppo.entropy_bonus=0.0 \
     ++model.ppo.length_params.max_length=1024 \
     ++model.ppo.num_rollout_samples=16 \
     ++model.ppo.offload_adam_states=True \
     ++model.ppo.ratio_eps=0.2 \
     ++model.ppo.rollout_micro_batch_size=4 \
     ++model.tensor_model_parallel_size=${TP_SIZE} \
     remote_critic_rm.combine_rm_and_critic_server=True \
     remote_critic_rm.critic.ip=${CRITIC_IP} \
     remote_critic_rm.critic.port=${CRITIC_PORT}

Expected behavior

We expect the PPO actor training job to succeed.

Environment overview

Environment details


Additional context


8 NVIDIA A100-SXM4-80GB

renweizhukov commented 4 weeks ago

Attached are the model_config.yaml files of the three models for which we hit this issue: GPT-2B, Mistral-7B-Instruct-v0.2, and tulu-2-7b. Note that GitHub does not allow uploading .yaml files, so we changed the file extension to .txt.