renweizhukov opened this issue 1 month ago
Attached are the model_config.yaml files of the 3 models for which we hit this issue, i.e., GPT-2B, Mistral-7B-Instruct-v0.2, and tulu-2-7b. Note that GitHub does not allow the upload of .yaml files, so we changed the file extension to .txt.
Describe the bug
We followed Accelerated-RLHF.md to accelerate the PPO training by using TensorRT-LLM. After launching the reward model and critic server, we launched the initial policy and the PPO actor training. We encountered an error at the beginning of the 2nd step of the PPO actor training.
According to the comment on setNamedWeights in TensorRT v9.3.0, setNamedWeights may fail for two reasons. To debug this error, we retrieved the existing weights for the given name in TensorRT-LLM (see commit aff4e0f5) and found that they are empty, i.e., values == nullptr and count == 0. Note that the 1st step completed successfully, which implies that the engine was compiled and was able to generate the response.

Steps/Code to reproduce bug
We can reproduce this bug on a p4de instance with 8 A100 GPUs.
1. Build a docker image of NeMo-Aligner v0.3.0.trtllm. Since we ran into several unexpected issues when using the docker image built from the NeMo-Aligner v0.3.0.trtllm Dockerfile, we applied our fixes in NeMo (see forked branch), added more logging in TensorRT-LLM (see forked branch), and built a docker image from this Dockerfile: https://github.com/renweizhukov/NeMo-Aligner/blob/v0.3.0.trtllm-fix/Dockerfile
2. Launch the docker container.
3. Run the critic server in the container.
4. Run the PPO actor training in the container.
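For reference, the empty-weights condition we observed can be sketched as below. This is a hedged Python stand-in (not the actual TensorRT-LLM C++ code from our fork): the `Weights` dataclass and `can_refit` helper are illustrative names that mirror TensorRT's `nvinfer1::Weights` struct (a `values` pointer plus an element `count`) and the guard we effectively logged around `setNamedWeights` in commit aff4e0f5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Weights:
    """Minimal stand-in for nvinfer1::Weights: a values pointer and a count."""
    values: Optional[bytes]  # None models values == nullptr
    count: int               # number of weight elements


def can_refit(name: str, weights: Weights) -> bool:
    """Return False when the weights looked up by name are empty,
    which is the state we observed at the start of PPO step 2."""
    if weights.values is None or weights.count == 0:
        print(f"[refit] empty weights for '{name}': "
              f"values={'nullptr' if weights.values is None else 'set'}, "
              f"count={weights.count}")
        return False
    return True
```

In our failing run, the weights retrieved for the reported name behaved like `Weights(values=None, count=0)`, so a call such as `can_refit(name, Weights(None, 0))` returns False, matching the setNamedWeights rejection.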
Expected behavior
We expect the PPO actor training job to succeed.
Environment overview (please complete the following information)
If the method of install is [Docker], provide the docker pull & docker run commands used.

Environment details
If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:
Additional context
8 NVIDIA A100-SXM4-80GB