haizadinia opened this issue 2 months ago
Update regarding the checkpoint-saving issue with TensorRT-enabled PPO training: we verified that falling back to zarr bypasses the issue. However, we are still unable to save checkpoints in torch_dist format, even after switching the multiprocessing context from spawn to fork.
Here is the error log for that setup (multiprocessing context switched from spawn to fork, torch_dist checkpoint format, TRT-enabled PPO training); a minimal isolation sketch follows the log:
[rank0]: File "/opt/apex/apex/contrib/optimizers/distributed_fused_adam.py", line 3198, in start_all_gather
[rank0]: all_gather_into_tensor(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2857, in all_gather_into_tensor
[rank0]: work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2006, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Failed to CUDA calloc async 24 bytes
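To help isolate the failure, here is a minimal sketch (not from the original report) that mimics the suspect pattern: CUDA is touched in the parent process, worker processes are then created with the fork start method, and each worker runs the same all_gather_into_tensor collective that fails inside distributed_fused_adam above. The world size, tensor shapes, and MASTER_ADDR/MASTER_PORT values are placeholder assumptions; on a machine with at least two GPUs this typically fails with a CUDA (re)initialization error in the forked children, which would point at the fork start method rather than anything TRT-specific.

```python
# Minimal isolation sketch (assumptions: single node, >= 2 GPUs, recent PyTorch).
# Not the NeMo-Aligner launch path -- it only mimics "fork after CUDA init"
# followed by the collective shown in the traceback above.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # placeholder rendezvous address
    os.environ.setdefault("MASTER_PORT", "29500")       # placeholder port
    torch.cuda.set_device(rank)                         # may already fail in a forked child
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Same collective as distributed_fused_adam.start_all_gather in the log.
    inp = torch.full((4,), float(rank), device="cuda")
    out = torch.empty(4 * world_size, device="cuda")
    dist.all_gather_into_tensor(out, inp)
    print(f"rank {rank}: {out.tolist()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    torch.cuda.init()                 # parent touches CUDA, as a training launcher might
    ctx = mp.get_context("fork")      # the start method we switched to
    procs = [ctx.Process(target=worker, args=(r, world_size)) for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```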
Describe the bug
During a PPO actor training run with TensorRT enabled, an error was encountered during the validation checkpointing process. The training used the TensorRT-LLM setup suggested in the documentation at TRT-LLM Accelerated-RLHF, and the latest NeMo Docker image was used for the experiment.
The issue occurred specifically when the training job attempted to save checkpoints with TensorRT enabled. When the PPO actor training was run without the TensorRT-enabled setup, validation checkpointing succeeded and the checkpoints were saved without any errors.
This is the error message:
Here is the list of files saved for the checkpoint when the PPO actor training runs without the TensorRT-enabled setup:
Additionally, when the PPO actor training was run with TensorRT enabled but with validation checkpointing disabled, the training process did not encounter any errors. Here is the log of running the PPO actor training with validation checkpointing disabled:
In summary, the error was observed only during the validation checkpointing process when using the TensorRT-enabled setup. The training was successful without the TensorRT-enabled setup or when the validation checkpointing was disabled.
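One observation that may be relevant: the error log quoted at the top of this thread shows the failure inside distributed_fused_adam's start_all_gather, i.e. an NCCL all-gather issued while the checkpoint is being saved. The toy sketch below is hypothetical (it is not the apex or NeMo implementation) and only illustrates that call pattern: gathering per-rank sharded optimizer-like state into a full tensor before rank 0 writes it to disk. It assumes a launch via torchrun on at least 2 GPUs; the class, file path, and tensor sizes are made up for illustration.

```python
# Toy illustration (hypothetical; assumes `torchrun --nproc_per_node=2 toy_ckpt.py`
# on >= 2 GPUs): saving a checkpoint of sharded optimizer-like state triggers an
# NCCL all-gather, mirroring the distributed_fused_adam.start_all_gather frame
# in the error log above.
import torch
import torch.distributed as dist


class ShardedState:
    """Toy stand-in for per-rank sharded optimizer state."""

    def __init__(self, shard: torch.Tensor):
        self.shard = shard

    def state_dict(self) -> dict:
        # Reassembling the full state requires a collective; this is the kind
        # of step that fails during validation checkpointing in the report.
        full = torch.empty(self.shard.numel() * dist.get_world_size(),
                           device=self.shard.device)
        dist.all_gather_into_tensor(full, self.shard)
        return {"state": full.cpu()}


def save_checkpoint(path: str, sharded: ShardedState) -> None:
    state = sharded.state_dict()      # collective runs here, on every rank
    if dist.get_rank() == 0:
        torch.save(state, path)


if __name__ == "__main__":
    dist.init_process_group("nccl")   # rendezvous comes from torchrun env vars
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    sharded = ShardedState(torch.full((4,), float(rank), device="cuda"))
    save_checkpoint("/tmp/toy_ckpt.pt", sharded)
    dist.destroy_process_group()
```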
Steps/Code to reproduce bug
To reproduce the bug on a p4de instance with 8 A100 GPUs:
1. Pull the latest NeMo Docker image and launch the container.
2. Run the PPO critic server inside the container.
3. Run the PPO actor training inside the container.
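As an optional preflight (not part of the original reproduction steps), the hypothetical snippet below only confirms that the container sees all 8 GPUs and that the NCCL backend is available before launching the critic server and the actor training:

```python
# Hypothetical preflight check, not from the original report: confirm GPU
# visibility and NCCL availability inside the NeMo container before launching
# the PPO critic server and actor training.
import torch
import torch.distributed as dist

assert torch.cuda.is_available(), "CUDA is not visible inside the container"
print("GPUs visible:", torch.cuda.device_count())          # expect 8 on a p4de
print("NCCL available:", dist.is_nccl_available())
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda,
      "| NCCL:", torch.cuda.nccl.version())
```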
Expected behavior
The PPO actor training run with the TensorRT-enabled setup should complete successfully, and the training process should be able to save checkpoints during both the training and validation checkpointing stages without encountering any issues.
Environment overview (please complete the following information)
docker pull & docker run commands used
Environment details
If an NVIDIA Docker image is used you don't need to specify these. Otherwise, please provide:
Additional context
Using 8 NVIDIA A100-SXM4-80GB GPUs.