haizadinia opened this issue 2 months ago
Update regarding the checkpoint-saving issue with TensorRT-enabled PPO training: we verified that falling back to zarr bypasses the issue. However, we are still unable to save checkpoints in torch_dist format, even after switching the multiprocessing context from spawn to fork.
Here is the error log for that setup (multiprocessing context switched from spawn to fork, torch_dist checkpoint format, TRT-enabled PPO training); a minimal isolation sketch follows the log:
[rank0]: File "/opt/apex/apex/contrib/optimizers/distributed_fused_adam.py", line 3198, in start_all_gather
[rank0]: all_gather_into_tensor(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2857, in all_gather_into_tensor
[rank0]: work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2006, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Failed to CUDA calloc async 24 bytes
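To help isolate the failure, here is a minimal sketch (not from the original report) that mimics the suspect pattern: CUDA is touched in the parent process, worker processes are then created with the fork start method, and each worker runs the same all_gather_into_tensor collective that fails inside distributed_fused_adam above. The world size, tensor shapes, and MASTER_ADDR/MASTER_PORT values are placeholder assumptions; on a machine with at least two GPUs this typically fails with a CUDA (re)initialization error in the forked children, which would point at the fork start method rather than anything TRT-specific.

```python
# Minimal isolation sketch (assumptions: single node, >= 2 GPUs, recent PyTorch).
# Not the NeMo-Aligner launch path -- it only mimics "fork after CUDA init"
# followed by the collective shown in the traceback above.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # placeholder rendezvous address
    os.environ.setdefault("MASTER_PORT", "29500")       # placeholder port
    torch.cuda.set_device(rank)                         # may already fail in a forked child
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Same collective as distributed_fused_adam.start_all_gather in the log.
    inp = torch.full((4,), float(rank), device="cuda")
    out = torch.empty(4 * world_size, device="cuda")
    dist.all_gather_into_tensor(out, inp)
    print(f"rank {rank}: {out.tolist()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    torch.cuda.init()                 # parent touches CUDA, as a training launcher might
    ctx = mp.get_context("fork")      # the start method we switched to
    procs = [ctx.Process(target=worker, args=(r, world_size)) for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```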
Describe the bug
During a PPO actor training run with TensorRT enabled, an error was encountered during the validation checkpointing process. The training used the TensorRT-LLM setup suggested in the documentation at TRT-LLM Accelerated-RLHF, and the latest NeMo Docker image was used for the experiment.
The issue occurred specifically when the training job attempted to save checkpoints with TensorRT enabled. When the PPO actor training was run without the TensorRT-enabled setup, validation checkpointing succeeded and the checkpoints were saved without any errors.
This is the error message:
Here is the list of files saved for the checkpoint when the PPO actor training runs without the TensorRT-enabled setup:
Additionally, when the PPO actor training was run with TensorRT enabled but with validation checkpointing disabled, the training process did not encounter any errors. Here is the log of running the PPO actor training with validation checkpointing disabled:
In summary, the error was observed only during the validation checkpointing process when using the TensorRT-enabled setup. The training was successful without the TensorRT-enabled setup or when the validation checkpointing was disabled.
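One observation that may be relevant: the error log quoted at the top of this thread shows the failure inside distributed_fused_adam's start_all_gather, i.e. an NCCL all-gather issued while the checkpoint is being saved. The toy sketch below is hypothetical (it is not the apex or NeMo implementation) and only illustrates that call pattern: gathering per-rank sharded optimizer-like state into a full tensor before rank 0 writes it to disk. It assumes a launch via torchrun on at least 2 GPUs; the class, file path, and tensor sizes are made up for illustration.

```python
# Toy illustration (hypothetical; assumes `torchrun --nproc_per_node=2 toy_ckpt.py`
# on >= 2 GPUs): saving a checkpoint of sharded optimizer-like state triggers an
# NCCL all-gather, mirroring the distributed_fused_adam.start_all_gather frame
# in the error log above.
import torch
import torch.distributed as dist


class ShardedState:
    """Toy stand-in for per-rank sharded optimizer state."""

    def __init__(self, shard: torch.Tensor):
        self.shard = shard

    def state_dict(self) -> dict:
        # Reassembling the full state requires a collective; this is the kind
        # of step that fails during validation checkpointing in the report.
        full = torch.empty(self.shard.numel() * dist.get_world_size(),
                           device=self.shard.device)
        dist.all_gather_into_tensor(full, self.shard)
        return {"state": full.cpu()}


def save_checkpoint(path: str, sharded: ShardedState) -> None:
    state = sharded.state_dict()      # collective runs here, on every rank
    if dist.get_rank() == 0:
        torch.save(state, path)


if __name__ == "__main__":
    dist.init_process_group("nccl")   # rendezvous comes from torchrun env vars
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    sharded = ShardedState(torch.full((4,), float(rank), device="cuda"))
    save_checkpoint("/tmp/toy_ckpt.pt", sharded)
    dist.destroy_process_group()
```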
Steps/Code to reproduce bug
To reproduce the bug on a p4de instance with 8 A100 GPUs:
1. Pull the latest NeMo Docker image and launch the container.
2. Run the PPO critic server inside the container.
3. Run the PPO actor training inside the container.
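As an optional preflight (not part of the original reproduction steps), the hypothetical snippet below only confirms that the container sees all 8 GPUs and that the NCCL backend is available before launching the critic server and the actor training:

```python
# Hypothetical preflight check, not from the original report: confirm GPU
# visibility and NCCL availability inside the NeMo container before launching
# the PPO critic server and actor training.
import torch
import torch.distributed as dist

assert torch.cuda.is_available(), "CUDA is not visible inside the container"
print("GPUs visible:", torch.cuda.device_count())          # expect 8 on a p4de
print("NCCL available:", dist.is_nccl_available())
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda,
      "| NCCL:", torch.cuda.nccl.version())
```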
Expected behavior
The PPO actor training run with the TensorRT-enabled setup should complete successfully, and the training process should be able to save checkpoints during both the training and validation checkpointing stages without encountering any issues.
Environment overview (please complete the following information)
docker pull & docker run commands used
Environment details
If an NVIDIA Docker image is used you don't need to specify these. Otherwise, please provide:
Additional context
Using 8 NVIDIA A100-SXM4-80GB GPUs.