Denys88 / rl_games


Potential Issues with Multi-GPU/Node Training with Central Network Weights Initialization #296

Closed · annan-tang closed this issue 1 month ago

annan-tang commented 2 months ago

Hi, thank you for the great work on this project. I have a couple of questions regarding the multi-GPU/multi-node training implementation, specifically in the context of the central network.

From my understanding of the source code, it appears that the initial parameters of the actor_critic model on GPU rank 0 are broadcast to the other GPU replicas to ensure they all hold the same initial parameters (roughly the pattern sketched below).
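For reference, the pattern I mean looks roughly like this minimal sketch; it uses plain `torch.distributed`, and the function name `sync_initial_params` is my own for illustration, not one from rl_games:

```python
import torch
import torch.distributed as dist

def sync_initial_params(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Broadcast rank-0's parameters so every replica starts identically (sketch)."""
    for param in model.parameters():
        # In-place overwrite: non-source ranks receive rank-0's tensor values.
        dist.broadcast(param.data, src=src_rank)
```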

My questions are as follows:

1. Is broadcasting the initial parameters of the actor_critic model sufficient to ensure that all GPU replicas maintain the same parameters throughout training?
2. Since a different seed may be used on each GPU, the central_network could be initialized with different weights on each replica. Could this be a potential issue for multi-GPU/multi-node training when the central_network is used? (A small consistency-check sketch follows below.)

I appreciate your time and assistance in addressing these questions.
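For concreteness, here is a minimal sketch of how one might verify that a network's parameters actually match across ranks, and re-broadcast them if they do not. `central_value_net` is a hypothetical stand-in for the central network object; none of these names come from rl_games:

```python
import torch
import torch.distributed as dist

def params_match_rank0(model: torch.nn.Module) -> bool:
    """Return True if this rank's parameters equal rank 0's (sketch)."""
    local = torch.cat([p.data.flatten() for p in model.parameters()])
    reference = local.clone()
    # On non-source ranks, `reference` is overwritten with rank-0's copy.
    dist.broadcast(reference, src=0)
    return torch.equal(local, reference)

# Hypothetical usage, e.g. right after model construction:
# if not params_match_rank0(central_value_net):
#     for p in central_value_net.parameters():
#         dist.broadcast(p.data, src=0)
```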