Denys88 / rl_games

RL implementations
MIT License

Deprecate `horovod` in favor of `torch.distributed` #171

Closed by vwxyzjn 2 years ago

vwxyzjn commented 2 years ago

Follow-up to #165 and #158.

We ran a benchmark with isaacgymenvs, and torch.distributed shows consistently better scaling performance than horovod on AllegroHand.

Notable change: I disabled the stats syncing; averaging stats across all workers at every step does not seem to be that important.
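The syncing that was disabled amounts to averaging per-worker statistics every step, which under torch.distributed would be a dist.all_reduce (SUM divided by world size, or ReduceOp.AVG with NCCL). A pure-Python sketch of those semantics, with a hypothetical function name, just to show what each worker ends up holding:

```python
def all_reduce_mean(per_worker_values):
    # Semantics of dist.all_reduce(t, op=ReduceOp.SUM) followed by t /= world_size:
    # after the collective, every worker holds the same mean of all workers' values.
    # (Hypothetical sketch; the real call operates on torch tensors, one per process.)
    world_size = len(per_worker_values)
    mean = sum(per_worker_values) / world_size
    return [mean] * world_size
```

Skipping this collective saves one synchronization point per step; each worker then just logs (or acts on) its own local statistic instead of the global mean.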

[image: scaling benchmark plot, torch.distributed vs horovod on AllegroHand]

You can test it out with

torchrun --standalone --nnodes=1 --nproc_per_node=2 runner.py --train --file rl_games/configs/ppo_cartpole.yaml

CC @ViktorM @markelsanz14

Denys88 commented 2 years ago

@vwxyzjn I'll take a look. Strange, but the KL divergence syncing was needed to set the right LR. Or do you mean it is enough to calculate it on the rank=0 GPU?

But overall this looks awesome and much easier to use :) It takes the average person a few hours of pain to install horovod :)

vwxyzjn commented 2 years ago

Thank you @Denys88

> Strange, but the KL divergence syncing was needed to set the right LR. Or do you mean it is enough to calculate it on the rank=0 GPU?

Yes, it should be enough to calculate it on the rank=0 GPU, at least in the case of isaacgymenvs, where thousands of envs are available, so the local KL estimate is already low-variance.
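The discussion above is about the KL-based adaptive learning-rate schedule: rank 0 computes the mean policy KL over its own (thousands of) envs and adjusts the LR from that, instead of all-reducing the KL first. A minimal sketch of that kind of rule; the function name, thresholds, and the 1.5 factor are illustrative assumptions, not the exact rl_games implementation:

```python
def adaptive_lr(current_lr, kl, kl_threshold=0.008, min_lr=1e-6, max_lr=1e-2):
    # Hypothetical KL-based adaptive LR rule (sketch, not rl_games' exact code):
    # shrink the LR when the policy moved too far, grow it when it barely moved.
    if kl > 2.0 * kl_threshold:
        return max(current_lr / 1.5, min_lr)  # policy changed too much -> slow down
    if kl < 0.5 * kl_threshold:
        return min(current_lr * 1.5, max_lr)  # policy changed too little -> speed up
    return current_lr                         # within the target band -> keep LR
```

With enough envs per GPU, the rank-0 KL is a good estimate of the global KL, so rank 0 can apply this rule locally and broadcast (or let each rank independently compute) the resulting LR without an extra all-reduce every step.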