Denys88 / rl_games

RL implementations
MIT License

Refactor optimizer step logic #163

Closed vwxyzjn closed 2 years ago

vwxyzjn commented 2 years ago

The current implementation is problematic when gradient truncation is disabled: the multi-GPU path then steps the optimizer without the synchronize()/skip_synchronize() handling. The following file diff shows the lines that need to be added.

     if self.truncate_grads:
         if self.multi_gpu:
             self.optimizer.synchronize()
             self.scaler.unscale_(self.optimizer)
             nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_norm)
             with self.optimizer.skip_synchronize():
                 self.scaler.step(self.optimizer)
                 self.scaler.update()
         else:
             self.scaler.unscale_(self.optimizer)
             nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_norm)
             self.scaler.step(self.optimizer)
             self.scaler.update()
     else:
-        self.scaler.step(self.optimizer)
-        self.scaler.update()
+        if self.multi_gpu:
+            with self.optimizer.skip_synchronize():
+                self.scaler.step(self.optimizer)
+                self.scaler.update()
+        else:
+            self.scaler.step(self.optimizer)
+            self.scaler.update()

But this is pretty ugly... so I refactored it to:

        if self.multi_gpu:
            self.optimizer.synchronize()

        if self.truncate_grads:
            self.scaler.unscale_(self.optimizer)
            nn.utils.clip_grad_norm_(self.model.parameters(), self.grad_norm)

        if self.multi_gpu:
            with self.optimizer.skip_synchronize():
                self.scaler.step(self.optimizer)
                self.scaler.update()
        else:
            self.scaler.step(self.optimizer)
            self.scaler.update()
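
For context, here is a minimal, self-contained sketch of where this step logic sits in a training iteration (plain single-GPU path with torch.cuda.amp; the model, optimizer, and loss below are placeholders, not rl_games code, and a CUDA device is assumed):

    import torch
    import torch.nn as nn

    model = nn.Linear(8, 1).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    scaler = torch.cuda.amp.GradScaler()
    truncate_grads, grad_norm = True, 1.0

    x = torch.randn(32, 8, device="cuda")
    y = torch.randn(32, 1, device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()

    # Unscale + clip only when gradient truncation is enabled; the
    # step/update pair runs unconditionally, which is the point of the refactor.
    if truncate_grads:
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), grad_norm)
    scaler.step(optimizer)
    scaler.update()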
Denys88 commented 2 years ago

Yeah, I even had a comment about it in the code :) Maybe let's remove mixed precision too? We can get rid of the scaler. Anyway, it doesn't help. Do you want to apply this change before or after the Horovod replacement?
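
For reference, dropping the scaler would collapse the step to something like this (a minimal single-GPU sketch with placeholder model/optimizer; the multi-GPU synchronize/skip_synchronize wrapping from the refactor above would stay the same):

    import torch
    import torch.nn as nn

    model = nn.Linear(8, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    truncate_grads, grad_norm = True, 1.0

    x, y = torch.randn(32, 8), torch.randn(32, 1)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # With no GradScaler there is nothing to unscale before clipping.
    if truncate_grads:
        nn.utils.clip_grad_norm_(model.parameters(), grad_norm)
    optimizer.step()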

vwxyzjn commented 2 years ago

Maybe it's better to merge this PR as is? We can then remove mixed precision in a separate PR, since that involves docs changes and could impact downstream projects such as isaacgymenvs.

Denys88 commented 2 years ago

Okay! I'll review all the other PRs after work in the evening.

vwxyzjn commented 2 years ago

Also pinging @markelsanz14, are you ok with me filing a PR removing mixed-precision support? I know you were thinking about using it in the future.

markelsanz14 commented 2 years ago

@Denys88 do we know how much removing mixed precision will affect different model sizes? I agree in most cases it won't help, mainly when we have 2-5 layer NNs, but it might speed up training with larger NN sizes.

vwxyzjn commented 2 years ago

Hey @Denys88, @ViktorM, @markelsanz14 and I had a quick chat about this, and maybe it's worth keeping the mixed-precision option? I did a quick benchmark and found virtually no performance difference.

[image: benchmark results]

However, as @markelsanz14 suggested, mixed precision could still offer benefits for large NNs.

markelsanz14 commented 2 years ago

Yeah, I wouldn't put more work into it for now, but removing it altogether seems a bit drastic. I feel like it can be useful in the future with larger NN architectures.

vwxyzjn commented 2 years ago

More data on this: removing the scaler seems to make it less than 5% faster (178 seconds vs 186 seconds), so it could just be a bit of randomness.
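
A difference that small is easy to double-check with a tiny timing harness; a hypothetical sketch (train_one_run is a placeholder for a full rl_games run, not part of the library):

    import time

    def time_run(train_one_run, repeats=3):
        # Average wall-clock seconds over a few repeats to smooth out noise.
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            train_one_run()
            times.append(time.perf_counter() - start)
        return sum(times) / len(times)

    # with_scaler = time_run(lambda: train(mixed_precision=True))
    # without_scaler = time_run(lambda: train(mixed_precision=False))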

[image: timing comparison]

Denys88 commented 2 years ago

Could you check it on some heavy networks, including transformers? I have my IG fork https://github.com/Denys88/IsaacGymEnvs with some fat neural networks, like the OpenAI embedding from the paper or just a transformer: https://github.com/Denys88/IsaacGymEnvs/blob/main/isaacgymenvs/cfg/train/AntPPO_tr.yaml