RobbenRibery opened this issue 2 months ago
Hi @RobbenRibery, that `schedule_fn` is a leftover from code we did not use in our experiments (the original L153 in your diff uses `float(self.lr)` without the negative sign).
Looking at `optax.linear_schedule`, it looks like your change should default correctly to a constant function returning the initial learning rate if `self.anneal_steps == 0`, so I think this is safe to merge. @samvelyan
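For reference, a quick sanity check of that degenerate case (the learning-rate values below are placeholders, not the repo defaults):

```python
import optax

# Placeholder values; anneal_steps == 0 mirrors the default discussed below.
lr, lr_final, anneal_steps = 3e-4, 0.0, 0

schedule_fn = optax.linear_schedule(
    init_value=lr,
    end_value=lr_final,
    transition_steps=anneal_steps,
)

# With transition_steps == 0, optax falls back to a constant schedule,
# so every step receives the initial learning rate.
assert schedule_fn(0) == lr and schedule_fn(1_000_000) == lr
```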
Hi @minqi, thanks for your comment! I see your point. We could enforce something like `self.anneal_steps == 0` or `self.lr_final == self.lr`. Happy to run some experiments to see whether annealing helps further stabilise training.
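A rough sketch of the kind of guard I have in mind (the function name and message are illustrative only; the arguments mirror the config attributes above):

```python
def check_constant_lr(lr: float, lr_final: float, anneal_steps: int) -> None:
    """Hypothetical guard: only allow configs that keep the learning rate constant."""
    if anneal_steps != 0 and lr_final != lr:
        raise ValueError(
            "LR annealing was not used in the reported experiments; "
            "set anneal_steps=0 or lr_final=lr for a constant learning rate."
        )
```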
Hi @RobbenRibery, the default setting for `self.anneal_steps` is 0, and for `self.lr_final` it is None, in which case it defaults to the same value as `self.lr`, so no changes there are necessary.
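Roughly, the defaults resolve like this (illustrative, not a verbatim copy of the trainer code):

```python
import optax

lr = 3e-4           # placeholder learning rate
lr_final = None     # default: falls back to lr
anneal_steps = 0    # default: no annealing

lr_final = lr if lr_final is None else lr_final
schedule_fn = optax.linear_schedule(lr, lr_final, anneal_steps)  # constant at lr
```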
We previously looked at linear annealing, but found it mostly hurt final policy performance on OOD tasks.
Thanks, appreciated!
Hi @minqi, I also found that by setting the following:
```bash
export XLA_FLAGS='--xla_gpu_deterministic_ops=true --xla_gpu_autotune_level=0'
export TF_DETERMINISTIC_OPS=1
python -m minimax.train -- .....
```
I could make the ACCEL runs deterministic, at roughly 20% of the SPS of the non-deterministic runs. Otherwise, even with every RNG split set correctly, I would still get different results.
[ref: W&B runs attached]
In the `reset(self, rng)` method, the learning rate is initially specified as a negative value, which makes learning break down completely. After turning it into a positive value and passing the scheduler into the optax chain (see line 153), ACCEL achieves generalisation on OOD envs [ref: W&B attached].
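For context, the shape of the fix looks roughly like the sketch below; this is not the exact minimax code, and the clipping and Adam hyperparameters are placeholders:

```python
import optax

lr, lr_final, anneal_steps = 3e-4, 3e-5, 1000   # placeholder values

# Positive-valued schedule; the descent direction comes from optax.scale(-1.0)
# at the end of the chain rather than from negating the learning rate itself.
schedule_fn = optax.linear_schedule(
    init_value=lr, end_value=lr_final, transition_steps=anneal_steps
)

tx = optax.chain(
    optax.clip_by_global_norm(0.5),        # placeholder max grad norm
    optax.scale_by_adam(eps=1e-5),         # placeholder Adam epsilon
    optax.scale_by_schedule(schedule_fn),  # scheduler passed into the chain
    optax.scale(-1.0),                     # negate once here, not in the lr
)
```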