XiaoYunZhou27 opened 3 years ago
I have the same confusion. What's more, alpha is a function of global_step, so when batch_size changes, the number of steps per epoch changes too. But the paper says alpha is tied to the ramp-up epoch.
The paper says alpha should be 0.99 at the beginning (when global_step is small) and 0.999 at the end (when global_step is large). However, the code has:
`alpha = min(1 - 1 / (global_step + 1), alpha)`
Following this, alpha is 0 when global_step is 0 and equals the configured alpha (0.99 from the parameters) once global_step reaches 99. That seems different from what the paper describes. The paper would suggest:
`alpha = max(1 - 1 / (global_step + 1), alpha)`
Has anyone else run into this issue?
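To make the difference concrete, here is a minimal sketch of the two schedules side by side. The function names and the `ema_decay` parameter are my own placeholders for the repo's configured alpha of 0.99, not names from the actual code:

```python
def alpha_min(global_step, ema_decay=0.99):
    # Schedule as written in the repo: ramps up from 0 toward ema_decay,
    # then stays capped at ema_decay.
    return min(1 - 1 / (global_step + 1), ema_decay)

def alpha_max(global_step, ema_decay=0.99):
    # Schedule suggested in this issue: starts at ema_decay,
    # then keeps growing toward 1 as training proceeds.
    return max(1 - 1 / (global_step + 1), ema_decay)

for step in (0, 9, 99, 999):
    print(step, round(alpha_min(step), 4), round(alpha_max(step), 4))
# 0   0.0    0.99
# 9   0.9    0.99
# 99  0.99   0.99
# 999 0.99   0.999
```

So with `min` the teacher starts as a near-copy of the student (alpha near 0) and converges to 0.99, while with `max` the decay never drops below 0.99 and approaches 0.999 and beyond late in training, which matches the "0.99 early, 0.999 late" reading of the paper.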