BaoYu0721 opened 1 year ago
Thanks for raising this issue. It looks like you’re correct and we broke the implementation at some point.
One thing we really need to start doing (but haven’t been able to do due to manpower limitations) is build out a robust testing suite that verifies new major changes don’t break old features :S
Thanks for your reply! I checked out some other commits, such as the v2.0 release tag and the earlier commit where deepspeed_main was merged into main (2b84f9af10eebdb82dfb956adc2cb54ba2f62344), and found the plots are similar to the description above. Maybe the bug was introduced even earlier?
I was looking into the muP implementation in gpt-neox to contrast it with the Megatron-LM setup and accidentally found this issue :)
I am wondering, could the LR schedule be the cause of the problem? By design it overrides the LR values per group (hence overwrites the muP changes), which is why muP scaling was introduced in AnnealingLR() as a rescaling by group["width_mult"] in step() here. But I couldn't find this key being added either inside the mup optimisers or in the gpt-neox codebase, so I am not sure that the width_mult rescaling is applied at all.
Also, width_mult rescaling applies only to Adam-like optimisers and matrix-like params (as here), while for SGD the rescaling uses different multipliers, which should also be taken into account.
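To make that concrete, here is a minimal sketch of how step() could re-apply the rescaling after the schedule overwrites the per-group LRs. The "width_mult" key on the param groups is an assumption: it would have to be attached when the optimizer groups are built, which I couldn't find happening anywhere.

```python
def step(self, step_num=None):
    # Sketch only: re-apply the muP rescaling every time the schedule
    # overwrites the per-group learning rate.
    new_lr = self.get_lr()  # LR computed by the schedule for this step
    for group in self.optimizer.param_groups:
        lr = new_lr
        # Assumed key: would need to be stored on each param group when
        # the optimizer is created. Adam-like case: matrix-like params
        # get lr / width_mult; SGD needs different multipliers.
        if "width_mult" in group:
            lr = lr / group["width_mult"]
        group["lr"] = lr
```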
However, I also couldn't confirm whether the AnnealingLR schedule is actually applied during training, so it might well be that my comment isn't relevant to the observed behaviour.
I think you are right @ofivite, I don't think the implementation was ever correct, since the learning rate wasn't correctly set up from the beginning. After long and thorough debugging I managed to pass at least the 2 basic sanity checks for muP:
Issues I have found are:
Found another bug: neox_args.use_mup is set to false before initializing the models, which also sets their use_mup attribute to False and therefore the multipliers are always ignored.
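In pseudocode, the problematic ordering looks like this (get_model is a stand-in for the actual NeoX setup code):

```python
neox_args.use_mup = False        # forced off, e.g. while computing base shapes
model = get_model(neox_args)     # each module caches self.use_mup = False here
neox_args.use_mup = True         # restored too late: the modules keep the
                                 # cached False and skip every muP multiplier
```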
And finally, there seems to be a bug in the re-initialization of the output layer; after skipping that completely (it should anyway be in the flavour of Table 8) I'm getting these very nice and smooth horizontal lines
@marcobellagente93 Oh yes, now the curves are indeed nicely flat, great! :)
I'll make a PR as soon as I can
Edit: email formatting did not work properly.
Hi, I’m interested to see the PR as well. When I originally made my PR, the curves looked as expected—flat for mup and blown up for SP.
The use_mup parameter is set to false so the weights never get initialized
Are you referring to it being false by default? When calculating the base shapes using mup, two models have to be instantiated, one with use_mup set to false and the other set to true. If I remember correctly, I set use_mup to false by default everywhere and enabled it only when mup was set to true in the config file. Then, when calculating base shapes it’s forced to false.
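For reference, the generic base-shape workflow from the mup library looks roughly like this; build_model is a stand-in for the actual NeoX constructor, and the NeoX integration wires the equivalent steps into its own setup:

```python
from mup import make_base_shapes, set_base_shapes

# Two models that differ only in width let mup infer which dimensions scale.
base = build_model(hidden_size=256)
delta = build_model(hidden_size=512)
make_base_shapes(base, delta, savefile="base_shapes.bsh")

# The model actually being trained then loads the saved base shapes
# before the muP (re)initialization of its weights.
model = build_model(hidden_size=2048)
set_base_shapes(model, "base_shapes.bsh")
```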
The learning rate is not scaled by width_mult
I saw some code being linked to in this thread that’s from the original mup repo from Microsoft. I had to bring in and modify a lot of it. I believe width_mult for the learning rate may have been set there.
What I mean is that the main training loop with mup enabled does the following:
but at step 1 all parameters get initialized with self.use_mup = neox_args.use_mup (which is false), and that causes everything else to be wrong (multipliers not used, 1/d attention not used, ...)
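As a concrete example of what silently degrades, here is an illustrative (not the NeoX code) version of the attention scaling:

```python
import math
import torch

def attention_scores(q, k, d_head, use_mup):
    # muP prescribes 1/d attention scaling instead of the standard
    # 1/sqrt(d); with a stale cached use_mup = False, the model
    # silently falls back to the SP behaviour.
    scale = d_head if use_mup else math.sqrt(d_head)
    return torch.matmul(q, k.transpose(-2, -1)) / scale
```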
This behavior is expected; the weights are reinitialized using
Or do you mean this function does not get called?
Bug Description & To Reproduce
The source code is from the current main branch, and I followed the instructions in README-MUP.md until this step: I encounter an error like this:

This is caused by passing a Bool Tensor into the get_stat of mup (maybe the attention mask), which the mup library cannot handle. In addition, we also encounter an error caused by passing None to get_stat. In order to solve this problem temporarily, I modified the source code in the file coord_check.py in mup.
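The change amounts to a guard along these lines (a sketch only; safe_get_stat is an illustrative name, not mup's actual internals):

```python
import torch

def safe_get_stat(x, stat):
    # Skip values mup's coord_check stats cannot handle: None outputs
    # and non-floating-point tensors such as Bool attention masks.
    if x is None:
        return None
    if torch.is_tensor(x) and not torch.is_floating_point(x):
        return None
    return stat(x)
```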
This time, coord_check ran successfully. It outputs many jpgs, one for each GPU; the jpgs from different GPUs look very similar, so I just show one jpg for each parameterization.

Standard Parameterization:
muP Parameterization:
The result looks weird: SP is more horizontal than muP, which is not expected.
Expected behavior
https://github.com/microsoft/mup#coord-check
The expected behavior should look like the plots in the above link, in which muP is very horizontal while SP blows up.
Proposed solution
I checked the code related to mup, but I don't have a proposal yet; I will keep checking it. Maybe contributors on the issue (https://github.com/EleutherAI/gpt-neox/issues/679) can give some comments? @nsarka @Quentin-Anthony @StellaAthena Thanks a lot!

Environment (please complete the following information):
I added the mup-related configs into configs/local_setup.yml, keeping them completely identical to the instructions in README-MUP.md.