rtmadduri opened this issue 1 month ago
Is this during eval? Can you provide a minimum repro?
I don't have a repro yet. I am in the process of porting this to work on AMD/ROCm, and I ran into this issue.
I only run into this error when running the dmoe or moe scripts. The MegaBlocks exp/gpt scripts all work fine. Here is the log from running exp/gpt2_46m_8gpu.sh:

```
iteration 1100/ 100000 | consumed samples: 563200 | elapsed time per iteration (ms): 365.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 5.032094E+00 | loss scale: 1.0 | grad norm: 1.338 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 1200/ 100000 | consumed samples: 614400 | elapsed time per iteration (ms): 276.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 4.857907E+00 | loss scale: 1.0 | grad norm: 0.746 | number of skipped iterations: 0 | number of nan iterations: 0 |
```
Which makes sense, because the error can be traced back to moe.py.
@mvpatel2000 to answer your question, yes this looks like it happens during eval. I set the eval-interval to 500 and ran into this after 500 iters.
Ah, this is because the load balancing loss (LBL) isn't stored during eval. You should set the model to eval mode. We should give a friendlier error... CC: @eitanturok
Ohh Ok. What changes do I make to the training script? Right now I just run exp/moe/moe_46m_8gpu.sh with the default args and some changes for dataset and stuff.
Before eval, you'll need to call `model.eval()`. @eitanturok can you look at tweaking the scripts?
@mvpatel2000 can you point out where this change has to be made? I tried playing around with the scripts but could not figure it out. Thanks!
@eitanturok maybe?
This might have to happen in third_party/Megatron-LM/pretrain_gpt.py, which is the script being called...
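A hedged sketch of that tweak, assuming it lands around the evaluation loop (the `ModelStub` and `evaluate` here are illustrative, not Megatron-LM's actual code): toggle eval mode on before validation and restore training mode after.

```python
# Illustrative only: a stub model with a torch-like training flag, and an
# evaluate() that brackets validation with eval()/train(), as suggested above.
class ModelStub:
    def __init__(self):
        self.training = True  # mirrors torch.nn.Module.training

    def eval(self):
        self.training = False

    def train(self):
        self.training = True

def evaluate(model, num_iters):
    model.eval()                # MoE layers stop expecting LBL records
    for _ in range(num_iters):
        pass                    # run validation forward passes here
    model.train()               # restore training mode afterwards

model = ModelStub()
evaluate(model, 3)
print(model.training)  # True: training resumes normally after eval
```

The key point is the bracketing: forgetting the final `model.train()` would silently disable LBL recording for the rest of training.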
```
iteration 1000/ 20000 | consumed samples: 512000 | elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | global batch size: 512 | load balancing loss: 9.743530E-02 | lm loss: 5.181638E+00 | loss scale: 32768.0 | grad norm: 1.003 | number of skipped iterations: 0 | number of nan iterations: 0 |
```
The training begins, and after running for 1000 iterations I get:
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/megablocks-0.5.1-py3.9-linux-x86_64.egg/megablocks/layers/moe.py", line 37, in batched_load_balancing_loss
    tokens_per_expert, expert_scores = zip(*get_load_balancing_loss())
ValueError: not enough values to unpack (expected 2, got 0)
```
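The unpack failure itself is easy to reproduce in isolation: when no losses were saved, `get_load_balancing_loss()` returns an empty list, and `zip(*[])` yields nothing to unpack into two names (a minimal stand-alone repro, no MegaBlocks needed):

```python
# zip(*records) over an empty list produces an empty iterator, so
# unpacking it into two names raises exactly the error from the traceback.
records = []  # what get_load_balancing_loss() returns when nothing was saved
try:
    tokens_per_expert, expert_scores = zip(*records)
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 0)
```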