databricks / megablocks


Running into ValueError when running moe/dmoe scripts #134

Open rtmadduri opened 1 month ago

rtmadduri commented 1 month ago

iteration 1000/ 20000 | consumed samples: 512000 | elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | global batch size: 512 | load balancing loss: 9.743530E-02 | lm loss: 5.181638E+00 | loss scale: 32768.0 | grad norm: 1.003 | number of skipped iterations: 0 | number of nan iterations: 0 |

The training begins and after running for 1000 iterations, I get:

File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/megablocks-0.5.1-py3.9-linux-x86_64.egg/megablocks/layers/moe.py", line 37, in batched_load_balancing_loss ValueError: not enough values to unpack (expected 2, got 0) tokens_per_expert, expert_scores = zip(*get_load_balancing_loss()) ValueError: not enough values to unpack (expected 2, got 0)

mvpatel2000 commented 1 month ago

Is this during eval? Can you provide a minimal repro?

rtmadduri commented 1 month ago

I don't have a repro yet. I am in the process of porting this to work on AMD/ROCm, and I run into this issue there.

I only run into this error when running the dmoe or moe scripts. The Megablocks exp/gpt scripts all work fine. Here is the log from running exp/gpt2_46m_8gpu.sh:


validation loss at iteration 1000 | lm loss value: 5.058095E+00 | lm loss PPL: 1.572906E+02 |

iteration 1100/ 100000 | consumed samples: 563200 | elapsed time per iteration (ms): 365.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 5.032094E+00 | loss scale: 1.0 | grad norm: 1.338 | number of skipped iterations: 0 | number of nan iterations: 0 |

iteration 1200/ 100000 | consumed samples: 614400 | elapsed time per iteration (ms): 276.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 4.857907E+00 | loss scale: 1.0 | grad norm: 0.746 | number of skipped iterations: 0 | number of nan iterations: 0 |

Which makes sense, because the error traces back to moe.py.

rtmadduri commented 1 month ago

@mvpatel2000 To answer your question: yes, this looks like it happens during eval. I set eval-interval to 500 and ran into this after 500 iters.

mvpatel2000 commented 1 month ago

Ah, this is because you don't store the load balancing loss (LBL) during eval. You should set the model to eval mode. We should give a friendlier error... CC: @eitanturok
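To spell out the mechanism: the MoE layers append (tokens_per_expert, expert_scores) to a module-level buffer only while the model is in training mode, and batched_load_balancing_loss later unpacks that buffer with zip(*...). If every forward ran without that bookkeeping, the buffer is empty and there is nothing to unpack, which is exactly the ValueError above. Here is a self-contained toy that just mirrors that pattern (it is a sketch, not the actual megablocks implementation):

```python
import torch
import torch.nn as nn

# Toy stand-in for the buffer that the MoE layers fill during training.
_LOAD_BALANCING_LOSS = []

class ToyMoE(nn.Module):
    def forward(self, x):
        expert_scores = torch.softmax(torch.randn(x.shape[0], 4), dim=-1)
        tokens_per_expert = torch.randint(0, 8, (4,))
        if self.training:
            # Only training-mode forwards record anything.
            _LOAD_BALANCING_LOSS.append((tokens_per_expert, expert_scores))
        return x

def toy_batched_load_balancing_loss():
    # Mirrors moe.py line 37: zip(*buffer) has nothing to unpack when the
    # buffer is empty, i.e. when no forward pass recorded its stats.
    tokens_per_expert, expert_scores = zip(*_LOAD_BALANCING_LOSS)
    return sum(s.mean() for s in expert_scores)  # placeholder aggregation

model = ToyMoE()

model.train()
model(torch.randn(8, 16))
print(toy_batched_load_balancing_loss())  # works: buffer has one entry
_LOAD_BALANCING_LOSS.clear()

model.eval()
model(torch.randn(8, 16))
# toy_batched_load_balancing_loss()  # raises: not enough values to unpack (expected 2, got 0)
```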

rtmadduri commented 1 month ago

Ohh, OK. What changes do I need to make to the training script? Right now I just run exp/moe/moe_46m_8gpu.sh with the default args, plus some changes for the dataset.

mvpatel2000 commented 1 month ago

Before eval, you'll need to call model.eval(). @eitanturok, can you look at tweaking the scripts?
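Roughly the shape of the change, written as a generic Megatron-style evaluation helper. The real evaluate()/forward-step functions live in the Megatron-LM fork, so the names and signatures below are just placeholders:

```python
import torch

def evaluate(model, eval_iterator, forward_step, num_eval_iters):
    # Placeholder helper: put the model in eval mode so the MoE layers skip
    # the load-balancing bookkeeping, run the eval iterations without grad,
    # then restore training mode before the training loop resumes.
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for _ in range(num_eval_iters):
            total_loss += forward_step(eval_iterator, model).item()
    model.train()
    return total_loss / num_eval_iters
```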

rtmadduri commented 2 weeks ago

@mvpatel2000 Can you point out where this change has to be made? I tried playing around with the scripts but could not figure it out. Thanks!

rtmadduri commented 2 weeks ago

@eitanturok maybe?

mvpatel2000 commented 2 weeks ago

This might have to happen in third_party/Megatron-LM/pretrain_gpt.py, which is the script being called...
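If it helps while digging through that script, here is one defensive sketch (an assumption about where the pieces are wired, not necessarily the fix the maintainers intend): only add the load balancing term when the model is actually in training mode, since eval-mode forwards never populate the buffer that batched_load_balancing_loss unpacks. The loss_func name and its arguments are hypothetical, and the exact signature of batched_load_balancing_loss may differ between megablocks versions.

```python
from megablocks.layers import moe

def loss_func(model, lm_loss, args):
    # Hypothetical wrapper around the script's existing LM loss computation.
    if model.training and moe.get_load_balancing_loss():
        # The buffer is non-empty only when the forward passes ran in
        # training mode and recorded (tokens_per_expert, expert_scores).
        return lm_loss + moe.batched_load_balancing_loss(args)
    return lm_loss
```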