databricks / megablocks


Running into ValueError when running moe/dmoe scripts #134

Open rtmadduri opened 1 month ago

rtmadduri commented 1 month ago

iteration 1000/ 20000 | consumed samples: 512000 | elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | global batch size: 512 | load balancing loss: 9.743530E-02 | lm loss: 5.181638E+00 | loss scale: 32768.0 | grad norm: 1.003 | number of skipped iterations: 0 | number of nan iterations: 0 |

The training begins and after running for 1000 iterations, I get:

File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/megablocks-0.5.1-py3.9-linux-x86_64.egg/megablocks/layers/moe.py", line 37, in batched_load_balancing_loss ValueError: not enough values to unpack (expected 2, got 0) tokens_per_expert, expert_scores = zip(*get_load_balancing_loss()) ValueError: not enough values to unpack (expected 2, got 0)

mvpatel2000 commented 1 month ago

Is this during eval? Can you provide a minimal repro?

rtmadduri commented 1 month ago

I don't have a repro yet. I am in the process of porting this to work on AMD/ROCm, and I run into this issue there.

I only run into this error when running the dmoe or moe scripts. The Megablocks exp/gpt scripts all work fine. Here is the log from running exp/gpt2_46m_8gpu.sh:


validation loss at iteration 1000 | lm loss value: 5.058095E+00 | lm loss PPL: 1.572906E+02 |

iteration 1100/ 100000 | consumed samples: 563200 | elapsed time per iteration (ms): 365.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 5.032094E+00 | loss scale: 1.0 | grad norm: 1.338 | number of skipped iterations: 0 | number of nan iterations: 0 |

iteration 1200/ 100000 | consumed samples: 614400 | elapsed time per iteration (ms): 276.4 | learning rate: 6.000E-04 | global batch size: 512 | lm loss: 4.857907E+00 | loss scale: 1.0 | grad norm: 0.746 | number of skipped iterations: 0 | number of nan iterations: 0 |

Which makes sense, because the error traces back to moe.py.

rtmadduri commented 1 month ago

@mvpatel2000 To answer your question: yes, this looks like it happens during eval. I set eval-interval to 500 and ran into this after 500 iters.

mvpatel2000 commented 1 month ago

Ah, this is because you don't store the load balancing loss (LBL) during eval. You should set the model to eval mode. We should give a friendlier error... CC: @eitanturok
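To spell out the mechanism: the MoE layers append (tokens_per_expert, expert_scores) to a module-level buffer only while the model is in training mode, and batched_load_balancing_loss later unpacks that buffer with zip(*...). If every forward ran without that bookkeeping, the buffer is empty and there is nothing to unpack, which is exactly the ValueError above. Here is a self-contained toy that just mirrors that pattern (it is a sketch, not the actual megablocks implementation):

```python
import torch
import torch.nn as nn

# Toy stand-in for the buffer that the MoE layers fill during training.
_LOAD_BALANCING_LOSS = []

class ToyMoE(nn.Module):
    def forward(self, x):
        expert_scores = torch.softmax(torch.randn(x.shape[0], 4), dim=-1)
        tokens_per_expert = torch.randint(0, 8, (4,))
        if self.training:
            # Only training-mode forwards record anything.
            _LOAD_BALANCING_LOSS.append((tokens_per_expert, expert_scores))
        return x

def toy_batched_load_balancing_loss():
    # Mirrors moe.py line 37: zip(*buffer) has nothing to unpack when the
    # buffer is empty, i.e. when no forward pass recorded its stats.
    tokens_per_expert, expert_scores = zip(*_LOAD_BALANCING_LOSS)
    return sum(s.mean() for s in expert_scores)  # placeholder aggregation

model = ToyMoE()

model.train()
model(torch.randn(8, 16))
print(toy_batched_load_balancing_loss())  # works: buffer has one entry
_LOAD_BALANCING_LOSS.clear()

model.eval()
model(torch.randn(8, 16))
# toy_batched_load_balancing_loss()  # raises: not enough values to unpack (expected 2, got 0)
```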

rtmadduri commented 1 month ago

Ohh, OK. What changes do I need to make to the training script? Right now I just run exp/moe/moe_46m_8gpu.sh with the default args, plus some changes for the dataset.

mvpatel2000 commented 1 month ago

Before eval, you'll need to call model.eval(). @eitanturok, can you look at tweaking the scripts?
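Roughly the shape of the change, written as a generic Megatron-style evaluation helper. The real evaluate()/forward-step functions live in the Megatron-LM fork, so the names and signatures below are just placeholders:

```python
import torch

def evaluate(model, eval_iterator, forward_step, num_eval_iters):
    # Placeholder helper: put the model in eval mode so the MoE layers skip
    # the load-balancing bookkeeping, run the eval iterations without grad,
    # then restore training mode before the training loop resumes.
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for _ in range(num_eval_iters):
            total_loss += forward_step(eval_iterator, model).item()
    model.train()
    return total_loss / num_eval_iters
```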

rtmadduri commented 2 weeks ago

@mvpatel2000 Can you point out where this change has to be made? I tried playing around with the scripts but could not figure it out. Thanks!

rtmadduri commented 2 weeks ago

@eitanturok maybe?

mvpatel2000 commented 2 weeks ago

This might have to happen in third_party/Megatron-LM/pretrain_gpt.py, which is the script being called...
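If it helps while digging through that script, here is one defensive sketch (an assumption about where the pieces are wired, not necessarily the fix the maintainers intend): only add the load balancing term when the model is actually in training mode, since eval-mode forwards never populate the buffer that batched_load_balancing_loss unpacks. The loss_func name and its arguments are hypothetical, and the exact signature of batched_load_balancing_loss may differ between megablocks versions.

```python
from megablocks.layers import moe

def loss_func(model, lm_loss, args):
    # Hypothetical wrapper around the script's existing LM loss computation.
    if model.training and moe.get_load_balancing_loss():
        # The buffer is non-empty only when the forward passes ran in
        # training mode and recorded (tokens_per_expert, expert_scores).
        return lm_loss + moe.batched_load_balancing_loss(args)
    return lm_loss
```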