databricks / megablocks

_LOAD_BALANCING_LOSS returns empty list sometimes #113

Open exnx opened 1 month ago

exnx commented 1 month ago

Hello, I am using EleutherAI's gpt-neox implementation with MegaBlocks, but I am hitting two errors related to _LOAD_BALANCING_LOSS.

  1. The tokens_per_expert check gives me this error at this line: ValueError: Expected 14 token_per_experts but found 7. Here's the stack trace:
  File "/home/etnguyen/test/savanna/train.py", line 10, in <module>                                                                                                                             [53/1963]
    pretrain(global_config=global_config)                                                                                                                                                                
  File "/home/etnguyen/test/savanna/savanna/training.py", line 228, in pretrain                                                                                                                          
    iteration = train(                                                                                                                                                                                   
                ^^^^^^                                                                                                                                                                                   
  File "/home/etnguyen/test/savanna/savanna/training.py", line 1004, in train                                                                                                                            
    loss_dict, skipped_iter = train_step(                                                                                                                                                                
                              ^^^^^^^^^^^                                                                                                                                                                
  File "/home/etnguyen/test/savanna/savanna/training.py", line 919, in train_step                                                                                                                        
    loss = forward_step(                                                                                                                                                                                 
           ^^^^^^^^^^^^^                                                                                                                                                                                 
  File "/home/etnguyen/test/savanna/savanna/training.py", line 515, in forward_step                                                                                                                      
    moe_loss = mb_moe_loss_func(global_config, loss_mask, outputs)[0]                                                                                                                                    
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                       
  File "/home/etnguyen/test/savanna/savanna/training.py", line 464, in mb_moe_loss_func                                                                                                                  
    lbl = moe.batched_load_balancing_loss(megablocks_args)                                                                                                                                               
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                               
  File "/home/etnguyen/.local/lib/python3.11/site-packages/megablocks/layers/moe.py", line 43, in batched_load_balancing_loss                                                                            
    raise ValueError(                                                                                                                                                                                    
ValueError: Expected 14 token_per_experts but found 7.                                                                                                                                                   
num_layers = 14                                                                                                                                                                                          
pipeline_model_parallel_size = 1                                                                                                                                                                         
num_layers_per_virtual_pipeline_stage = None  

I get this error when expert_interval=2, i.e. the default value, so the number of MoE layers is half the number of transformer layers (14 layers, 7 MegaBlocks layers used). The error goes away when I set expert_interval=1 so that there are 14 MegaBlocks layers for 14 transformer layers. But I don't know the root cause of the discrepancy, especially since I'd like to keep expert_interval=2 and use MegaBlocks on every other layer (a rough sketch of what I think the bookkeeping looks like is below, after the second issue).

  2. The second issue: say I work around the problem above by setting expert_interval=1 so that every layer uses MegaBlocks. The next error I get is that get_load_balancing_loss occasionally returns an empty list, i.e. _LOAD_BALANCING_LOSS is empty, which then raises an error. Critically, this happens partway through training, around 30 seconds in, so some batches are fine and return the expected losses.
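
Going back to the first issue, here is a minimal sketch of what I think the bookkeeping looks like. This is simplified pseudocode, not the actual megablocks source; the names mirror megablocks/layers/moe.py but the bodies are my approximation:

```python
# Simplified approximation of the accumulator behind the error (not the
# actual megablocks source). Each MoE layer appends one entry per forward
# pass; the batched loss then expects exactly num_layers entries.
_LOAD_BALANCING_LOSS = []

def save_load_balancing_loss(loss_pair):
    # called once per MoE layer per forward pass with (tokens_per_expert, scores)
    _LOAD_BALANCING_LOSS.append(loss_pair)

def batched_load_balancing_loss(args):
    # with pipeline parallelism the expected count is divided across stages
    expected = args.num_layers // args.pipeline_model_parallel_size
    found = len(_LOAD_BALANCING_LOSS)
    if found != expected:
        raise ValueError(
            f"Expected {expected} token_per_experts but found {found}.")
    # ... combine the saved entries into the auxiliary loss ...

# With 14 transformer layers and expert_interval=2, only 7 layers ever call
# save_load_balancing_loss(), so if the Arguments passed to
# batched_load_balancing_loss carry num_layers=14 the check fails. Passing
# the MoE layer count (7) instead would keep the two numbers consistent.
```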

Does this sound familiar to anybody? I'd very much appreciate any insights, thank you!

mvpatel2000 commented 1 month ago
  1. Can you double-check that num_layers is appropriately passed in Arguments to dmoe/moe when expert_interval=2? It should be equal to 7.
  2. Hm... I'm a little less sure since I'm not as familiar with Eleuther's harness. You can check out an example with LLMFoundry here if that's helpful. I'm not sure what's going on here... it should save every time you do a forward pass 🤔. Could you ensure the model is in train mode? If the model is in eval mode, it won't save the LBL loss, and then it will error when you try to backprop it.
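
For reference, here's a rough sketch of the per-step pattern I mean. The function and variable names (train_step, loss_fn, megablocks_args) are approximations, not LLMFoundry's or gpt-neox's actual code:

```python
from megablocks.layers import moe

def train_step(model, batch, loss_fn, megablocks_args):
    # MoE layers only save their (tokens_per_expert, scores) entries when the
    # model is training, so the batched load-balancing loss is only valid here.
    model.train()
    output = model(batch)
    loss = loss_fn(output, batch)

    if megablocks_args.moe_loss_weight > 0:
        # raises if the per-layer entries are missing, e.g. the forward pass
        # ran in eval mode or num_layers doesn't match the number of MoE layers
        loss = loss + moe.batched_load_balancing_loss(megablocks_args)

    loss.backward()

    # drop this step's accumulated entries so nothing stale or partial leaks
    # into the next step's forward pass
    moe.clear_load_balancing_loss()
    return loss
```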