databricks / megablocks

_LOAD_BALANCING_LOSS returns empty list sometimes #113

Open exnx opened 1 month ago

exnx commented 1 month ago

Hello, I am using EleutherAI's gpt-neox implementation with MegaBlocks, but I am hitting two errors related to _LOAD_BALANCING_LOSS.

  1. The tokens_per_expert check gives me this error at this line: ValueError: Expected 14 token_per_experts but found 7. Here's the stack trace:
  File "/home/etnguyen/test/savanna/train.py", line 10, in <module>                                                                                                                             [53/1963]
    pretrain(global_config=global_config)                                                                                                                                                                
  File "/home/etnguyen/test/savanna/savanna/training.py", line 228, in pretrain                                                                                                                          
    iteration = train(                                                                                                                                                                                   
                ^^^^^^                                                                                                                                                                                   
  File "/home/etnguyen/test/savanna/savanna/training.py", line 1004, in train                                                                                                                            
    loss_dict, skipped_iter = train_step(                                                                                                                                                                
                              ^^^^^^^^^^^                                                                                                                                                                
  File "/home/etnguyen/test/savanna/savanna/training.py", line 919, in train_step                                                                                                                        
    loss = forward_step(                                                                                                                                                                                 
           ^^^^^^^^^^^^^                                                                                                                                                                                 
  File "/home/etnguyen/test/savanna/savanna/training.py", line 515, in forward_step                                                                                                                      
    moe_loss = mb_moe_loss_func(global_config, loss_mask, outputs)[0]                                                                                                                                    
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                       
  File "/home/etnguyen/test/savanna/savanna/training.py", line 464, in mb_moe_loss_func                                                                                                                  
    lbl = moe.batched_load_balancing_loss(megablocks_args)                                                                                                                                               
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                               
  File "/home/etnguyen/.local/lib/python3.11/site-packages/megablocks/layers/moe.py", line 43, in batched_load_balancing_loss                                                                            
    raise ValueError(                                                                                                                                                                                    
ValueError: Expected 14 token_per_experts but found 7.                                                                                                                                                   
num_layers = 14                                                                                                                                                                                          
pipeline_model_parallel_size = 1                                                                                                                                                                         
num_layers_per_virtual_pipeline_stage = None  

I get this error when expert_interval=2, i.e. the default value, so the number of MoE layers is half the number of transformer layers (14 layers, 7 MegaBlocks layers used). The error goes away when I set expert_interval=1 so that there are 14 MegaBlocks layers for 14 transformer layers. But I don't know the root cause of the discrepancy, especially since I'd like to keep expert_interval=2 and use MegaBlocks on every other layer (a rough sketch of what I think the bookkeeping looks like is below, after the second issue).

  2. The second issue: say I work around the problem above by setting expert_interval=1 so that every layer uses MegaBlocks. The next error I get is that get_load_balancing_loss occasionally returns an empty list, i.e. _LOAD_BALANCING_LOSS is empty, which then raises an error. Critically, this happens partway through training, around 30 seconds in, so some batches are fine and return the expected losses.
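
Going back to the first issue, here is a minimal sketch of what I think the bookkeeping looks like. This is simplified pseudocode, not the actual megablocks source; the names mirror megablocks/layers/moe.py but the bodies are my approximation:

```python
# Simplified approximation of the accumulator behind the error (not the
# actual megablocks source). Each MoE layer appends one entry per forward
# pass; the batched loss then expects exactly num_layers entries.
_LOAD_BALANCING_LOSS = []

def save_load_balancing_loss(loss_pair):
    # called once per MoE layer per forward pass with (tokens_per_expert, scores)
    _LOAD_BALANCING_LOSS.append(loss_pair)

def batched_load_balancing_loss(args):
    # with pipeline parallelism the expected count is divided across stages
    expected = args.num_layers // args.pipeline_model_parallel_size
    found = len(_LOAD_BALANCING_LOSS)
    if found != expected:
        raise ValueError(
            f"Expected {expected} token_per_experts but found {found}.")
    # ... combine the saved entries into the auxiliary loss ...

# With 14 transformer layers and expert_interval=2, only 7 layers ever call
# save_load_balancing_loss(), so if the Arguments passed to
# batched_load_balancing_loss carry num_layers=14 the check fails. Passing
# the MoE layer count (7) instead would keep the two numbers consistent.
```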

Does this sound familiar to anybody? I'd very much appreciate any insights, thank you!

mvpatel2000 commented 1 month ago
  1. Can you double-check that num_layers is appropriately passed in Arguments to dmoe/moe when expert_interval=2? It should be equal to 7.
  2. Hm... I'm a little less sure since I'm not as familiar with Eleuther's harness. You can check out an example with LLMFoundry here if that's helpful. I'm not sure what's going on here... it should save every time you do a forward pass 🤔. Could you ensure the model is in train mode? If the model is in eval mode, it won't save the LBL loss, and then it will error when you try to backprop it.
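
For reference, here's a rough sketch of the per-step pattern I mean. The function and variable names (train_step, loss_fn, megablocks_args) are approximations, not LLMFoundry's or gpt-neox's actual code:

```python
from megablocks.layers import moe

def train_step(model, batch, loss_fn, megablocks_args):
    # MoE layers only save their (tokens_per_expert, scores) entries when the
    # model is training, so the batched load-balancing loss is only valid here.
    model.train()
    output = model(batch)
    loss = loss_fn(output, batch)

    if megablocks_args.moe_loss_weight > 0:
        # raises if the per-layer entries are missing, e.g. the forward pass
        # ran in eval mode or num_layers doesn't match the number of MoE layers
        loss = loss + moe.batched_load_balancing_loss(megablocks_args)

    loss.backward()

    # drop this step's accumulated entries so nothing stale or partial leaks
    # into the next step's forward pass
    moe.clear_load_balancing_loss()
    return loss
```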