databricks / megablocks


Bad throughput with GLU #110

Open Muennighoff opened 1 month ago

Muennighoff commented 1 month ago

I'm training models with the specs below but seeing a major throughput drop when switching to GLU. Do you know why, or have ideas about what I could investigate? Thanks a lot! cc @mvpatel2000 @tgale96

active params: 1,011,613,696 (for glu: 1,280,049,152)
total params: 4,769,710,080 (for glu: 6,917,193,728)
8 H100s, 1 node
FSDP SHARD_GRAD_OP
mlp_impl=grouped
n_experts=8
k=1
micro_bs=1
global_bs=512
no megablocks expert/weight parallelism

With mlp_type=mlp & activation_fn=gelu I get 17000 tokens per second per device.

With mlp_type=glu & activation_fn=silu I get 1000 tokens per second per device.

A small drop is expected since GLU adds some parameters, but probably not one this large? Switching away from grouped and trying the memory-optimized MLP did not help. 🤔
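For reference, the two setups differ only in the MLP block. A minimal sketch of how they might map onto megablocks arguments, assuming the Arguments dataclass fields are named as below (field names are from memory and should be checked against the installed version; model dimensions omitted):

```python
# Hedged sketch: field names (moe_num_experts, moe_top_k, mlp_impl, mlp_type,
# activation_fn) follow my understanding of megablocks.layers.arguments.Arguments;
# verify against your installed megablocks version.
import torch.nn.functional as F
from megablocks.layers.arguments import Arguments

common = dict(
    moe_num_experts=8,   # n_experts=8
    moe_top_k=1,         # k=1
    mlp_impl="grouped",  # mlp_impl=grouped
)

# Baseline: ~17000 tokens/sec/device
mlp_args = Arguments(mlp_type="mlp", activation_fn=F.gelu, **common)

# Slow: ~1000 tokens/sec/device
glu_args = Arguments(mlp_type="glu", activation_fn=F.silu, **common)
```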

mvpatel2000 commented 1 month ago

@Muennighoff what is your memory usage at? My guess is that your memory allocator is thrashing -- this is a common problem when running close to the memory limit with dropless MoEs, and it shows up as a steep degradation in performance rather than an OOM.

To verify this: if you are using Composer, you can add the MemoryMonitor callback and watch the alloc_retries count; if it spikes, that's bad. If you have your own training library, you can use https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html#torch-cuda-memory-stats and look at num_alloc_retries
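If you roll your own loop, a minimal sketch of that check (the function name and logging interval are just placeholders):

```python
import torch

def log_alloc_retries(step: int, device: int = 0) -> None:
    """Log CUDA caching-allocator retries; a count that keeps growing means
    the allocator is thrashing (freeing and re-requesting segments) rather
    than failing outright with an OOM."""
    stats = torch.cuda.memory_stats(device)
    retries = stats["num_alloc_retries"]
    allocated_gb = torch.cuda.memory_allocated(device) / 1e9
    reserved_gb = torch.cuda.memory_reserved(device) / 1e9
    print(
        f"step={step} alloc_retries={retries} "
        f"allocated={allocated_gb:.1f}GB reserved={reserved_gb:.1f}GB"
    )

# e.g. in the training loop:
# if step % 100 == 0:
#     log_alloc_retries(step)
```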

You can also check this by halving your model size on the same GPU count (say, cutting n_layers in half) and seeing whether throughput recovers.