databricks / megablocks


Bad throughput with GLU #110

Open Muennighoff opened 1 month ago

Muennighoff commented 1 month ago

I'm training models with the specs below but seeing a major throughput drop when switching to GLU. Do you know why, or have ideas about what I could investigate? Thanks a lot! cc @mvpatel2000 @tgale96

active params: 1,011,613,696 (for glu: 1,280,049,152)
total params: 4,769,710,080 (for glu: 6,917,193,728)
8 H100s, 1 node
FSDP SHARD_GRAD_OP
mlp_impl=grouped
n_experts=8
k=1
micro_bs=1
global_bs=512
no megablocks expert/weight parallelism

With mlp_type=mlp & activation_fn=gelu I get 17000 tokens per second per device.

With mlp_type=glu & activation_fn=silu I get 1000 tokens per second per device.

A small drop is expected since GLU adds some parameters, but probably not one this large? Switching away from grouped and trying the memory-optimized MLP did not help. 🤔
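For reference, the two setups differ only in the MLP block. A minimal sketch of how they might map onto megablocks arguments, assuming the Arguments dataclass fields are named as below (field names are from memory and should be checked against the installed version; model dimensions omitted):

```python
# Hedged sketch: field names (moe_num_experts, moe_top_k, mlp_impl, mlp_type,
# activation_fn) follow my understanding of megablocks.layers.arguments.Arguments;
# verify against your installed megablocks version.
import torch.nn.functional as F
from megablocks.layers.arguments import Arguments

common = dict(
    moe_num_experts=8,   # n_experts=8
    moe_top_k=1,         # k=1
    mlp_impl="grouped",  # mlp_impl=grouped
)

# Baseline: ~17000 tokens/sec/device
mlp_args = Arguments(mlp_type="mlp", activation_fn=F.gelu, **common)

# Slow: ~1000 tokens/sec/device
glu_args = Arguments(mlp_type="glu", activation_fn=F.silu, **common)
```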

mvpatel2000 commented 1 month ago

@Muennighoff what is your memory usage at? My guess is that your memory allocator is thrashing -- this is a common problem when running close to the memory limit with dropless MoEs, and it shows up as a steep degradation in performance rather than an OOM.

To verify this: if you are using Composer, you can add the MemoryMonitor callback and watch the alloc_retries count; if it spikes, that's bad. If you have your own training library, you can use https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html#torch-cuda-memory-stats and look at num_alloc_retries
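If you roll your own loop, a minimal sketch of that check (the function name and logging interval are just placeholders):

```python
import torch

def log_alloc_retries(step: int, device: int = 0) -> None:
    """Log CUDA caching-allocator retries; a count that keeps growing means
    the allocator is thrashing (freeing and re-requesting segments) rather
    than failing outright with an OOM."""
    stats = torch.cuda.memory_stats(device)
    retries = stats["num_alloc_retries"]
    allocated_gb = torch.cuda.memory_allocated(device) / 1e9
    reserved_gb = torch.cuda.memory_reserved(device) / 1e9
    print(
        f"step={step} alloc_retries={retries} "
        f"allocated={allocated_gb:.1f}GB reserved={reserved_gb:.1f}GB"
    )

# e.g. in the training loop:
# if step % 100 == 0:
#     log_alloc_retries(step)
```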

You can also check this by halving your model size on the same GPU count (say, cutting n_layers in half) and seeing whether throughput recovers.