Muennighoff opened this issue 1 month ago
@Muennighoff what is your memory usage at? I would guess your memory allocator is thrashing -- this is a common problem close to the memory limit when using dropless MoEs, and it leads to a steep degradation in performance (as opposed to an OOM).
To verify this, if you are using composer, you can add the MemoryMonitor callback and watch the alloc_retries count. If it spikes, that's bad. If you have your own training library, you can use https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html#torch-cuda-memory-stats and look at num_alloc_retries.
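For the non-composer path, a minimal sketch of what that check might look like (the helper name and call frequency are illustrative; the stat keys come from `torch.cuda.memory_stats`):

```python
import torch

def check_alloc_retries(device: int = 0) -> int:
    # torch.cuda.memory_stats returns a dict of allocator statistics;
    # a growing num_alloc_retries means the caching allocator is repeatedly
    # failing its first allocation attempt and retrying (thrashing),
    # even though you never hit an outright OOM.
    stats = torch.cuda.memory_stats(device)
    retries = stats.get("num_alloc_retries", 0)
    allocated_gib = stats.get("allocated_bytes.all.current", 0) / 2**30
    print(f"device {device}: num_alloc_retries={retries}, allocated={allocated_gib:.1f} GiB")
    return retries

# e.g. call this every N training steps and alarm if the count keeps climbing
```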
You can also verify this by cutting your model size in half on the same GPU count (say, halve n_layers) and checking whether the slowdown disappears.
I'm training models with the specs below but seeing a major throughput drop when switching to GLU. Do you know why, or have ideas about what I could investigate? Thanks a lot! cc @mvpatel2000 @tgale96
With `mlp_type=mlp` & `activation_fn=gelu` I get 17000 tokens per second per device.

With `mlp_type=glu` & `activation_fn=silu` I get 1000 tokens per second per device.

A small drop is expected since GLU adds slightly more params, but probably not this large? Switching away from grouped or trying the memory-optimized MLP did not help. 🤔