Closed imoneoi closed 6 months ago
It's slower than sparse, especially at high expert-to-GPU ratios. However, it's a useful debugging / benchmarking tool.
Again, it depends on the expert count; sparse generally performs better at higher expert counts.
@mvpatel2000 Thanks for your reply! If expert count = 8 and top k = 2, which implementation is faster: the torch MLP, grouped MLP, or sparse one?
I've seen a torch MLP branch of megablocks without sparse matrix multiplication here. Curious whether it's as efficient as the sparse version.
Also, are there any performance comparisons between grouped GEMM and sparse GEMM?
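For context, here is a minimal NumPy sketch of what a dense "torch MLP"-style MoE forward looks like: one ordinary dense GEMM per expert over the tokens routed to it, instead of a single block-sparse or grouped GEMM across all experts. All names, shapes, and the random router here are illustrative assumptions, not megablocks' actual API.

```python
import numpy as np

def dense_moe_forward(x, w1, w2, top_k=2, rng=None):
    """Dense per-expert MoE MLP (hypothetical sketch, not megablocks code).

    x:  (tokens, d_model)
    w1: (experts, d_model, d_ff)
    w2: (experts, d_ff, d_model)
    """
    rng = rng or np.random.default_rng(0)
    n_tokens, _ = x.shape
    n_experts = w1.shape[0]
    # Stand-in router: random logits instead of a learned gating network.
    logits = rng.standard_normal((n_tokens, n_experts))
    top = np.argsort(-logits, axis=1)[:, :top_k]       # top-k expert ids per token
    weights = np.full((n_tokens, top_k), 1.0 / top_k)  # uniform combine weights
    out = np.zeros_like(x)
    for e in range(n_experts):                         # one dense GEMM per expert
        mask = (top == e)                              # (tokens, top_k) membership
        rows = mask.any(axis=1)                        # tokens routed to expert e
        if not rows.any():
            continue
        h = np.maximum(x[rows] @ w1[e], 0.0)           # expert MLP with ReLU
        out[rows] += weights[mask][:, None] * (h @ w2[e])
    return out

# Toy sizes: 16 tokens, d_model=8, d_ff=32, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
w1 = rng.standard_normal((8, 8, 32)) * 0.1
w2 = rng.standard_normal((8, 32, 8)) * 0.1
out = dense_moe_forward(x, w1, w2, top_k=2, rng=rng)
```

The loop over experts is why this variant tends to lose at higher expert counts: each iteration launches a small GEMM, whereas grouped or sparse kernels batch all expert computations into one launch.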