Closed imoneoi closed 6 months ago
It's slower than sparse, especially at high expert-to-GPU ratios. However, it's a useful debugging / benchmarking tool.
Again, it depends on the expert count; sparse generally performs better at higher expert counts.
@mvpatel2000 Thanks for your reply! If expert count = 8 and top k = 2, which implementation is faster: the torch MLP, grouped MLP, or sparse one?
I've seen a torch MLP branch of megablocks without sparse matrix multiplication here. Curious whether it's as efficient as the sparse version.
Also, are there any performance comparisons between grouped GEMM and sparse GEMM?
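For context, here is a minimal NumPy sketch of what a dense "torch MLP"-style MoE forward looks like: one ordinary dense GEMM per expert over the tokens routed to it, instead of a single block-sparse or grouped GEMM across all experts. All names, shapes, and the random router here are illustrative assumptions, not megablocks' actual API.

```python
import numpy as np

def dense_moe_forward(x, w1, w2, top_k=2, rng=None):
    """Dense per-expert MoE MLP (hypothetical sketch, not megablocks code).

    x:  (tokens, d_model)
    w1: (experts, d_model, d_ff)
    w2: (experts, d_ff, d_model)
    """
    rng = rng or np.random.default_rng(0)
    n_tokens, _ = x.shape
    n_experts = w1.shape[0]
    # Stand-in router: random logits instead of a learned gating network.
    logits = rng.standard_normal((n_tokens, n_experts))
    top = np.argsort(-logits, axis=1)[:, :top_k]       # top-k expert ids per token
    weights = np.full((n_tokens, top_k), 1.0 / top_k)  # uniform combine weights
    out = np.zeros_like(x)
    for e in range(n_experts):                         # one dense GEMM per expert
        mask = (top == e)                              # (tokens, top_k) membership
        rows = mask.any(axis=1)                        # tokens routed to expert e
        if not rows.any():
            continue
        h = np.maximum(x[rows] @ w1[e], 0.0)           # expert MLP with ReLU
        out[rows] += weights[mask][:, None] * (h @ w2[e])
    return out

# Toy sizes: 16 tokens, d_model=8, d_ff=32, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
w1 = rng.standard_normal((8, 8, 32)) * 0.1
w2 = rng.standard_normal((8, 32, 8)) * 0.1
out = dense_moe_forward(x, w1, w2, top_k=2, rng=rng)
```

The loop over experts is why this variant tends to lose at higher expert counts: each iteration launches a small GEMM, whereas grouped or sparse kernels batch all expert computations into one launch.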