databricks / megablocks

Apache License 2.0

Computation distribution with expert parallelism #100

Closed opherlieber closed 6 months ago

opherlieber commented 6 months ago

Hi, how are computation and weights sharded when using expert parallelism with dMoE? Does each expert-parallel rank compute only its own `num_experts / expert_parallelism` specific experts, or does each rank compute `1 / expert_parallelism` of the work for all experts? For example, in the extreme case where expert parallelism equals the number of experts and all tokens are routed to a single expert, does all computation happen unevenly on one device, or is it somehow sharded across all expert-parallel ranks? If it's the latter, is there somewhere I can find additional details on how this works? If it's the former, would there be any advantage to using dMoE over the base implementation when there is one expert per rank?

Thanks

mvpatel2000 commented 6 months ago

With expert parallelism, each rank computes only a fraction of the experts. For example, with 16 experts and an expert-parallel world size of 8, each rank would compute 2 experts. If all tokens are routed to 1 expert, the routing is indeed uneven and one GPU would receive all of the work.
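A minimal sketch of this sharding scheme (illustrative only; the function names and the contiguous expert-to-rank assignment are assumptions, not MegaBlocks' actual API):

```python
def experts_for_rank(rank, num_experts, expert_world_size):
    """Each rank owns a contiguous slice of num_experts / expert_world_size experts."""
    assert num_experts % expert_world_size == 0
    per_rank = num_experts // expert_world_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))


def tokens_per_rank(expert_assignments, num_experts, expert_world_size):
    """Count how many routed tokens land on each expert-parallel rank."""
    per_rank = num_experts // expert_world_size
    counts = [0] * expert_world_size
    for expert in expert_assignments:
        counts[expert // per_rank] += 1
    return counts


# 16 experts over an expert-parallel world size of 8 -> 2 experts per rank.
print(experts_for_rank(0, 16, 8))  # [0, 1]
print(experts_for_rank(7, 16, 8))  # [14, 15]

# Degenerate routing: all 8 tokens sent to expert 0 -> all work lands on rank 0.
print(tokens_per_rank([0] * 8, 16, 8))  # [8, 0, 0, 0, 0, 0, 0, 0]
```

This is the load-imbalance scenario from the question: with this layout, a skewed router concentrates all tokens on the rank that owns the hot expert, which is why auxiliary load-balancing losses are commonly used with MoE models.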