databricks / megablocks

Apache License 2.0

Computation distribution with expert parallelism #100

Closed opherlieber closed 6 months ago

opherlieber commented 6 months ago

Hi, how are computation and weights sharded when using expert parallelism with dMoE? Does each expert-parallel rank compute only its own `num_experts / expert_parallelism` specific experts, or does each rank compute `1 / expert_parallelism` of the work for all experts? For example, in the extreme case where expert parallelism equals the number of experts and all tokens are routed to a single expert, does all computation happen unevenly on one device, or is it somehow sharded across all expert-parallel ranks? If it's the latter, is there somewhere I can find additional details on how this works? If it's the former, would there be any advantage to using dMoE over the base implementation when there is one expert per rank?

Thanks

mvpatel2000 commented 6 months ago

With expert parallelism, each rank computes only a fraction of the experts. For example, with 16 experts and an expert-parallel world size of 8, each rank would compute 2 experts. If all tokens are routed to 1 expert, the routing is indeed uneven and one GPU would receive all of the work.
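A minimal sketch of this sharding scheme (illustrative only; the function names and the contiguous expert-to-rank assignment are assumptions, not MegaBlocks' actual API):

```python
def experts_for_rank(rank, num_experts, expert_world_size):
    """Each rank owns a contiguous slice of num_experts / expert_world_size experts."""
    assert num_experts % expert_world_size == 0
    per_rank = num_experts // expert_world_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))


def tokens_per_rank(expert_assignments, num_experts, expert_world_size):
    """Count how many routed tokens land on each expert-parallel rank."""
    per_rank = num_experts // expert_world_size
    counts = [0] * expert_world_size
    for expert in expert_assignments:
        counts[expert // per_rank] += 1
    return counts


# 16 experts over an expert-parallel world size of 8 -> 2 experts per rank.
print(experts_for_rank(0, 16, 8))  # [0, 1]
print(experts_for_rank(7, 16, 8))  # [14, 15]

# Degenerate routing: all 8 tokens sent to expert 0 -> all work lands on rank 0.
print(tokens_per_rank([0] * 8, 16, 8))  # [8, 0, 0, 0, 0, 0, 0, 0]
```

This is the load-imbalance scenario from the question: with this layout, a skewed router concentrates all tokens on the rank that owns the hot expert, which is why auxiliary load-balancing losses are commonly used with MoE models.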