databricks / megablocks


Why not support tensor model parallel? #40

Closed Richie-yan closed 9 months ago

Richie-yan commented 9 months ago

After looking at the code, it seems neither moe nor dmoe supports tensor model parallelism. @tgale96

Richie-yan commented 9 months ago

Does args.moe_weight_parallelism correspond to tensor model parallelism?

tgale96 commented 9 months ago

Hi! The weight parallelism argument turns on sharded data parallelism. If you set the expert parallelism arguments such that there is <1 expert per device, we'll use tensor parallelism on top of expert parallelism. I hope this helps! Let me know if there are other features you're looking for!
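For context, a rough sketch of what that configuration could look like. The field and class names below follow my reading of megablocks.layers.arguments.Arguments and megablocks.layers.dmoe.dMoE, so treat them as assumptions and check them against the version you have installed.

```python
# Illustrative sketch only: argument names are assumptions based on
# megablocks.layers.arguments.Arguments; verify against your installed version.
import torch
import torch.distributed as dist

from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

dist.init_process_group(backend="nccl")
expert_group = dist.new_group(list(range(dist.get_world_size())))

args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=4,                  # fewer experts than ranks (e.g. 8 GPUs) => <1 expert per device
    moe_top_k=1,
    moe_expert_model_parallelism=True,  # expert (model) parallelism across expert_parallel_group
    expert_parallel_group=expert_group,
    # moe_weight_parallelism=True,      # separate knob: sharded data parallelism for the weights
)
layer = dMoE(args)
```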

Richie-yan commented 9 months ago

@tgale96 Thanks for your reply, that makes sense. Does the current code already support using tensor parallelism when there is <1 expert per device?

Richie-yan commented 9 months ago

The expert_sharding_degree shards the expert tensors when the number of experts is less than expert_parallel_world_size. So the current code already seems to support tensor parallelism. Is my understanding correct? @tgale96 For concreteness, here is a toy illustration of the sharding degree I have in mind (my own pseudocode, not the MegaBlocks implementation), shown below.
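```python
# Toy illustration only (not MegaBlocks code): how an intra-expert sharding
# degree could be derived when there are fewer experts than expert-parallel ranks.
def expert_sharding_degree(num_experts: int, expert_parallel_world_size: int) -> int:
    """How many ranks each expert's weights are split across (1 = no intra-expert split)."""
    if num_experts >= expert_parallel_world_size:
        return 1  # at least one whole expert per rank, nothing to shard
    # Fewer experts than ranks: split each expert over the surplus ranks.
    assert expert_parallel_world_size % num_experts == 0
    return expert_parallel_world_size // num_experts

print(expert_sharding_degree(num_experts=8, expert_parallel_world_size=8))  # 1: one expert per rank
print(expert_sharding_degree(num_experts=4, expert_parallel_world_size=8))  # 2: each expert split over 2 ranks
```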

Richie-yan commented 9 months ago

If I don't want to use expert parallelism and instead want to use tensor parallelism directly, that should be theoretically possible, even though the current code doesn't support it, right?

tgale96 commented 9 months ago

The sharding that happens when the expert sharding degree is less than the expert parallel world size is expert model parallel sharding. If you don't want to use expert model parallelism and want tensor parallelism instead, that would be a feature we'd have to implement separately. There is no theoretical limitation to having something like Megatron-style tensor model parallelism.
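For reference, "Megatron-style" here means splitting each FFN the way Megatron-LM splits its MLPs: the first projection column-wise, the second row-wise, with an all-reduce on the output. A standalone sketch in plain PyTorch (names are illustrative, not MegaBlocks APIs):

```python
# Minimal sketch of Megatron-style tensor model parallelism for a single FFN/expert.
# Not MegaBlocks code; class and argument names here are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelFFN(nn.Module):
    """y = W2 @ gelu(W1 @ x): W1 is column-sharded, W2 is row-sharded across ranks."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(tp_group)
        assert ffn_hidden_size % tp_size == 0
        local_ffn = ffn_hidden_size // tp_size
        # Column-parallel: each rank owns a slice of the FFN's intermediate features.
        self.w1 = nn.Linear(hidden_size, local_ffn, bias=False)
        # Row-parallel: each rank owns the matching slice of the second projection's input.
        self.w2 = nn.Linear(local_ffn, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank computes a partial output; summing the partials gives the full result.
        partial = self.w2(F.gelu(self.w1(x)))
        dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=self.tp_group)
        return partial
```

The same splitting could in principle be applied per expert, but as noted above that is not something the current code implements.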

Richie-yan commented 9 months ago

Thanks for your reply