aliyun / SimAI

Unsupported weird communication type? #9

Open maoshunyu opened 2 weeks ago

maoshunyu commented 2 weeks ago

An MoE workload generated by AICB with the following command cannot be parsed:

sh scripts/megatron_gpt.sh \
--nnodes 1 --node_rank 0 --nproc_per_node 8 --master_addr localhost --master_port 29500 \
-m moe --world_size 8 --tensor_model_parallel_size 4 --pipeline_model_parallel 1 \
--moe_enable --expert_model_parallel_size 1 \
--frame Megatron --global_batch 16 \
--num_experts 4 --moe_router_topk 2 \
--micro_batch 1 --sp --grouped_gemm --aiob_enable --swiglu --use_flash_attn

I suspect this is because the workload contains the ALLTOALL_EP communication type, which Astrasim cannot parse. What is the difference between ALLTOALL and ALLTOALL_EP, and how can I fix this?

Huoyuan100861 commented 2 weeks ago

ALLTOALL is the AlltoAll operation within a TP group, while ALLTOALL_EP is the AlltoAll operation within an EP group. Currently, only SimAI-Analytical can parse ALLTOALL_EP, so you can try running this workload with SimAI-Analytical instead.
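To check up front whether a generated workload contains ALLTOALL_EP, a quick scan of the workload text may help. The following is only a minimal sketch: it assumes the workload is a plain-text file in which communication types appear as standalone tokens, and the function name and the SUPPORTED set are hypothetical, not part of AICB or SimAI.

```python
import re
from collections import Counter

# Communication types named in this thread.
COMM_TYPES = {"ALLTOALL", "ALLTOALL_EP"}

# Assumption: types the target simulator can parse. Per the reply above,
# ALLTOALL_EP is only handled by SimAI-Analytical, so it is left out here.
SUPPORTED = {"ALLTOALL"}


def count_comm_types(workload_text, names=COMM_TYPES):
    """Count whole-token occurrences of each name in the workload text."""
    # \w+ includes underscores, so ALLTOALL_EP is one token and is
    # never counted as a plain ALLTOALL.
    tokens = re.findall(r"\w+", workload_text)
    return Counter(tok for tok in tokens if tok in names)


def unsupported_types(workload_text, supported=SUPPORTED):
    """Return the set of known types present in the text but not supported."""
    return set(count_comm_types(workload_text)) - set(supported)


if __name__ == "__main__":
    # Made-up sample lines for illustration only; not real AICB output.
    sample = "layer_a ALLTOALL 1024 ALLTOALL_EP 2048\nlayer_b ALLTOALL_EP 4096\n"
    print(count_comm_types(sample))
    print(unsupported_types(sample))
```

If unsupported_types returns a non-empty set for your workload file, that would be consistent with the parse failure described above, and SimAI-Analytical would be the tool to try.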