Unsupported weird communication type?

aliyun / SimAI

Apache License 2.0

The MoE workload generated by AICB with the following command cannot be parsed:
sh scripts/megatron_gpt.sh \
--nnodes 1 --node_rank 0 --nproc_per_node 8 --master_addr localhost --master_port 29500 \
-m moe --world_size 8 --tensor_model_parallel_size 4 --pipeline_model_parallel 1 \
--moe_enable --expert_model_parallel_size 1 \
--frame Megatron --global_batch 16 \
--num_experts 4 --moe_router_topk 2 \
--micro_batch 1 --sp --grouped_gemm --aiob_enable --swiglu --use_flash_attn
I suspect this is because the workload contains the ALLTOALL_EP communication type, which cannot be parsed in ASTRA-sim. What is the difference between ALLTOALL and ALLTOALL_EP, and how can I fix this?

ALLTOALL refers to the AlltoAll operation within a TP (tensor parallel) group, while ALLTOALL_EP denotes the AlltoAll operation within an EP (expert parallel) group. Currently, only SimAI-Analytical can parse ALLTOALL_EP, so you can try running this workload with the SimAI-Analytical tool instead.
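
To check in advance whether a generated workload will hit this limitation, you could scan it for the ALLTOALL_EP token before choosing a simulator. The following is only a minimal sketch: it assumes the workload is a text file with whitespace-separated fields, and the sample lines and the helper name `count_ep_only_types` are hypothetical, not real AICB output or part of SimAI.

```python
from collections import Counter

# Per the reply above, ALLTOALL_EP is only parsed by SimAI-Analytical.
EP_ONLY_TYPES = {"ALLTOALL_EP"}


def count_ep_only_types(workload_text):
    """Count whitespace-separated tokens that name an EP-only communication type."""
    counts = Counter()
    for line in workload_text.splitlines():
        for token in line.split():
            # Exact match, so plain ALLTOALL (TP group) is not counted.
            if token in EP_ONLY_TYPES:
                counts[token] += 1
    return counts


# Hypothetical workload lines, for illustration only (not real AICB output).
sample = """\
layer_a 1 ALLTOALL 1024 NONE 0
layer_b 1 ALLTOALL_EP 2048 NONE 0
layer_c 1 ALLTOALL_EP 2048 ALLREDUCE 512
"""

counts = count_ep_only_types(sample)
if counts:
    print("Needs SimAI-Analytical:", dict(counts))
```

For a real workload, read the file contents with `open(path).read()` and pass the text to the same function. If the actual file uses a different delimiter, the tokenizing step would need to change accordingly.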

Unsupported weird communication type? #9