Open maoshunyu opened 2 weeks ago
The MoE workload generated by AICB using the following command cannot be parsed:
sh scripts/megatron_gpt.sh \
  --nnodes 1 --node_rank 0 --nproc_per_node 8 --master_addr localhost --master_port 29500 \
  -m moe --world_size 8 --tensor_model_parallel_size 4 --pipeline_model_parallel 1 \
  --moe_enable --expert_model_parallel_size 1 \
  --frame Megatron --global_batch 16 \
  --num_experts 4 --moe_router_topk 2 \
  --micro_batch 1 --sp --grouped_gemm --aiob_enable --swiglu --use_flash_attn
I suspect this is because the workload contains the ALLTOALL_EP communication type, which cannot be parsed in Astrasim. So what's the difference between ALLTOALL and ALLTOALL_EP? How can I fix it?
ALLTOALL refers to the AlltoAll operation within a TP (tensor parallel) group, while ALLTOALL_EP denotes the AlltoAll operation within an EP (expert parallel) group. Currently, only SimAI-Analytical supports parsing ALLTOALL_EP, so you can try running this workload with SimAI-Analytical instead.
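To confirm that ALLTOALL_EP is what trips the parser, it may help to check which AlltoAll variants a generated workload actually contains. The following is only a rough sketch: it assumes the workload is a plain-text file with whitespace-separated fields, the helper name count_comm_types is hypothetical, and the sample lines are made up for illustration rather than taken from real AICB output.

```python
from collections import Counter


def count_comm_types(lines, names=("ALLTOALL", "ALLTOALL_EP")):
    """Count exact-token occurrences of the given communication type names.

    Assumes whitespace-separated fields; this is an assumption about the
    workload text, not a documented format.
    """
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Exact match, so ALLTOALL is not counted inside ALLTOALL_EP.
            if token in names:
                counts[token] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "layer_a 0 ALLTOALL 1024",
    "layer_b 0 ALLTOALL_EP 2048",
    "layer_c 0 ALLREDUCE 4096",
]

counts = count_comm_types(sample)
print(dict(counts))
```

To scan a real file, pass an open file object instead of the sample list, for example count_comm_types(open(path)) with your own workload path. Matching whole tokens rather than substrings matters here, because a plain substring search for ALLTOALL would also match every ALLTOALL_EP entry.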