Closed · KaiLv16 closed this issue 3 months ago
The currently supported expert parallelism adopts the implementation from Megatron-Core. Megatron-Core enforces that "When using expert parallelism and tensor parallelism, sequence parallelism must be used"; see https://github.com/NVIDIA/Megatron-LM/blob/6dbe4cf699880038b1e5cd90b23ee71053c7f2ee/megatron/core/model_parallel_config.py#L333. We will continue to update and enhance expert parallelism support in the future.
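For reference, the linked constraint can be paraphrased as a small validation function. This is a simplified sketch, not the actual Megatron-Core source; the parameter names only mirror Megatron's config fields:

```python
def validate_parallel_config(expert_model_parallel_size: int,
                             tensor_model_parallel_size: int,
                             sequence_parallel: bool) -> None:
    # Simplified paraphrase of the Megatron-Core check linked above,
    # not the exact upstream code.
    if expert_model_parallel_size > 1 and tensor_model_parallel_size > 1:
        if not sequence_parallel:
            raise ValueError(
                "When using expert parallelism and tensor parallelism, "
                "sequence parallelism must be used"
            )

# EP + TP without sequence parallelism is rejected:
try:
    validate_parallel_config(2, 2, sequence_parallel=False)
except ValueError as e:
    print(e)

# EP + TP with sequence parallelism is accepted:
validate_parallel_config(2, 2, sequence_parallel=True)
```

Note that the check only fires when both expert and tensor parallelism are actually in use; either one alone does not require sequence parallelism.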
Apologies for the small error in the README.
The script that currently supports workload generation for MoE and expert parallelism is scripts/megatron_gpt.sh, and for SimAI workloads it is scripts/megatron_workload_with_aiob.sh. At the moment, only the alltoall token dispatcher is supported for MoE. The v1.0 release still lacks some pieces needed to run MoE-related workloads on physical machines, but these gaps will be addressed in the upcoming version!
Hi, I also met the same problem. Is there any working example that I can test SimAI? Thanks!
It seems there is a bug in utils.py. I think it would help to edit the code here: https://github.com/aliyun/aicb/blob/cd91399267252cd8cd18bb185c1980606bf0c014/utils/utils.py#L357-L359 by guarding the assertion with `if args.moe_enabled:`:

```python
if args.moe_enabled:
    assert (
        args.moe_enabled and args.enable_sequence_parallel
    ), "moe must be enabled with sequence parallel"
```
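A quick way to sanity-check the guarded assertion outside AICB (the `Namespace` below is a stand-in for AICB's parsed args; the attribute names are taken from the snippet above):

```python
from argparse import Namespace

def check_moe_config(args) -> None:
    # Only enforce the sequence-parallel requirement when MoE is enabled.
    if getattr(args, "moe_enabled", False):
        assert args.enable_sequence_parallel, \
            "moe must be enabled with sequence parallel"

# MoE disabled: the assertion is skipped entirely, so non-MoE
# runs without sequence parallelism no longer fail.
check_moe_config(Namespace(moe_enabled=False, enable_sequence_parallel=False))

# MoE enabled with sequence parallelism: passes.
check_moe_config(Namespace(moe_enabled=True, enable_sequence_parallel=True))
```

With the guard in place, only the MoE-enabled, sequence-parallel-disabled combination raises an `AssertionError`.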
Thank you for pointing out the issue. I will review it and make the necessary fixes in the upcoming version.
> it seems like some bug in utils.py I think it will be helpful to edit the code here
>
> add the line `if args.moe_enabled:` before the assertion:
>
> ```python
> if args.moe_enabled:
>     assert (
>         args.moe_enabled and args.enable_sequence_parallel
>     ), "moe must be enabled with sequence parallel"
> ```
That works. Nice job!
Hi,
I was trying to reproduce the results in the Generate Workloads for Simulation (SimAI) section. I ran the recommended commands. However, I encountered the following exception:
Do you have any ideas on how I can fix this?
In addition, I ran another command provided in the docs, but it could not find the file 'scripts/workload_moe.sh'.
Could this be because 'workload_moe.sh' is missing from the repository?
Thank you!