Prior to this fix, running "bash run_moe_train.sh 1 4 16384" would fail with the error below: DeepSpeed's MoE layer requires the world size to be evenly divisible by the expert-parallel size, and launching with a world size of 4 while the script requested ep_size=8 trips that assertion on every rank. This commit fixes the problem.
AssertionError: 4 is not divisible by 8
No existing process group found, creating a new group named: ep_size_8
Traceback (most recent call last):
  File "./user/train.py", line 681, in <module>
    cli_main()
  File "./user/train.py", line 674, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "./user/train.py", line 263, in main
    tmp_module, _, _, _ = deepspeed.initialize(args=ds_args,
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/__init__.py", line 124, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 288, in __init__
    self._configure_distributed_model(model)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1074, in _configure_distributed_model
    module.set_deepspeed_parallelism()
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/moe/layer.py", line 88, in set_deepspeed_parallelism
    self._create_process_groups()
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/moe/layer.py", line 99, in _create_process_groups
    groups._create_expert_and_data_parallel(self.ep_size)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/utils/groups.py", line 130, in _create_expert_and_data_parallel
    _ensure_divisibility(world_size, expert_parallel_size_)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/utils/groups.py", line 54, in _ensure_divisibility
    assert numerator % denominator == 0, '{} is not divisible by {}'.format(
AssertionError: 4 is not divisible by 8
[2022-10-17 22:55:32,327] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.2, git-hash=unknown, git-branch=unknown
No existing process group found, creating a new group named: ep_size_8
[2022-10-17 22:55:32,341] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.2, git-hash=unknown, git-branch=unknown
No existing process group found, creating a new group named: ep_size_8
[2022-10-17 22:55:32,347] [INFO] [logging.py:68:log_dist] [Rank 0] Creating expert and data parallel groups with size 8
(identical "AssertionError: 4 is not divisible by 8" tracebacks from the remaining ranks omitted)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51588) of binary: /home/ubuntu/anaconda3/envs/proxy-rnnt/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./user/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-10-17_22:55:33
host : ip-172-31-36-159.ec2.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 51589)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2022-10-17_22:55:33
host : ip-172-31-36-159.ec2.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 51590)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2022-10-17_22:55:33
host : ip-172-31-36-159.ec2.internal
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 51591)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-10-17_22:55:33
host : ip-172-31-36-159.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 51588)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
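The failing check reduces to a simple divisibility requirement: DeepSpeed's _ensure_divisibility asserts world_size % ep_size == 0 before creating expert-parallel groups. As an illustration only (the helper name is hypothetical and this is one possible remedy, not necessarily what this commit does), a launcher can cap the requested expert-parallel size so the assertion can never fire:

```python
def choose_ep_size(requested_ep_size: int, world_size: int) -> int:
    """Pick the largest expert-parallel size <= requested_ep_size
    that evenly divides world_size.

    DeepSpeed asserts world_size % ep_size == 0; with world_size=4
    and ep_size=8 (as in the log above) that assertion fires.
    Hypothetical sketch, not the actual patch in this commit.
    """
    ep_size = min(requested_ep_size, world_size)
    # Shrink until it divides evenly, e.g. 8 -> 4 when world_size=4.
    while world_size % ep_size != 0:
        ep_size -= 1
    return ep_size

# The failing configuration from the log: 4 ranks, ep_size=8.
print(choose_ep_size(8, 4))   # -> 4
print(choose_ep_size(8, 16))  # -> 8 (already divides evenly)
```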