facebookresearch / FAMBench

Benchmarks to capture important workloads.
Apache License 2.0

Fix MOE bash vars EP_WORLD_SIZE & NUM_EXPERTS for under 8 GPUs #103

Closed samiwilf closed 1 year ago

samiwilf commented 1 year ago

Prior to this fix, running `bash run_moe_train.sh 1 4 16384` on a machine with fewer than 8 GPUs failed with the error below: the script left EP_WORLD_SIZE and NUM_EXPERTS fixed at 8, while DeepSpeed asserts that the world size (4 here) is divisible by the expert-parallel size. This commit fixes the problem.

AssertionError: 4 is not divisible by 8
No existing process group found, creating a new group named: ep_size_8
Traceback (most recent call last):
  File "./user/train.py", line 681, in <module>
    cli_main()
  File "./user/train.py", line 674, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "./user/train.py", line 263, in main
    tmp_module, _, _, _ = deepspeed.initialize(args=ds_args,
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/__init__.py", line 124, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 288, in __init__
    self._configure_distributed_model(model)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1074, in _configure_distributed_model
    module.set_deepspeed_parallelism()
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/moe/layer.py", line 88, in set_deepspeed_parallelism
    self._create_process_groups()
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/moe/layer.py", line 99, in _create_process_groups
    groups._create_expert_and_data_parallel(self.ep_size)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/utils/groups.py", line 130, in _create_expert_and_data_parallel
    _ensure_divisibility(world_size, expert_parallel_size_)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/deepspeed/utils/groups.py", line 54, in _ensure_divisibility
    assert numerator % denominator == 0, '{} is not divisible by {}'.format(
AssertionError: 4 is not divisible by 8
[2022-10-17 22:55:32,327] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.2, git-hash=unknown, git-branch=unknown
No existing process group found, creating a new group named: ep_size_8
[2022-10-17 22:55:32,341] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.2, git-hash=unknown, git-branch=unknown
No existing process group found, creating a new group named: ep_size_8
[2022-10-17 22:55:32,347] [INFO] [logging.py:68:log_dist] [Rank 0] Creating expert and data parallel groups with size 8
[... identical traceback and AssertionError repeated on the remaining ranks ...]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51588) of binary: /home/ubuntu/anaconda3/envs/proxy-rnnt/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/proxy-rnnt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./user/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-10-17_22:55:33
  host      : ip-172-31-36-159.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 51589)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-10-17_22:55:33
  host      : ip-172-31-36-159.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 51590)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-10-17_22:55:33
  host      : ip-172-31-36-159.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 51591)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-10-17_22:55:33
  host      : ip-172-31-36-159.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 51588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
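
The failing check is DeepSpeed's `_ensure_divisibility(world_size, expert_parallel_size_)`: with 4 GPUs the world size is 4, but the script's expert-parallel size was fixed at 8. A minimal sketch of the kind of guard the fix needs — cap EP_WORLD_SIZE (and NUM_EXPERTS) at the GPU count instead of hardcoding 8. Note this is an illustration, not the literal diff from the commit; `NUM_GPUS` is a hypothetical stand-in for however the script counts devices:

```shell
# Hypothetical sketch: derive the expert-parallel size from the GPU count
# so that world_size % EP_WORLD_SIZE == 0 holds on machines with < 8 GPUs.
NUM_GPUS=4                                   # e.g. a 4-GPU machine

# Use the smaller of NUM_GPUS and 8 as the expert-parallel world size.
EP_WORLD_SIZE=$(( NUM_GPUS < 8 ? NUM_GPUS : 8 ))
NUM_EXPERTS=$EP_WORLD_SIZE

echo "EP_WORLD_SIZE=$EP_WORLD_SIZE NUM_EXPERTS=$NUM_EXPERTS"
```

With 4 GPUs this yields EP_WORLD_SIZE=4, which divides the world size of 4, so DeepSpeed's `_ensure_divisibility` assertion passes; on an 8-GPU machine the original value of 8 is preserved.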