aliyun / aicb


Assertion Failed in AICB Tutorial Demo #18

Closed Nngxu closed 2 weeks ago

Nngxu commented 2 weeks ago

When I run the AICB Tutorial demo "Quick start for single-node execution":

export MASTER_ADDR=127.0.0.1
export MASTER_PORT=23089
export WORLD_SIZE=1
export RANK=0

sh ./scripts/megatron_gpt.sh \
-m 13 --world_size 8 --tensor_model_parallel_size 8 --pipeline_model_parallel 1 \
--frame Megatron --global_batch 2 \
--micro_batch 1 --seq_length 4096 \
--swiglu --use_flash_attn --aiob_enable

I get the following assertion failure:

Traceback (most recent call last):
  File "/workspace/AICBench/./aicb.py", line 29, in <module>
    args = get_args()
  File "/workspace/AICBench/utils/utils.py", line 609, in get_args
    ARGS = get_params()
  File "/workspace/AICBench/utils/utils.py", line 556, in get_params
    args.world_size % (args.tensor_model_parallel_size * args.pipeline_model_parallel) == 0
AssertionError: world size: 1, tp: 8, pp: 1
[2024-11-24 12:10:15,684] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 409) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./aicb.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-24_12:10:15
  host      : 004b83f919d2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 409)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

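This is caused by setting the environment variable WORLD_SIZE=1: a world size of 1 is not divisible by tp * pp = 8, so the assertion in get_params fails. WORLD_SIZE needs to be changed to 8 here.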
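For reference, the check that fails can be reproduced in isolation. The snippet below is only a minimal sketch modeled on the assertion shown in the traceback; `check_parallel_layout` is a hypothetical helper, not a function in AICB, and returning the implied data-parallel size is my own assumption about how the remaining ranks would be used.

```python
def check_parallel_layout(world_size: int, tp: int, pp: int) -> int:
    """Hypothetical stand-in for the divisibility check in get_params.

    Returns the implied data-parallel size (an assumption, not AICB code),
    or raises AssertionError with the same message format as the traceback.
    """
    model_parallel = tp * pp
    assert world_size % model_parallel == 0, (
        f"world size: {world_size}, tp: {tp}, pp: {pp}"
    )
    return world_size // model_parallel


if __name__ == "__main__":
    # Values from the failing run: world size 1, tp 8, pp 1.
    try:
        check_parallel_layout(world_size=1, tp=8, pp=1)
    except AssertionError as err:
        print("rejected:", err)

    # With a world size of 8 the same tp/pp settings pass the check.
    print("data-parallel size:", check_parallel_layout(world_size=8, tp=8, pp=1))
```

With the values from the failing run, 1 % 8 is non-zero and the check is rejected; with a world size of 8 it passes.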

zhouheyang-alibaba commented 2 weeks ago

This demo relies mainly on AICB's physical flow generation capability, which is aimed at AI clusters, typically a single node with 8 GPUs. For that reason, we have not specifically handled the single-node, single-GPU case. We recommend running physical flow generation on an AI cluster.

Nngxu commented 2 weeks ago

I appreciate your response.