alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

Unsupported parallel mode in shard-only auto perf test: load_solution #961

Closed LeiWang1999 closed 1 year ago

LeiWang1999 commented 1 year ago

python3 benchmark.py --suite gpt.perf_test_auto --shard-only

Error message:

 python3 benchmark.py --suite gpt.perf_test_auto  --shard-only
2023-10-18 02:28:19,917 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379...
2023-10-18 02:28:19,934 INFO worker.py:1528 -- Connected to Ray cluster.
Working on case: BenchmarkCase(batch_size=1024, model_config=GPTModelConfig(seq_len=1024, hidden_size=2048, num_layers=24, num_heads=32, vocab_size=51200), num_micro_batches=128, parallel_mode='load_solution', parallel_args=LoadSolutionParallelArgs(prefer_reduce_scatter=True, use_remat=True, num_auto_layers=6, forward_stage_layer_ids=[[0, 1, 2], [3, 4, 5]], submesh_physical_shapes=[(1, 2), (1, 2)], submesh_logical_shapes=[(2, 1), (2, 1)], submesh_autosharding_option_dicts=[{'force_batch_dim_to_mesh_dim': 0}, {'force_batch_dim_to_mesh_dim': 0}]))
2023-10-18 02:28:24,970 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379...
2023-10-18 02:28:24,983 INFO worker.py:1528 -- Connected to Ray cluster.
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/v-leiwang3/alpa_workspace/alpa/benchmark/alpa/benchmark_one_case.py", line 144, in benchmark_and_write_to_namespace
    result = benchmark_one_case_internal(*args, **kwargs)
  File "/workspace/v-leiwang3/alpa_workspace/alpa/benchmark/alpa/benchmark_one_case.py", line 53, in benchmark_one_case_internal
    result = benchmark_gpt_bert_2d_internal(
  File "/workspace/v-leiwang3/alpa_workspace/alpa/benchmark/alpa/benchmark_one_case_gpt_bert.py", line 268, in benchmark_gpt_bert_2d_internal
    method, grad_func = get_shard_parallel_method(benchmark_case, physical_mesh)
  File "/workspace/v-leiwang3/alpa_workspace/alpa/benchmark/alpa/benchmark_parallel_utils.py", line 182, in get_shard_parallel_method
    raise ValueError(f"Unsupported parallel mode: {parallel_mode}")
ValueError: Unsupported parallel mode: load_solution

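It looks like the shard-only benchmark only supports ShardParallelArgs and UniformParallelArgs.

However, the GPT suite only generates its cases through get_search_cases and get_solution_case, and the perf_test_auto case above uses parallel_mode='load_solution' with LoadSolutionParallelArgs, which get_shard_parallel_method rejects.

Are there any solutions for, or concerns about, this issue?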
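
To make the mismatch concrete, here is a minimal, self-contained sketch of the dispatch pattern visible in the traceback, plus a pre-flight filter that separates cases a shard-only run could accept from those it would reject. This is not Alpa's code: `BenchmarkCase` here is a local stand-in with the field names from the log, `dispatch_shard_only` and `split_by_supported_mode` are hypothetical helpers, and the mode strings in `ASSUMED_SHARD_ONLY_MODES` are placeholders that would need to be checked against benchmark_parallel_utils.py.

```python
from collections import namedtuple

# Local stand-in for the benchmark case; field names copied from the log output.
BenchmarkCase = namedtuple(
    "BenchmarkCase",
    ["batch_size", "model_config", "num_micro_batches",
     "parallel_mode", "parallel_args"],
)

# ASSUMPTION: placeholder mode strings for the shard-only path.
# The real accepted values must be read from benchmark_parallel_utils.py.
ASSUMED_SHARD_ONLY_MODES = frozenset({"search", "uniform"})


def dispatch_shard_only(case, supported_modes=ASSUMED_SHARD_ONLY_MODES):
    """Hypothetical dispatcher mirroring the failure seen in the traceback."""
    if case.parallel_mode not in supported_modes:
        raise ValueError(f"Unsupported parallel mode: {case.parallel_mode}")
    return case.parallel_mode


def split_by_supported_mode(cases, supported_modes=ASSUMED_SHARD_ONLY_MODES):
    """Partition cases into (runnable, skipped) before launching any of them."""
    runnable, skipped = [], []
    for case in cases:
        if case.parallel_mode in supported_modes:
            runnable.append(case)
        else:
            skipped.append(case)
    return runnable, skipped


if __name__ == "__main__":
    cases = [
        BenchmarkCase(1024, None, 128, "load_solution", None),
        BenchmarkCase(1024, None, 128, "uniform", None),
    ]
    runnable, skipped = split_by_supported_mode(cases)
    print("runnable:", [c.parallel_mode for c in runnable])
    print("skipped:", [c.parallel_mode for c in skipped])
```

A check like this only reports the incompatibility earlier; it does not make a load_solution case runnable in shard-only mode. Whether the suite should generate shard-compatible cases instead is the open question of this issue.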