I tried the HEAD commit (20debbe5f0ed4047d82ae615cb2c07b059498032), and the attribute error and segfault are now gone. Only the identical `Check failed` error lingers.
I notice that the result of compilation (`autosharding_option_dicts`) is different.
If you want to use advanced parallelization options, please refer to this OPT example https://github.com/alpa-projects/alpa/tree/main/examples/opt_finetune and this branch https://github.com/alpa-projects/alpa/pull/858
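For context, the kind of configuration that example wires up looks roughly like the sketch below. This is only an illustration assuming the public Alpa API (`alpa.init`, `PipeshardParallel`, `AutoLayerOption`, `alpa.parallelize`); the micro-batch count and layer number are placeholder values, not taken from this issue.

```python
import alpa

alpa.init(cluster="ray")  # Alpa launches its workers through Ray

# Placeholder numbers; the OPT example tunes these per model and cluster.
method = alpa.PipeshardParallel(
    num_micro_batches=16,
    layer_option=alpa.AutoLayerOption(layer_num=8),
    stage_option="auto",  # let Alpa search the stage assignment automatically
)

def train_step(state, batch):
    # Loss/gradient computation and optimizer update go here.
    return state

p_train_step = alpa.parallelize(train_step, method=method)
```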
Please describe the bug

I'm trying to use `PipeshardParallel` for the GPT2 example in `examples/gpt2` (20debbe5f0ed4047d82ae615cb2c07b059498032) with Alpa v0.2.2 inside a Docker container. I'm on an RHEL node with four NVIDIA A40 GPUs.

Compilation fails with:

Check failed: strategies->is_tuple || !strategies->leaf_vector.empty() %pad.38 = f16[8,512,2304]{2,1,0} pad(f16[8,512,768]{2,1,0} %reshape.1367, f16[] %constant.1168), padding=0_0x0_0x1536_0, metadata={op_name="parallelize(stage_0_1)/jit(main)/jit(merged)/jit(stage_0_1_compute2)/transpose(jvp(FlaxGPT2LMHeadModule))/transformer/h/11/attn/pad[padding_config=((0, 0, 0), (0, 0, 0), (1536, 0, 0))]" source_file="/opt/conda/envs/alpa/lib/python3.8/site-packages/transformers/models/gpt2/modeling_flax_gpt2.py" source_line=211} does not have any valid strategies.

I also hit:

AttributeError: module 'jaxlib.xla_extension' has no attribute 'nccl_create_communicators_no_stream'
Please describe the expected behavior
System information and environment
To Reproduce
Steps to reproduce the behavior:

1. Use the Docker image from `docker/coreweave/run_alpa_infiniband.Dockerfile`. All following commands are done inside the container.
2. `git clone --recursive https://github.com/alpa-projects/alpa.git`
3. `cd alpa/examples/gpt2`
4. Modify `run_clm_flax.py` so that it uses `PipeshardParallel` instead of `Zero2Parallel` (a sketch of this change is shown after this list).
5. `pip install transformers datasets` (transformers 4.25.1, datasets 2.8.0)
6. `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/`
7. `pip install tensorflow`
8. `mkdir norwegian-gpt2 && python train_tokenizer.py && python create_config.py`
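A minimal sketch of the change in step 4, assuming the public Alpa API (`Zero2Parallel`, `PipeshardParallel`, `alpa.parallelize`); the exact location and surrounding code in `run_clm_flax.py` are not reproduced here, and the micro-batch count is a placeholder:

```python
import alpa

def train_step(state, batch):
    # Placeholder body; in run_clm_flax.py this computes the loss,
    # the gradients, and the updated train state.
    return state

# Before (assumed baseline): data-parallel training with ZeRO-2 sharding.
#   method = alpa.Zero2Parallel()
# After: pipeline + intra-operator parallelism with automatic stage clustering.
method = alpa.PipeshardParallel(
    num_micro_batches=16,   # placeholder value, not from the issue
    stage_option="auto",    # matches the "Automatic stage clustering" output below
)

p_train_step = alpa.parallelize(train_step, method=method)
```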
Full error output (tqdm disabled)
```
INFO:__main__:***** Running training *****
INFO:__main__:  Num examples = 1966029
INFO:__main__:  Num Epochs = 20
INFO:__main__:  Batch size per device (w. accumulation) = 32
INFO:__main__:  Global train batch size (w. parallel & distributed) = 128
INFO:__main__:  Total optimization steps = 307180
Initial compilation. This might take some minutes...
-------------------- Automatic stage clustering --------------------
submesh_choices: ((1, 1), (1, 2), (1, 4))
- Profiling for submesh 2 (1, 4):
  - Generate all stage infos (Jaxpr -> HLO)
  - Compile all stages
(CompileWorker pid=73427) 2023-01-19 21:55:05.286251: F external/org_tensorflow/tensorflow/compiler/xla/service/spmd/auto_sharding.cc:1465] Check failed: strategies->is_tuple || !strategies->leaf_vector.empty() %pad.38 = f16[8,512,2304]{2,1,0} pad(f16[8,512,768]{2,1,0} %reshape.1367, f16[] %constant.1168), padding=0_0x0_0x1536_0, metadata={op_name="parallelize(stage_0_1)/jit(main)/jit(merged)/jit(stage_0_1_compute2)/transpose(jvp(FlaxGPT2LMHeadModule))/transformer/h/11/attn/pad[padding_config=((0, 0, 0), (0, 0, 0), (1536, 0, 0))]" source_file="/opt/conda/envs/alpa/lib/python3.8/site-packages/transformers/models/gpt2/modeling_flax_gpt2.py" source_line=211} does not have any valid strategies.
(CompileWorker pid=73427) *** SIGABRT received at time=1674165305 on cpu 46 ***
(CompileWorker pid=73427) PC: @ 0x7f698dbd000b  (unknown)  raise
(CompileWorker pid=73427)     @ 0x7f698deed420  537164224  (unknown)
(CompileWorker pid=73427)     @ 0x7f42c9fb207d  10592  xla::spmd::BuildStrategyAndCost()
(CompileWorker pid=73427)     @ 0x7f42cb6ce3b4  2368  xla::spmd::AutoSharding::Run()
(CompileWorker pid=73427)     @ 0x7f42cdf7f371  816  xla::HloPassPipeline::RunPassesInternal<>()
(CompileWorker pid=73427)     @ 0x7f42cdf7ffc5  448  xla::HloPassPipeline::Run()
(CompileWorker pid=73427)     @ 0x7f42ca52cf24  80  xla::HloPassInterface::Run()
(CompileWorker pid=73427)     @ 0x7f42ca536391  4128  xla::spmd::RunAutoShardingPass()
(CompileWorker pid=73427)     @ 0x7f42ca52215a  160  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
(CompileWorker pid=73427)     @ 0x7f42ca392730  576  pybind11::cpp_function::dispatcher()
(CompileWorker pid=73427)     @ 0x4e1172  (unknown)  PyCFunction_Call
(CompileWorker pid=73427)     @ 0x71a560  (unknown)  (unknown)
(CompileWorker pid=73427) [2023-01-19 21:55:05,324 E 73427 73427] logging.cc:361: *** SIGABRT received at time=1674165305 on cpu 46 ***
(CompileWorker pid=73427) [2023-01-19 21:55:05,324 E 73427 73427] logging.cc:361: PC: @ 0x7f698dbd000b  (unknown)  raise
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f698deed420  537164224  (unknown)
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f42c9fb207d  10592  xla::spmd::BuildStrategyAndCost()
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f42cb6ce3b4  2368  xla::spmd::AutoSharding::Run()
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f42cdf7f371  816  xla::HloPassPipeline::RunPassesInternal<>()
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f42cdf7ffc5  448  xla::HloPassPipeline::Run()
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f42ca52cf24  80  xla::HloPassInterface::Run()
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f42ca536391  4128  xla::spmd::RunAutoShardingPass()
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f42ca52215a  160  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x7f42ca392730  576  pybind11::cpp_function::dispatcher()
(CompileWorker pid=73427) [2023-01-19 21:55:05,325 E 73427 73427] logging.cc:361:     @ 0x4e1172  (unknown)  PyCFunction_Call
(CompileWorker pid=73427) [2023-01-19 21:55:05,326 E 73427 73427] logging.cc:361:     @ 0x71a560  (unknown)  (unknown)
(CompileWorker pid=73427) Fatal Python error: Aborted
(CompileWorker pid=73427)
(CompileWorker pid=73427) Stack (most recent call first):
(CompileWorker pid=73427)   File "/opt/conda/envs/alpa/lib/python3.8/site-packages/alpa/shard_parallel/auto_sharding.py", line 344 in run_auto_sharding_pass
(CompileWorker pid=73427)   File "/opt/conda/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/stage_profiling.py", line 161 in compile_stage_for_profiling
(CompileWorker pid=73427)   File "/opt/conda/envs/alpa/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466 in _resume_span
(CompileWorker pid=73427)   File "/opt/conda/envs/alpa/lib/python3.8/site-packages/ray/_private/function_manager.py", line 674 in actor_method_executor
(CompileWorker pid=73427)   File "/opt/conda/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 763 in main_loop
(CompileWorker pid=73427)   File "/opt/conda/envs/alpa/lib/python3.8/site-packages/ray/_private/workers/default_worker.py", line 231 in
```

As a side note, it would be great if there were a single Dockerfile to compile and run the Alpa HEAD commit.