alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

OSError: [Errno 28] No space left on device #763

Closed: sunilitggu closed this issue 1 year ago

sunilitggu commented 1 year ago

Please describe the bug

Trying to train a GPT-3 6.7B-parameter model using the code https://github.com/alpa-projects/alpa/blob/main/examples/gpt2/run_clm_flax.py on 2 nodes, each with 8 V100 GPUs, clustered using Ray.

2022-10-31 12:54:24,895 INFO usage_lib.py:483 -- Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2022-10-31 12:54:24,895 INFO scripts.py:719 -- Local node IP: 172.31.20.63
2022-10-31 12:54:27,132 SUCC scripts.py:756 -- --------------------
2022-10-31 12:54:27,132 SUCC scripts.py:757 -- Ray runtime started.
2022-10-31 12:54:27,132 SUCC scripts.py:758 -- --------------------
2022-10-31 12:54:27,132 INFO scripts.py:760 -- Next steps
2022-10-31 12:54:27,132 INFO scripts.py:761 -- To connect to this Ray runtime from another node, run
2022-10-31 12:54:27,132 INFO scripts.py:764 --   ray start --address='172.31.20.63:6379'
2022-10-31 12:54:27,132 INFO scripts.py:780 -- Alternatively, use the following Python code:
2022-10-31 12:54:27,132 INFO scripts.py:782 -- import ray
2022-10-31 12:54:27,132 INFO scripts.py:786 -- ray.init(address='auto')
2022-10-31 12:54:27,133 INFO scripts.py:798 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-10-31 12:54:27,133 INFO scripts.py:802 -- connect to a remote cluster from your laptop directly, use the following
2022-10-31 12:54:27,133 INFO scripts.py:806 -- Python code:
2022-10-31 12:54:27,133 INFO scripts.py:808 -- import ray
2022-10-31 12:54:27,133 INFO scripts.py:809 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-10-31 12:54:27,133 INFO scripts.py:818 -- If connection fails, check your firewall settings and network configuration.
2022-10-31 12:54:27,133 INFO scripts.py:826 -- To terminate the Ray runtime, run
2022-10-31 12:54:27,133 INFO scripts.py:827 --   ray stop
2022-10-31 12:54:27,133 INFO scripts.py:905 -- --block
2022-10-31 12:54:27,133 INFO scripts.py:906 -- This command will now block forever until terminated by a signal.
2022-10-31 12:54:27,133 INFO scripts.py:909 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
[2022-10-31 12:54:33,153 W 3285 3285] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2022-10-31 12:54:33,016 INFO scripts.py:883 -- Local node IP: 172.31.20.42
2022-10-31 12:54:34,157 SUCC scripts.py:895 -- --------------------
2022-10-31 12:54:34,158 SUCC scripts.py:896 -- Ray runtime started.
2022-10-31 12:54:34,158 SUCC scripts.py:897 -- --------------------
2022-10-31 12:54:34,158 INFO scripts.py:899 -- To terminate the Ray runtime, run
2022-10-31 12:54:34,158 INFO scripts.py:900 --   ray stop
2022-10-31 12:54:34,158 INFO scripts.py:905 -- --block
2022-10-31 12:54:34,158 INFO scripts.py:906 -- This command will now block forever until terminated by a signal.
2022-10-31 12:54:34,158 INFO scripts.py:909 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
======== Autoscaler status: 2022-10-31 12:54:46.646297 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_aac52388a435f09badc3236ededb32791f7bd4f87087c275a17b699b
 1 node_488381494d777c9a8307480fe252c986bf011571367b6da9c9bbe220
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/160.0 CPU
 0.0/16.0 GPU
 0.0/2.0 accelerator_type:V100
 0.00/680.878 GiB memory
 0.00/295.796 GiB object_store_memory

Demands:
 (no resource demands)
2022-10-31 12:55:20,039 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.20.63:6379...
2022-10-31 12:55:20,051 INFO worker.py:1518 -- Connected to Ray cluster.
INFO:__main__:Training/evaluation parameters TrainingArguments(output_dir='/data/model_v1_2_2048', overwrite_output_dir=True, do_train=True, do_eval=True, per_device_train_batch_size=4, per_device_eval_batch_size=4, num_micro_batches=4, learning_rate=0.001, weight_decay=0.01, adam_beta1=0.9, adam_beta2=0.98, adam_epsilon=1e-08, adafactor=False, num_train_epochs=5.0, warmup_steps=1000, logging_steps=100, save_steps=63, eval_steps=63, seed=42, push_to_hub=False, hub_model_id=None, hub_token=None)
WARNING:datasets.builder:Reusing dataset wikitext (/home/ss1/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)

  0%|          | 0/3 [00:00<?, ?it/s]
 67%|██████▋   | 2/3 [00:00<00:00,  4.58it/s]
100%|██████████| 3/3 [00:00<00:00,  5.84it/s]
loading configuration file /data/config.json
Model config GPT2Config {
  "_name_or_path": "/data",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 2048,
  "n_embd": 4096,
  "n_head": 32,
  "n_inner": null,
  "n_layer": 30,
  "n_positions": 2048,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.21.2",
  "use_cache": true,
  "vocab_size": 50257
}

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file /data/config.json
Model config GPT2Config {
  "_name_or_path": "/data",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 2048,
  "n_embd": 4096,
  "n_head": 32,
  "n_inner": null,
  "n_layer": 30,
  "n_positions": 2048,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.21.2",
  "use_cache": true,
  "vocab_size": 50257
}

Didn't find file /data/vocab.json. We won't load it.
Didn't find file /data/merges.txt. We won't load it.
Didn't find file /data/added_tokens.json. We won't load it.
Didn't find file /data/special_tokens_map.json. We won't load it.
Didn't find file /data/tokenizer_config.json. We won't load it.
loading file None
loading file None
loading file /data/tokenizer.json
loading file None
loading file None
loading file None
loading configuration file /data/config.json
Model config GPT2Config {
  "_name_or_path": "/data",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 2048,
  "n_embd": 4096,
  "n_head": 32,
  "n_inner": null,
  "n_layer": 30,
  "n_positions": 2048,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.21.2",
  "use_cache": true,
  "vocab_size": 50257
}

INFO:__main__:***** Tokenize dataset *****
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ss1/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-eafb30dfb9bc41fc.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ss1/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-accd64566af85863.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ss1/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-b4f47b23e8bec562.arrow
INFO:__main__:***** Build dataset *****

  0%|          | 0/5 [00:00<?, ?ba/s]
 20%|██        | 1/5 [00:00<00:00,  7.65ba/s]
 60%|██████    | 3/5 [00:00<00:00, 10.76ba/s]
100%|██████████| 5/5 [00:00<00:00, 13.20ba/s]

  0%|          | 0/37 [00:00<?, ?ba/s]
  5%|▌         | 2/37 [00:00<00:03, 11.16ba/s]
 11%|█         | 4/37 [00:00<00:02, 12.48ba/s]
 16%|█▌        | 6/37 [00:00<00:02, 13.34ba/s]
 22%|██▏       | 8/37 [00:00<00:02, 12.62ba/s]
 27%|██▋       | 10/37 [00:00<00:02, 12.89ba/s]
 32%|███▏      | 12/37 [00:00<00:01, 12.92ba/s]
 38%|███▊      | 14/37 [00:01<00:01, 13.05ba/s]
 43%|████▎     | 16/37 [00:01<00:01, 13.00ba/s]
 49%|████▊     | 18/37 [00:01<00:01, 12.92ba/s]
 54%|█████▍    | 20/37 [00:01<00:01, 13.26ba/s]
 59%|█████▉    | 22/37 [00:01<00:01, 12.99ba/s]
 65%|██████▍   | 24/37 [00:01<00:00, 13.29ba/s]
 70%|███████   | 26/37 [00:01<00:00, 13.46ba/s]
 76%|███████▌  | 28/37 [00:02<00:00, 12.91ba/s]
 81%|████████  | 30/37 [00:02<00:00, 13.21ba/s]
 86%|████████▋ | 32/37 [00:02<00:00, 12.73ba/s]
 92%|█████████▏| 34/37 [00:02<00:00, 12.61ba/s]
 97%|█████████▋| 36/37 [00:02<00:00, 12.62ba/s]
100%|██████████| 37/37 [00:02<00:00, 12.97ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]
 50%|█████     | 2/4 [00:00<00:00, 12.01ba/s]
100%|██████████| 4/4 [00:00<00:00, 13.19ba/s]
100%|██████████| 4/4 [00:00<00:00, 12.99ba/s]
INFO:__main__:***** Running training *****
INFO:__main__:  Num examples = 1509
INFO:__main__:  Num Epochs = 5
INFO:__main__:  Batch size per device (w. accumulation) = 4
INFO:__main__:  Global train batch size (w. parallel & distributed) = 64
INFO:__main__:  Total optimization steps = 115
Initial compilation. This might take some minutes...

Epoch ... :   0%|          | 0/5 [00:00<?, ?it/s]

Epoch ... :   0%|          | 0/5 [00:00<?, ?it/s]

Training...:   0%|          | 0/23 [00:00<?, ?it/s](raylet) Spilled 4096 MiB, 8 objects, write throughput 1130 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 6144 MiB, 12 objects, write throughput 1257 MiB/s.
(raylet) Spilled 11265 MiB, 21 objects, write throughput 1304 MiB/s.
(raylet) Spilled 17410 MiB, 35 objects, write throughput 1431 MiB/s.
(raylet) Spilled 35844 MiB, 69 objects, write throughput 1606 MiB/s.
(raylet) Spilled 65800 MiB, 146 objects, write throughput 1562 MiB/s.
(raylet) Spilled 131207 MiB, 636 objects, write throughput 1740 MiB/s.
(raylet) [2022-10-31 12:59:35,557 E 62901 62931] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-10-31_12-54-25_011441_62842 is over 95% full, available space: 13307080704; capacity: 270475862016. Object creation will fail if spilling is required.
(raylet) [2022-10-31 12:59:45,567 E 62901 62931] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-10-31_12-54-25_011441_62842 is over 95% full, available space: 5092540416; capacity: 270475862016. Object creation will fail if spilling is required.
2022-10-31 12:59:54,241 WARNING worker.py:1829 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 908, in ray._raylet.spill_objects_handler
  File "python/ray/_raylet.pyx", line 911, in ray._raylet.spill_objects_handler
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/external_storage.py", line 657, in spill_objects
    return _external_storage.spill_objects(object_refs, owner_addresses)
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/external_storage.py", line 302, in spill_objects
    return self._write_multiple_objects(f, object_refs, owner_addresses, url)
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/external_storage.py", line 149, in _write_multiple_objects
    written_bytes = f.write(payload)
OSError: [Errno 28] No space left on device
An unexpected internal error occurred while the IO worker was spilling objects: [Errno 28] No space left on device
(raylet) Spilled 297799 MiB, 1099 objects, write throughput 2550 MiB/s.
(raylet) [2022-10-31 12:59:55,577 E 62901 62931] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-10-31_12-54-25_011441_62842 is over 95% full, available space: 0; capacity: 270475862016. Object creation will fail if spilling is required.
(raylet) [2022-10-31 13:00:05,587 E 62901 62931] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-10-31_12-54-25_011441_62842 is over 95% full, available space: 0; capacity: 270475862016. Object creation will fail if spilling is required.
(MeshHostWorker pid=64024) 2022-10-31 13:00:10.999771: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:456] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Internal error: writing file
(MeshHostWorker pid=64024) '  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
(MeshHostWorker pid=64024) *** SIGABRT received at time=1667206810 on cpu 16 ***
(MeshHostWorker pid=64024) PC: @     0x7f9e37b1400b  (unknown)  raise
(MeshHostWorker pid=64024)     @     0x7f9e37e3b420  (unknown)  (unknown)
(MeshHostWorker pid=64024)     @     0x7f78bf932519       2128  xla::gpu::NVPTXCompiler::CompileGpuAsmOrGetCachedResult()
(MeshHostWorker pid=64024)     @     0x7f78bf933935       2752  xla::gpu::NVPTXCompiler::CompileTargetBinary()
(MeshHostWorker pid=64024)     @     0x7f78bf97838d        784  xla::gpu::GpuCompiler::CompileToTargetBinary()::{lambda()#1}::operator()()
(MeshHostWorker pid=64024)     @     0x7f78bf97c9b0        912  xla::gpu::GpuCompiler::CompileToTargetBinary()
(MeshHostWorker pid=64024)     @     0x7f78bf985774       1792  xla::gpu::GpuCompiler::RunBackend()
(MeshHostWorker pid=64024)     @     0x7f78c23f59da        304  xla::LLVMCompiler::Compile()
(MeshHostWorker pid=64024)     @     0x7f78bf90a1ca        640  xla::Service::BuildExecutables()
(MeshHostWorker pid=64024)     @     0x7f78bf6f8a4a       1056  xla::LocalService::CompileExecutables()
(MeshHostWorker pid=64024)     @     0x7f78bf6f18a6       2672  xla::LocalClient::Compile()
(MeshHostWorker pid=64024)     @     0x7f78bf5dcaa9        832  xla::PjRtStreamExecutorClient::Compile()
(MeshHostWorker pid=64024)     @     0x7f78bf547cc2       1072  xla::PyClient::Compile()
(MeshHostWorker pid=64024)     @     0x7f78be8f5c82       1824  pybind11::detail::argument_loader<>::call_impl<>()
(MeshHostWorker pid=64024)     @     0x7f78be8f6066        192  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
(MeshHostWorker pid=64024)     @     0x7f78be8cce3b        608  pybind11::cpp_function::dispatcher()
(MeshHostWorker pid=64024)     @     0x5630926b026c  (unknown)  cfunction_call
(MeshHostWorker pid=64024)     @ ... and at least 3 more frames
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361: *** SIGABRT received at time=1667206811 on cpu 16 ***
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361: PC: @     0x7f9e37b1400b  (unknown)  raise
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f9e37e3b420  (unknown)  (unknown)
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf932519       2128  xla::gpu::NVPTXCompiler::CompileGpuAsmOrGetCachedResult()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf933935       2752  xla::gpu::NVPTXCompiler::CompileTargetBinary()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf97838d        784  xla::gpu::GpuCompiler::CompileToTargetBinary()::{lambda()#1}::operator()()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf97c9b0        912  xla::gpu::GpuCompiler::CompileToTargetBinary()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf985774       1792  xla::gpu::GpuCompiler::RunBackend()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78c23f59da        304  xla::LLVMCompiler::Compile()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf90a1ca        640  xla::Service::BuildExecutables()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf6f8a4a       1056  xla::LocalService::CompileExecutables()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf6f18a6       2672  xla::LocalClient::Compile()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf5dcaa9        832  xla::PjRtStreamExecutorClient::Compile()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78bf547cc2       1072  xla::PyClient::Compile()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78be8f5c82       1824  pybind11::detail::argument_loader<>::call_impl<>()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78be8f6066        192  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x7f78be8cce3b        608  pybind11::cpp_function::dispatcher()
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @     0x5630926b026c  (unknown)  cfunction_call
(MeshHostWorker pid=64024) [2022-10-31 13:00:11,531 E 64024 64024] logging.cc:361:     @ ... and at least 3 more frames
(MeshHostWorker pid=64024) Fatal Python error: Aborted
(MeshHostWorker pid=64024) 
(MeshHostWorker pid=64024) Stack (most recent call first):
(MeshHostWorker pid=64024)   File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/alpa/shard_parallel/auto_sharding.py", line 457 in run_backend_compilation
(MeshHostWorker pid=64024)   File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/alpa/mesh_executable.py", line 833 in __init__
(MeshHostWorker pid=64024)   File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/alpa/device_mesh.py", line 268 in put_executable
(MeshHostWorker pid=64024)   File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 466 in _resume_span
(MeshHostWorker pid=64024)   File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/function_manager.py", line 674 in actor_method_executor
(MeshHostWorker pid=64024)   File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/worker.py", line 756 in main_loop
(MeshHostWorker pid=64024)   File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 237 in <module>
2022-10-31 13:00:11,823 WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff2181028490594cb8317870f301000000 Worker ID: f7c05586338c76308dc069e91c32fba70e982755bdc6384ce392243a Node ID: aac52388a435f09badc3236ededb32791f7bd4f87087c275a17b699b Worker IP address: 172.31.20.63 Worker port: 10005 Worker PID: 64024 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

                                                   
Epoch ... :   0%|          | 0/5 [03:28<?, ?it/s]
Traceback (most recent call last):
  File "/home/ss1/project/language_model/run_clm_alpa_flax.py", line 892, in <module>
    main()
  File "/home/ss1/project/language_model/run_clm_alpa_flax.py", line 778, in main
    executable.sync()
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/alpa/mesh_executable.py", line 119, in sync
    self.physical_mesh.sync_workers()
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/alpa/device_mesh.py", line 1309, in sync_workers
    ray.get([w.sync.remote() for w in self.workers])
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: MeshHostWorker
    actor_id: 2181028490594cb8317870f301000000
    pid: 64024
    namespace: fc0f0709-1fa6-4712-96a5-250c13fc094d
    ip: 172.31.20.63
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Exception ignored in: <function RemoteArrayRef.__del__ at 0x7fd1a4089c10>
Traceback (most recent call last):
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/alpa/device_mesh.py", line 1373, in __del__
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/alpa/device_mesh.py", line 1140, in delete_remote_buffers
  File "/opt/miniconda/envs/alpa/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 98, in wrapper
ImportError: sys.meta_path is None, Python is likely shutting down
Exception ignored in: <function RemoteArrayRef.__del__ at 0x7fd1a4089c10>

Please describe the expected behavior

System information and environment

To Reproduce

Steps to reproduce the behavior:

  1. Command used to run: `sbatch call_alpa.sh`

  2. call_alpa.sh

    #!/bin/bash
    #SBATCH -n 2
    #SBATCH -c 80
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:v100:8
    #SBATCH --time=48:00:00
    #SBATCH --reservation=ss1_4
    #SBATCH --output=slurm/alpa/sep31/gpt2_%j.out

    module load singularity
    export ALPA_IMG="/home/pub/singularity/general/alpa-v0.1.6.sif"

    export SINGULARITYENV_PREPEND_PATH="/opt/conda/envs/alpa/bin"

    export HEAD="$(scontrol show hostname ${SLURM_NODELIST} | head -n1)"
    export RAY_PORT="6379"

    srun singularity exec --nv --bind /home/ss1/project/language_model/model_dumps/alpa_models:/data ${ALPA_IMG} bash alpa_cmd.sh

  3. alpa_cmd.sh
```bash
#!/bin/bash

echo "${SLURM_NTASKS} ${SLURM_PROCID} ${SLURM_LOCALID}"
export SLURM_GPUS_PER_TASK=8
export SLURM_CPUS_PER_TASK=80

if [ "$(hostname -s)" =  "${HEAD}" ]
then
  ray start --head --block &
else
  sleep 15
  ray start --address="${HEAD}:${RAY_PORT}" --block
fi

sleep 30

if [ "$(hostname -s)" =  "${HEAD}" ]
then
  ray status
  python3 run_clm_flax.py \
    --output_dir="/data/model_v1_2_2048" \
    --model_type="gpt2" \
    --config_name="/data" \
    --tokenizer_name="/data" \
    --dataset_name="wikitext" \
    --dataset_config_name="wikitext-2-v1" \
    --do_train --do_eval \
    --block_size="2048" \
    --per_device_train_batch_size="4" \
    --per_device_eval_batch_size="4" \
    --num_micro_batches="4" \
    --dtype="float16" \
    --learning_rate="1e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="5" \
    --logging_steps="100" \
    --save_steps="63" \
    --eval_steps="63"

fi
```

config.json

{
  "_name_or_path": "./oscar_data",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 2048,
  "n_embd": 4096,
  "n_head": 32,
  "n_inner": null,
  "n_layer": 30,
  "n_positions": 2048,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.23.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}
  4. See error

Screenshots

(The same training log as in the bug description above was pasted here again.)

Dockerfile used to create the image

FROM nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04

# init workdir
RUN mkdir -p /build
WORKDIR /build

# install common tool & conda
RUN apt update && \
    apt install wget -y && \
    apt install git -y && \
    apt install vim -y && \
    wget --quiet https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh -O ~/anaconda.sh && \
    /bin/bash ~/anaconda.sh -b -p /opt/conda && \
    rm ~/anaconda.sh && \
    mkdir -p /opt/conda/envs/alpa && \
    ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate base" >> ~/.bashrc

# install conda alpa env
RUN . /opt/conda/etc/profile.d/conda.sh && \
    conda create --name alpa python=3.8 -y && \
    conda activate alpa && \
    apt install coinor-cbc -y && \
    pip3 install --upgrade pip && \
    pip3 install cupy-cuda113 && \
    pip3 install alpa && \
    pip3 install jaxlib==0.3.15+cuda113.cudnn820 -f https://alpa-projects.github.io/wheels.html

Additional information

The Docker image is converted to a Singularity image and run as a SIF file.

zhisbug commented 1 year ago

@sunilitggu It seems you do not have enough disk space. Could you try to increase the disk space allocated to this Docker container?

zhisbug commented 1 year ago

@sunilitggu If you are not able to increase the disk size, you can consider:

  1. Downgrade to Ray 1.13; the object spilling there might be less aggressive.
  2. Read the Ray documentation and find out how to disable object spilling in Ray (a sketch of redirecting the spill location instead is shown below).
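
A minimal sketch of what redirecting Ray's spill location could look like in this setup, assuming a scratch filesystem with more free space is available at a placeholder path such as /scratch/${USER} (Ray spills objects under its temporary/session directory by default, so relocating that directory also moves the spill files off /tmp):

```bash
# Sketch only; /scratch/${USER}/ray_tmp is a placeholder for any filesystem
# with more free space than /tmp on the head node.
RAY_TMP="/scratch/${USER}/ray_tmp"
mkdir -p "${RAY_TMP}"
df -h "${RAY_TMP}"   # sanity-check the available space first

# Start the head node with a relocated temp directory; the session directory
# (and the spilled objects under it) then lands on the larger filesystem
# instead of /tmp/ray/session_*.
ray start --head --temp-dir="${RAY_TMP}" --block &
```

Note that --temp-dir may only take effect on the head node; if the worker nodes also spill, their /tmp may need similar treatment (for example via a bind mount, as sketched further down in this thread).
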
zhisbug commented 1 year ago

@sunilitggu is your problem solved?

sunilitggu commented 1 year ago

@sunilitggu is your problem solved?

Not yet.

zhisbug commented 1 year ago

@sunilitggu Have you communicated with @aurickq about the suggested solution?

The root cause is that you're loading the entire training dataset into the Ray object store; when the object store is 60% full, Ray starts spilling some of its contents to disk. Meanwhile, you have not allocated enough disk space for your Docker container.

Could you try:

  1. Do not load your entire dataset into the Ray object store.
  2. Allocate more disk space for your Docker container (see the sketch below).
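
For suggestion 2 in this Singularity-based setup, a minimal sketch (the host path /scratch/${USER}/ray_tmp is a placeholder for any host filesystem with enough free space) would be to bind a roomier host directory over /tmp inside the container, so that /tmp/ray, and therefore the spilled objects, land on the larger filesystem:

```bash
# Sketch only: bind a larger host scratch directory over /tmp in the container.
mkdir -p /scratch/${USER}/ray_tmp
srun singularity exec --nv \
  --bind /home/ss1/project/language_model/model_dumps/alpa_models:/data \
  --bind /scratch/${USER}/ray_tmp:/tmp \
  ${ALPA_IMG} bash alpa_cmd.sh
```
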
sunilitggu commented 1 year ago

  1. wikitext-2-v1 is not a huge dataset; it has only 1509 examples. Do you think this will create a problem?
  2. I am running the code in a Singularity container (SIF file), and I haven't found any way to increase the disk space inside it.
zhisbug commented 1 year ago

  1. Looking at the errors below, the amount of data spilled is around 131 GB in total, roughly 500 MB per object. Could you figure out what these objects are? That might give some hints (see the sketch after this list).

     (raylet) Spilled 6144 MiB, 12 objects, write throughput 1257 MiB/s.
     (raylet) Spilled 11265 MiB, 21 objects, write throughput 1304 MiB/s.
     (raylet) Spilled 17410 MiB, 35 objects, write throughput 1431 MiB/s.
     (raylet) Spilled 35844 MiB, 69 objects, write throughput 1606 MiB/s.
     (raylet) Spilled 65800 MiB, 146 objects, write throughput 1562 MiB/s.
     (raylet) Spilled 131207 MiB, 636 objects, write throughput 1740 MiB/s.

  2. I am not familiar with SIF, so I cannot help further in this direction.
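
A minimal sketch of how one might inspect what is being spilled, assuming the default /tmp/ray location and running on the node whose raylet reports the spilling while training is in progress:

```bash
# Sketch only: look at the object-store contents and where the spilled bytes go.
ray memory                                                # objects currently in the Ray object store, with their owners
du -h --max-depth=2 /tmp/ray/session_latest/ | sort -h    # on-disk footprint of the current Ray session
df -h /tmp                                                # remaining space on the filesystem Ray spills to
```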

zhisbug commented 1 year ago

@sunilitggu is your problem solved?

sunilitggu commented 1 year ago

@zhisbug Not yet. I haven't had time to explore it further.

merrymercy commented 1 year ago

Closed due to inactivity.