Open rustic-snob opened 7 months ago
What are your hardware specifications? Did you run out of RAM/VRAM?
I have A100-40GB*8 for VRAM, and RAM is like below:
              total        used        free      shared  buff/cache   available
Mem:          885Gi        81Gi       524Gi       9.4Gi       279Gi       787Gi
Swap:            0B          0B          0B
I monitored both throughout, and did not run out of either.
btw, is CUDA_VISIBLE_DEVICES="" necessary when doing python -m axolotl.cli.preprocess examples/llama-2/ver2.0.yml? I don't think I set it when I preprocessed.
Shouldn't be any issue.
May I ask which model size you're running? It wasn't that clear from the yaml.
It is just the Llama-2-7b-hf model with extra columns in the embedding and lm_head.
I don't know why, but after preprocessing without CUDA, training works well!
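For anyone hitting the same thing, this is roughly the sequence that ended up working (a sketch, assuming the same example config path as above):

# hide the GPUs so the preprocessing pass runs on CPU only
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/llama-2/ver2.0.yml

# then launch training normally on the GPUs
accelerate launch -m axolotl.cli.train examples/llama-2/ver2.0.yml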
I get a similar error with
accelerate launch -m axolotl.cli.train llama_lora.yml --deepspeed deepspeed_configs/zero1.json
with the same config as in the examples, just with these additions:
lora_modules_to_save:
  - embed_tokens
  - lm_head
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
It works in the other 2 cases:
If I remove deepspeed.
If I change lora to qlora.
The error occurs after an epoch is complete.
In your case, it's usually out of system RAM when it's gathering the weights from the various GPUs.
@winglian yeah, the exit code is -9, which probably relates to a system RAM OOM issue, but why would that happen even though I had 800GB of free RAM?
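For what it's worth, exit code -9 means the process was SIGKILLed, and when that comes from the kernel OOM killer it leaves a trace in the kernel log; something like the following should confirm it (assuming you can read the kernel log on the machine):

# check whether the OOM killer reaped the training process
dmesg -T | grep -i -E "out of memory|killed process"
# or, on systemd-based systems
journalctl -k | grep -i "killed process"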
Sorry for the necro, but how do you solve this issue if renting compute?
I have the same problem.
The thing I noticed is this only happens after I resume training from a checkpoint, never during the first run (although I can see how this could also happen during the normal run), and it happens during saving a checkpoint (when the model is transferred from the GPU to the system memory). The problem is that we run out of system RAM and the OS kills the process to save itself (otherwise it would crash) - this is a normal behavior of the OS, but the question is why this happens.
If I start training, I can train with no problem (although, again, this is my case; I can see how others could have this problem even during this stage). In the image below you can see the system RAM usage. The "spikes" are when the checkpoint is being saved (it's set to every 100 steps because of this issue, so we do not lose too much training when it happens), and the thing to notice is that it sometimes uses more RAM for several steps and then drops down again:

[image: system RAM usage during the initial run, with spikes at each checkpoint save]

And I can train for as many steps as I like. But once I stop the training and restart it from the last checkpoint:

[image: system RAM usage after resuming from the last checkpoint]

It, for some reason, uses more RAM at startup and during the whole training; then, on top of this, it also has these moments when it consumes more RAM, up to the point where memory usage rises again and it runs out of system RAM.
It seems like something in system memory is not being cleaned up properly, and the charts suggest as much.
I'm using multi-GPU training with DeepSpeed ZeRO3 (I'm not using any CPU offload) and training part of the model in this case.
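In case anyone wants to reproduce this kind of chart without a full monitoring stack, a crude loop like the one below is enough to log system RAM over time and correlate the spikes with checkpoint saves (just a sketch; the 10-second interval and the log file name are arbitrary):

# append a timestamped copy of the "Mem:" line from free every 10 seconds
while true; do
  echo "$(date +%T) $(free -h | sed -n '2p')"
  sleep 10
done | tee ram_usage.log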
Please check that this issue hasn't been reported before.
Expected Behavior
I first ran
python -m axolotl.cli.preprocess examples/llama-2/ver2.0.yml
because I have a lot of data (total_num_tokens: 10394324568). It ran successfully and the data was saved in the last_run_prepared folder. After that, I ran
accelerate launch -m axolotl.cli.train examples/llama-2/ver2.0.yml
to train.

Current behaviour
But the training hangs from here for about 10 min. After 10 min, it crashes with these messages. I thought it was due to something like a time limit, so I modified the is_distributed function in distributed.py as below, but it did not help. I also tried
ddp_timeout: 99999
but that does not work, either.

Steps to reproduce
I just used the ver2.0.yml below, then ran preprocess and then train.

Config yaml
Possible solution
I think it has something to do with a time limit, but I don't know how to fix it.
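One thing that might at least give more signal, assuming the hang happens inside an NCCL collective (I haven't confirmed this fixes anything), is turning on NCCL logging and async error handling before launching:

# extra NCCL logging to see which rank stalls, and fail fast instead of hanging
export NCCL_DEBUG=INFO
export NCCL_ASYNC_ERROR_HANDLING=1
accelerate launch -m axolotl.cli.train examples/llama-2/ver2.0.yml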
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements