THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

GLM4-9b-chat trained with X-LoRA from PEFT: unable to resume training from a saved checkpoint #617

Open SongHanKen opened 3 days ago

SongHanKen commented 3 days ago

System Info / 系統信息

PEFT v0.13.2, Transformers v4.44.0, Accelerate v0.33.0

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

I am trying to train the model using finetune.py from the glm4-9b finetune demo together with X-LoRA from PEFT. finetune.py has not been modified in any way. Below is my xlora.yaml file:

data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1

combine: True
freezeV: True
max_input_length: 512
max_output_length: 512

training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output_1026
  max_steps: 20000
  # needed to be fit for the dataset
  learning_rate: 3e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 5
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 5
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 2000
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: configs/ds_zero_3.json

peft_config:
  peft_type: XLORA
  task_type: CAUSAL_LM
  hidden_size: 4096
  xlora_depth: 1
  adapters: {
    "adapter_0": "/home/hs/hs/finetune_demo/output_MechanicsMaterials_New/checkpoint-5000/",
    "adapter_1": "/home/hs/hs/finetune_demo/output_biology/checkpoint-4000/"
  }
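
For reference, here is a minimal sketch (an assumption of mine, not part of finetune.py) of how a peft_config block like the one above maps onto PEFT's X-LoRA API; the model id THUDM/glm-4-9b-chat stands in for the local model directory that finetune.py actually loads:

# Minimal sketch, not part of finetune.py: build an X-LoRA model from the two
# pre-trained LoRA adapters listed in xlora.yaml (assumes PEFT v0.13.x).
from transformers import AutoModelForCausalLM
from peft import XLoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",  # illustrative; finetune.py uses the local model path
    trust_remote_code=True,
)
xlora_config = XLoraConfig(
    task_type=TaskType.CAUSAL_LM,
    hidden_size=4096,
    xlora_depth=1,
    adapters={
        "adapter_0": "/home/hs/hs/finetune_demo/output_MechanicsMaterials_New/checkpoint-5000/",
        "adapter_1": "/home/hs/hs/finetune_demo/output_biology/checkpoint-4000/",
    },
)
model = get_peft_model(base_model, xlora_config)
model.print_trainable_parameters()  # only the X-LoRA classifier should be reported as trainable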

adapter_0 and adapter_1 are LoRA adapters that I trained with glm4-9b and finetune.py. Training with X-LoRA currently saves checkpoints without problems, but when I try to resume training from the last checkpoint an error is raised. Here is the full error output:

Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]
Loading checkpoint shards:  10%|█         | 1/10 [00:00<00:01,  6.86it/s]
Loading checkpoint shards:  20%|██        | 2/10 [00:00<00:01,  6.89it/s]
Loading checkpoint shards:  30%|███       | 3/10 [00:00<00:01,  6.90it/s]
Loading checkpoint shards:  40%|████      | 4/10 [00:00<00:00,  6.43it/s]
Loading checkpoint shards:  50%|█████     | 5/10 [00:00<00:00,  6.61it/s]
Loading checkpoint shards:  60%|██████    | 6/10 [00:00<00:00,  6.72it/s]
Loading checkpoint shards:  70%|███████   | 7/10 [00:01<00:00,  6.80it/s]
Loading checkpoint shards:  80%|████████  | 8/10 [00:01<00:00,  6.85it/s]
Loading checkpoint shards:  90%|█████████ | 9/10 [00:01<00:00,  6.88it/s]
Loading checkpoint shards: 100%|██████████| 10/10 [00:01<00:00,  6.96it/s]
Loading checkpoint shards: 100%|██████████| 10/10 [00:01<00:00,  6.82it/s]

  0%|          | 0/2 [00:00<?, ?it/s]
 50%|█████     | 1/2 [00:06<00:06,  6.33s/it]
100%|██████████| 2/2 [00:06<00:00,  3.22s/it]
Froze 160 adapters.
LoRA -> xLoRA complete: Swapped 40 LoRA layers (out of 971 modules).
trainable params: 67,145,732 || all params: 9,472,667,652 || trainable%: 0.7088

Map:   0%|          | 0/14803 [00:00<?, ? examples/s]
Map:   7%|▋         | 1000/14803 [00:03<00:42, 327.62 examples/s]
Map:  14%|█▎        | 2000/14803 [00:05<00:36, 347.77 examples/s]
Map:  20%|██        | 3000/14803 [00:08<00:33, 356.42 examples/s]
Map:  27%|██▋       | 4000/14803 [00:11<00:29, 361.64 examples/s]
Map:  34%|███▍      | 5000/14803 [00:13<00:26, 363.21 examples/s]
Map:  41%|████      | 6000/14803 [00:16<00:24, 363.04 examples/s]
Map:  47%|████▋     | 7000/14803 [00:19<00:21, 363.56 examples/s]
Map:  54%|█████▍    | 8000/14803 [00:21<00:16, 413.31 examples/s]
Map:  61%|██████    | 9000/14803 [00:22<00:11, 504.23 examples/s]
Map:  68%|██████▊   | 10000/14803 [00:23<00:08, 597.65 examples/s]
Map:  74%|███████▍  | 11000/14803 [00:24<00:05, 681.07 examples/s]
Map:  81%|████████  | 12000/14803 [00:25<00:03, 754.44 examples/s]
Map:  88%|████████▊ | 13000/14803 [00:26<00:02, 809.81 examples/s]
Map:  95%|█████████▍| 14000/14803 [00:27<00:00, 856.43 examples/s]
Map: 100%|██████████| 14803/14803 [00:28<00:00, 890.57 examples/s]
Map: 100%|██████████| 14803/14803 [00:28<00:00, 528.34 examples/s]
train_dataset: Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 14803
})

Map:   0%|          | 0/2 [00:00<?, ? examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 187.78 examples/s]
val_dataset: Dataset({
    features: ['input_ids', 'output_ids'],
    num_rows: 2
})

Map:   0%|          | 0/2 [00:00<?, ? examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 189.77 examples/s]
test_dataset: Dataset({
    features: ['input_ids', 'output_ids'],
    num_rows: 2
})
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
resume checkpoint from checkpoint-20
Loading model from ./output_new/checkpoint-20.
Multiple active adapters detected will only consider the first adapter
[2024-10-15 18:47:30,968] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/transformers/trainer.py:3098: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
[rank0]: ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
[rank0]: │ /home/zhangjunyi/hs_test/finetune_demo/finetune.py:615 in main               │
[rank0]: │                                                                              │
[rank0]: │   612 │   │   │   │   model.enable_input_require_grads()                     │
[rank0]: │   613 │   │   │   │   checkpoint_directory = os.path.join(output_dir, "check │
[rank0]: │   614 │   │   │   │   print("resume checkpoint from checkpoint-" + str(check │
[rank0]: │ ❱ 615 │   │   │   │   trainer.train(resume_from_checkpoint=checkpoint_direct │
[rank0]: │   616 │   │   │   else:                                                      │
[rank0]: │   617 │   │   │   │   trainer.train()                                        │
[rank0]: │   618 │   │   else:                                                          │
[rank0]: │                                                                              │
[rank0]: │ /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/transformers/trainer │
[rank0]: │ .py:1938 in train                                                            │
[rank0]: │                                                                              │
[rank0]: │   1935 │   │   │   finally:                                                  │
[rank0]: │   1936 │   │   │   │   hf_hub_utils.enable_progress_bars()                   │
[rank0]: │   1937 │   │   else:                                                         │
[rank0]: │ ❱ 1938 │   │   │   return inner_training_loop(                               │
[rank0]: │   1939 │   │   │   │   args=args,                                            │
[rank0]: │   1940 │   │   │   │   resume_from_checkpoint=resume_from_checkpoint,        │
[rank0]: │   1941 │   │   │   │   trial=trial,                                          │
[rank0]: │                                                                              │
[rank0]: │ /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/transformers/trainer │
[rank0]: │ .py:2126 in _inner_training_loop                                             │
[rank0]: │                                                                              │
[rank0]: │   2123 │   │   │   │   self._load_from_checkpoint(resume_from_checkpoint, se │
[rank0]: │   2124 │   │                                                                 │
[rank0]: │   2125 │   │   # Check if saved optimizer or scheduler states exist          │
[rank0]: │ ❱ 2126 │   │   self._load_optimizer_and_scheduler(resume_from_checkpoint)    │
[rank0]: │   2127 │   │                                                                 │
[rank0]: │   2128 │   │   # important: at this point:                                   │
[rank0]: │   2129 │   │   # self.model         is the Transformers Model                │
[rank0]: │                                                                              │
[rank0]: │ /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/transformers/trainer │
[rank0]: │ .py:3097 in _load_optimizer_and_scheduler                                    │
[rank0]: │                                                                              │
[rank0]: │   3094 │   │   │   │   │   │   │   **_get_fsdp_ckpt_kwargs(),                │
[rank0]: │   3095 │   │   │   │   │   │   )                                             │
[rank0]: │   3096 │   │   │   │   │   else:                                             │
[rank0]: │ ❱ 3097 │   │   │   │   │   │   self.optimizer.load_state_dict(               │
[rank0]: │   3098 │   │   │   │   │   │   │   torch.load(os.path.join(checkpoint, OPTIM │
[rank0]: │   3099 │   │   │   │   │   │   )                                             │
[rank0]: │   3100 │   │   │   │   with warnings.catch_warnings(record=True) as caught_w │
[rank0]: │                                                                              │
[rank0]: │ /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/accelerate/optimizer │
[rank0]: │ .py:107 in load_state_dict                                                   │
[rank0]: │                                                                              │
[rank0]: │   104 │   def load_state_dict(self, state_dict):                             │
[rank0]: │   105 │   │   if self.accelerator_state.distributed_type == DistributedType. │
[rank0]: │   106 │   │   │   xm.send_cpu_data_to_device(state_dict, self.accelerator_st │
[rank0]: │ ❱ 107 │   │   self.optimizer.load_state_dict(state_dict)                     │
[rank0]: │   108 │                                                                      │
[rank0]: │   109 │   def state_dict(self):                                              │
[rank0]: │   110 │   │   return self.optimizer.state_dict()                             │
[rank0]: │                                                                              │
[rank0]: │ /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/_compile.py:31 │
[rank0]: │ in inner                                                                     │
[rank0]: │                                                                              │
[rank0]: │   28 │   │   │   │   disable_fn = torch._dynamo.disable(fn, recursive)       │
[rank0]: │   29 │   │   │   │   fn.__dynamo_disable = disable_fn                        │
[rank0]: │   30 │   │   │                                                               │
[rank0]: │ ❱ 31 │   │   │   return disable_fn(*args, **kwargs)                          │
[rank0]: │   32 │   │                                                                   │
[rank0]: │   33 │   │   return inner                                                    │
[rank0]: │   34 │   else:                                                               │
[rank0]: │                                                                              │
[rank0]: │ /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/_dynamo/eval_f │
[rank0]: │ rame.py:600 in _fn                                                           │
[rank0]: │                                                                              │
[rank0]: │    597 │   │   def _fn(*args, **kwargs):                                     │
[rank0]: │    598 │   │   │   prior = set_eval_frame(callback)                          │
[rank0]: │    599 │   │   │   try:                                                      │
[rank0]: │ ❱  600 │   │   │   │   return fn(*args, **kwargs)                            │
[rank0]: │    601 │   │   │   finally:                                                  │
[rank0]: │    602 │   │   │   │   set_eval_frame(prior)                                 │
[rank0]: │    603                                                                       │
[rank0]: │                                                                              │
[rank0]: │ /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/optim/optimize │
[rank0]: │ r.py:854 in load_state_dict                                                  │
[rank0]: │                                                                              │
[rank0]: │    851 │   │   param_lens = (len(g["params"]) for g in groups)               │
[rank0]: │    852 │   │   saved_lens = (len(g["params"]) for g in saved_groups)         │
[rank0]: │    853 │   │   if any(p_len != s_len for p_len, s_len in zip(param_lens, sav │
[rank0]: │ ❱  854 │   │   │   raise ValueError(                                         │
[rank0]: │    855 │   │   │   │   "loaded state dict contains a parameter group "       │
[rank0]: │    856 │   │   │   │   "that doesn't match the size of optimizer's group"    │
[rank0]: │    857 │   │   │   )                                                         │
[rank0]: ╰──────────────────────────────────────────────────────────────────────────────╯
[rank0]: ValueError: loaded state dict contains a parameter group that doesn't match the 
[rank0]: size of optimizer's group
E1015 18:47:35.719000 139827737793152 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 4075482) of binary: /home/zhangjunyi/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/zhangjunyi/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
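
The ValueError comes from torch.optim.Optimizer.load_state_dict, which requires the param_groups saved in optimizer.pt to line up one-to-one (same number of groups, same number of params per group) with the optimizer the Trainer rebuilds on resume. Below is a hypothetical diagnostic sketch for comparing the two; the checkpoint path and the `trainer` variable are assumptions, not taken from the report:

# Hypothetical diagnostic sketch: compare the optimizer state saved in the
# checkpoint against the optimizer the Trainer builds for the X-LoRA model.
# `trainer` is assumed to be the Seq2SeqTrainer constructed in finetune.py,
# and the checkpoint path is illustrative.
import torch

saved_state = torch.load("./output_1026/checkpoint-20/optimizer.pt", map_location="cpu")
live_optimizer = trainer.create_optimizer()  # builds trainer.optimizer if it does not exist yet

saved_groups = saved_state["param_groups"]
live_groups = live_optimizer.state_dict()["param_groups"]

print(f"saved groups: {len(saved_groups)}, live groups: {len(live_groups)}")
for i, (s, l) in enumerate(zip(saved_groups, live_groups)):
    # Any mismatch in a group's parameter count triggers the ValueError above.
    print(f"group {i}: saved params = {len(s['params'])}, live params = {len(l['params'])}")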

[Screenshot: 微信截图_20241029160550] This is the checkpoint saved while training with X-LoRA.

Expected behavior / 期待表现

Any guidance on resolving this X-LoRA checkpoint-resume problem would be greatly appreciated. If anyone has run into a similar issue, or has insight into the specific settings or steps needed to get checkpoint resumption working with X-LoRA, your advice would be invaluable. Support from any maintainers or community members familiar with X-LoRA would also be very helpful. Many thanks!

zhipuch commented 3 days ago

Please first try following https://zhipu-ai.feishu.cn/wiki/QanjwjOuaiWMZ6kdVZfcNZwCnBh?fromScene=spaceOverview and see whether that works; add yes at the end of the command.

SongHanKen commented 2 days ago

Thanks for your reply. I followed the demo and appended yes to the command, and I also tried calling trainer.train(resume_from_checkpoint="/home/zhangjunyi/hs_test/finetune_demo/output_new/checkpoint-20") directly in finetune.py. Neither approach is able to resume training. A minimal sketch of that second approach follows (everything else in finetune.py is unchanged):
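
# Minimal sketch of the explicit resume attempt in finetune.py; the checkpoint
# path is the one from my run.
trainer.train(
    resume_from_checkpoint="/home/zhangjunyi/hs_test/finetune_demo/output_new/checkpoint-20"
)
# This fails with the same "parameter group that doesn't match" ValueError
# raised while loading optimizer.pt, just like appending yes to the command.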