QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] Error when saving the model after training #1115

Closed o-github-o closed 6 months ago

o-github-o commented 7 months ago

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

Fine-tuning Qwen-7B-Chat with the QLoRA recipe in https://github.com/QwenLM/Qwen/finetune.py: training itself runs without problems, but an error is raised when the model is saved.

File "finetune.py", line 366, in train trainer.train() File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train return inner_training_loop( File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval) File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate self._save_checkpoint(model, trial, metrics=metrics) File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/site-packages/transformers/trainer.py", line 2525, in _save_checkpoint self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME)) File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/site-packages/transformers/trainer_callback.py", line 113, in save_to_json json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n" File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/json/init.py", line 234, in dumps return cls( File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/json/encoder.py", line 201, in encode chunks = list(chunks) File "/home/xiezizhe/anaconda3090/envs/zach_script_recommendation/lib/python3.8/json/encoder.py", line 431, in _iterencode yield from _iterencode_dict(o, _current_indent_level) File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict yield from chunks File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/json/encoder.py", line 325, in _iterencode_list yield from chunks File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict yield from chunks File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/json/encoder.py", line 438, in _iterencode o = _default(o) File "/home/xxx/anaconda3090/envs/zach_script_recommendation/lib/python3.8/json/encoder.py", line 179, in default raise TypeError(f'Object of type {o.class.name} ' TypeError: Object of type Tensor is not JSON serializable

期望行为 | Expected Behavior

No response

复现方法 | Steps To Reproduce

No response

运行环境 | Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

备注 | Anything else?

No response

o-github-o commented 7 months ago

deepspeed==0.13.1 resolves the saving problem.

jklj077 commented 7 months ago

To better assist the community and facilitate troubleshooting of similar issues, please specify the versions of the following interdependent libraries you were using when the problem occurred:

  1. peft
  2. transformers
  3. accelerate
  4. deepspeed

Please provide this information so others can draw parallels and potentially identify the root cause more effectively (a small snippet for collecting these versions is sketched below).
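
For convenience, a small helper along these lines (hypothetical, not part of the original report) can collect the requested versions:

```python
# Hypothetical helper: print the installed versions of the libraries listed above.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("peft", "transformers", "accelerate", "deepspeed", "torch"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```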

yuanzhoulvpi2017 commented 6 months ago

You can take a look at my solution here: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/train_qwen#%E4%BF%AE%E6%94%B9%E7%9A%84%E7%9B%AE%E7%9A%84

  1. The problem is mainly in the _maybe_log_save_evaluate method of the transformers Trainer class.
  2. In that method, the trainer tries to save the grad_norm variable, but it is a Tensor and cannot be JSON-serialized, hence the error.
  3. The HzTrainer class in my code fixes this (a minimal sketch of the same idea follows below): https://github.com/yuanzhoulvpi2017/zero_nlp/blob/4fb3e8fb12b24c9d469ca88bee83e764c90bda8b/train_qwen/train_qwen2_sft.py#L237
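
As a rough illustration of that approach (a sketch only, not the actual HzTrainer code), one can subclass `Trainer` and convert `grad_norm` to a plain float before it reaches the trainer state, assuming the transformers 4.38.x signature `_maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)` seen in the traceback above; `PatchedTrainer` is a hypothetical name:

```python
# Minimal sketch (hypothetical, not the actual HzTrainer implementation):
# turn grad_norm from a torch.Tensor into a Python float before the base
# Trainer logs it, so TrainerState.save_to_json() can serialize the log history.
import torch
from transformers import Trainer


class PatchedTrainer(Trainer):
    def _maybe_log_save_evaluate(self, tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval):
        if isinstance(grad_norm, torch.Tensor):
            grad_norm = grad_norm.detach().item()  # tensor -> float (JSON-serializable)
        return super()._maybe_log_save_evaluate(
            tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval
        )
```

In finetune.py this would mean instantiating the subclass in place of the stock Trainer; the rest of the training setup stays unchanged.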

jklj077 commented 6 months ago

@yuanzhoulvpi2017 Thanks for the information. For now, the simplest workaround for this issue seems to be downgrading transformers to a version below 4.38.0, i.e. ensuring "transformers<4.38.0".