datawhalechina / self-llm

《开源大模型食用指南》 (A Practical Guide to Open-Source LLMs): quickly deploy open-source large models in a Linux environment; a deployment tutorial better suited to users in China
Apache License 2.0

Fine-tuning Qwen1.5-0.5B raises PermissionError: [Errno 13] Permission denied: './output/Qwen1.5\checkpoint-100' #123

Open ykallan opened 1 month ago

ykallan commented 1 month ago

Training code:

from datasets import Dataset
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, \
    DataCollatorForSeq2Seq, TrainingArguments, Trainer
import torch

from peft import LoraConfig, TaskType, get_peft_model

json_path = r"E:\nlp_about\self-llm\dataset\huanhuan.json"
df = pd.read_json(json_path)
ds = Dataset.from_pandas(df)

def process_func(example):
    MAX_LENGTH = 384  # the tokenizer can split one Chinese character into several tokens, so allow extra length to keep examples intact
    input_ids, attention_mask, labels = [], [], []
    instruction = tokenizer(f"<|im_start|>system\n现在你要扮演皇帝身边的女人--甄嬛<|im_end|>\n<|im_start|>user\n{example['instruction'] + example['input']}<|im_end|>\n<|im_start|>assistant\n",
                            add_special_tokens=False)  # add_special_tokens=False: do not prepend special tokens
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]  # the EOS token should also be attended to, so append 1
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # truncate to MAX_LENGTH
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

pretrained_model = "E:/nlp_about/pretrained_model/Qwen_Qwen1.5-0.5B-Chat"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model, use_fast=False, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(pretrained_model, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)

tokenized_id = ds.map(process_func, remove_columns=ds.column_names)

# target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
target_modules = ["q_proj", "k_proj", "v_proj"]

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=target_modules,
    inference_mode=False,  # training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; see the LoRA paper for how it scales the update
    lora_dropout=0.1  # dropout rate
)
model = get_peft_model(model, config)

model.print_trainable_parameters()

model.enable_input_require_grads()

args = TrainingArguments(
    output_dir="./output/Qwen1.5",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=3,
    save_steps=100,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_id,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)

trainer.train()
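
For reference, a minimal sketch (not part of the original script; the output path is illustrative) of persisting the trained LoRA adapter and tokenizer once trainer.train() returns:

# Sketch only: continues the script above after training finishes.
adapter_dir = "./output/Qwen1.5/final_adapter"  # illustrative output path
model.save_pretrained(adapter_dir)      # writes the LoRA adapter weights and adapter_config.json
tokenizer.save_pretrained(adapter_dir)  # keep the tokenizer files alongside the adapter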

pip package versions:

datasets                      2.13.0
torch                         2.3.0
transformers                  4.37.0
peft                          0.10.1.dev0

After 100 training steps, checkpoint weights are written under the ./output/Qwen1.5 directory:

[Screenshot: files generated in the checkpoint-100 directory]

Then the following error is raised:

C:\Users\zsodata\.conda\envs\llama\python.exe E:/nlp_about/self-llm/zzzzz_finetune/qwen1_5-0.5b/finetune_05b.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 1,179,648 || all params: 465,167,360 || trainable%: 0.2536
  0%|          | 0/174 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
C:\Users\zsodata\.conda\envs\llama\lib\site-packages\torch\utils\checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
C:\Users\zsodata\.conda\envs\llama\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py:698: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
  6%|▌         | 10/174 [00:09<02:15,  1.21it/s]{'loss': 4.6898, 'learning_rate': 9.425287356321839e-05, 'epoch': 0.17}
 11%|█▏        | 20/174 [00:17<01:57,  1.31it/s]{'loss': 4.3714, 'learning_rate': 8.850574712643679e-05, 'epoch': 0.34}
 17%|█▋        | 30/174 [00:25<01:46,  1.35it/s]{'loss': 4.1143, 'learning_rate': 8.275862068965517e-05, 'epoch': 0.51}
 23%|██▎       | 40/174 [00:33<01:46,  1.26it/s]{'loss': 4.0417, 'learning_rate': 7.701149425287356e-05, 'epoch': 0.68}
 29%|██▊       | 50/174 [00:41<01:35,  1.30it/s]{'loss': 3.9572, 'learning_rate': 7.126436781609196e-05, 'epoch': 0.85}
 34%|███▍      | 60/174 [00:49<01:33,  1.22it/s]{'loss': 3.9067, 'learning_rate': 6.551724137931034e-05, 'epoch': 1.03}
 40%|████      | 70/174 [00:57<01:24,  1.23it/s]{'loss': 3.8744, 'learning_rate': 5.977011494252874e-05, 'epoch': 1.2}
 46%|████▌     | 80/174 [01:05<01:14,  1.26it/s]{'loss': 3.8866, 'learning_rate': 5.402298850574713e-05, 'epoch': 1.37}
 52%|█████▏    | 90/174 [01:13<01:05,  1.28it/s]{'loss': 3.8504, 'learning_rate': 4.827586206896552e-05, 'epoch': 1.54}
 57%|█████▋    | 100/174 [01:21<00:58,  1.26it/s]{'loss': 3.8403, 'learning_rate': 4.252873563218391e-05, 'epoch': 1.71}
C:\Users\zsodata\.conda\envs\llama\lib\site-packages\peft-0.10.1.dev0-py3.10.egg\peft\utils\save_and_load.py:195: UserWarning: Could not find a config file in E:/nlp_about/pretrained_model/Qwen_Qwen1.5-0.5B-Chat - will assume that the vocabulary was not modified.
  warnings.warn(
Traceback (most recent call last):
  File "E:\nlp_about\self-llm\zzzzz_finetune\qwen1_5-0.5b\finetune_05b.py", line 79, in <module>
    trainer.train()
  File "C:\Users\zsodata\.conda\envs\llama\lib\site-packages\transformers\trainer.py", line 1539, in train
    return inner_training_loop(
  File "C:\Users\zsodata\.conda\envs\llama\lib\site-packages\transformers\trainer.py", line 1929, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "C:\Users\zsodata\.conda\envs\llama\lib\site-packages\transformers\trainer.py", line 2300, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "C:\Users\zsodata\.conda\envs\llama\lib\site-packages\transformers\trainer.py", line 2418, in _save_checkpoint
    fd = os.open(output_dir, os.O_RDONLY)
PermissionError: [Errno 13] Permission denied: './output/Qwen1.5\\checkpoint-100'
 57%|█████▋    | 100/174 [01:21<01:00,  1.22it/s]

Process finished with exit code 1
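
The failing call in the traceback is inside transformers 4.37.0's Trainer._save_checkpoint, which in that release opens the just-renamed checkpoint directory with os.open(output_dir, os.O_RDONLY) in order to fsync it. On Windows, os.open cannot open a directory, so it raises PermissionError: [Errno 13] even though the checkpoint files were already written, which matches the screenshot above. A minimal sketch of the underlying OS behaviour, independent of the Trainer (the path is illustrative):

import os

# Sketch reproducing the OS-level cause on Windows: os.open() cannot open a
# directory there, which is what Trainer._save_checkpoint in transformers
# 4.37.0 attempts right after renaming the staging checkpoint directory.
checkpoint_dir = "./output/Qwen1.5/checkpoint-demo"  # illustrative path
os.makedirs(checkpoint_dir, exist_ok=True)
fd = os.open(checkpoint_dir, os.O_RDONLY)  # PermissionError: [Errno 13] on Windows
os.fsync(fd)
os.close(fd)

Later transformers releases guard this directory fsync so it is skipped on Windows; upgrading transformers beyond 4.37.0 should therefore avoid the error (treat the exact fixed version as an assumption and check the release notes), as should training on Linux or WSL as the maintainer suggests below.
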
ykallan commented 1 month ago

My initial suspicion is that the Windows machine's C: drive is out of space; I'll try another machine when I get back.

ykallan commented 1 month ago

> My initial suspicion is that the Windows machine's C: drive is out of space; I'll try another machine when I get back.

After freeing up space on the C: drive, the same error still occurs.

KMnO4-zx commented 1 month ago

We don't recommend following this tutorial on Windows.

ykallan commented 1 month ago

> We don't recommend following this tutorial on Windows.

Thanks for the reply. The same code now runs successfully on another machine, which is also running Windows.

KMnO4-zx commented 1 month ago

We advise against Windows because the Windows environment is full of uncertainty: the same error can come from different causes, so we don't recommend using Windows for learning.
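
For completeness, a minimal pre-flight check, offered only as a sketch (the version comparison and message are assumptions based on the traceback above), that fails fast at startup on the affected setup instead of at the first checkpoint save:

import os
import transformers
from packaging import version  # packaging ships as a transformers dependency

# Assumption: the unguarded checkpoint-directory fsync exists in transformers
# 4.37.0 and is skipped on Windows in later releases. Fail fast before training
# rather than at save_steps.
if os.name == "nt" and version.parse(transformers.__version__) == version.parse("4.37.0"):
    raise RuntimeError(
        "transformers 4.37.0 fsyncs the checkpoint directory after saving, which "
        "raises PermissionError on Windows; upgrade transformers or train on Linux/WSL."
    )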