ChatGLM3 series: Open Bilingual Chat LLMs
Error when fine-tuning with P-Tuning v2 #1237

Closed cskaoyan closed 2 months ago

cskaoyan commented 2 months ago

System Info

2024-05-28 02:13:29.805993: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
[Additional TensorFlow/CUDA warnings omitted]
Loading checkpoint shards: 100% 7/7 [06:11<00:00, 53.12s/it]
Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at ../../chatglm3-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
--> Model

--> model has 1.835008M params

Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 114599 examples [00:02, 53207.11 examples/s]
Setting num_proc from 16 back to 1 for the validation split to disable multiprocessing as it only contains one shard.
Generating validation split: 1070 examples [00:00, 158672.93 examples/s]
Setting num_proc from 16 back to 1 for the test split to disable multiprocessing as it only contains one shard.
Generating test split: 1070 examples [00:00, 158521.61 examples/s]
Map (num_proc=16): 100% 114599/114599 [00:03<00:00, 32596.84 examples/s]
train_dataset: Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 114599
})
Map (num_proc=16): 100% 1070/1070 [00:00<00:00, 1337.98 examples/s]
val_dataset: Dataset({
    features: ['input_ids', 'output_ids'],
    num_rows: 1070
})
Map (num_proc=16): 100% 1070/1070 [00:00<00:00, 1350.87 examples/s]
test_dataset: Dataset({
    features: ['input_ids', 'output_ids'],
    num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
[Additional sanity check output omitted for brevity] 31010 -> -100 '宽松': 40833 -> -100 '': 30998 -> -100 '风格': 32799 -> -100 '#': 31010 -> -100 '性感': 40589 -> -100 '': 30998 -> -100 '图案': 37505 -> -100 '#': 31010 -> -100 '线条': 37216 -> -100 '': 30998 -> -100 '裤': 56532 -> -100 '型': 54888 -> -100 '#': 31010 -> -100 '阔': 56529 -> -100 '腿': 56158 -> -100 '裤': 56532 -> -100 '<|assistant|>': 64796 -> -100 '': 30910 -> 30910 '\n': 13 -> 13 '': 30910 -> 30910 '宽松': 40833 -> 40833 '的': 54530 -> 54530 '阔': 56529 -> 56529 '腿': 56158 -> 56158 '裤': 56532 -> 56532 '这': 54551 -> 54551 '两年': 33808 -> 33808 '真的': 32041 -> 32041 '吸': 55360 -> 55360 '粉': 55486 -> 55486 '不少': 32138 -> 32138 ',': 31123 -> 31123 '明星': 32943 -> 32943 '时尚': 33481 -> 33481 '达': 54880 -> 54880 '人的': 31664 -> 31664 '心头': 46565 -> 46565 '爱': 54799 -> 54799 '。': 31155 -> 31155 '毕竟': 33051 -> 33051 '好': 54591 -> 54591 '穿': 55432 -> 55432 '时尚': 33481 -> 33481 ',': 31123 -> 31123 '谁': 55622 -> 55622 '都能': 32904 -> 32904 '穿': 55432 -> 55432 '出': 54557 -> 54557 '腿': 56158 -> 56158 '长': 54625 -> 54625 '2': 30943 -> 30943 '米': 55055 -> 55055 '的效果': 35590 -> 35590 '宽松': 40833 -> 40833 '的': 54530 -> 54530 '裤': 56532 -> 56532 '腿': 56158 -> 56158 ',': 31123 -> 31123 '当然是': 48466 -> 48466 '遮': 57148 -> 57148 '肉': 55343 -> 55343 '小': 54603 -> 54603 '能手': 49355 -> 49355 '啊': 55674 -> 55674 '。': 31155 -> 31155 '上身': 51605 -> 51605 '随': 55119 -> 55119 '性': 54642 -> 54642 '自然': 31799 -> 31799 '不': 54535 -> 54535 '拘': 57036 -> 57036 '束': 55625 -> 55625 ',': 31123 -> 31123 '面料': 46839 -> 46839 '亲': 55113 -> 55113 '肤': 56089 -> 56089 '舒适': 33894 -> 33894 '贴': 55778 -> 55778 '身体': 31902 -> 31902 '验': 55017 -> 55017 '感': 54706 -> 54706 '棒': 56382 -> 56382 '棒': 56382 -> 56382 '哒': 59230 -> 59230 '。': 31155 -> 31155 '系': 54712 -> 54712 '带': 54882 -> 54882 '部分': 31726 -> 31726 '增加': 31917 -> 31917 '设计': 31735 -> 31735 '看点': 45032 -> 45032 ',': 31123 -> 31123 '还': 54656 -> 54656 '让': 54772 -> 54772 '单品': 46539 -> 46539 '的设计': 34481 -> 
34481 '感': 54706 -> 54706 '更强': 43084 -> 43084 '。':
max_steps is given, it will override any value given in num_train_epochs
[2024-05-28 02:20:19,966] [INFO] [] Setting ds_accelerator to cuda (auto detect)
Running training
  Num examples = 114,599
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 3,000
  Number of trainable parameters = 1,835,008
0% 0/3000 [00:00<?, ?it/s]
{'loss': 4.848, 'grad_norm': 0.044002484530210495, 'learning_rate': 4.9833333333333336e-05, 'epoch': 0.0}
[Training logs continue...]
17% 500/3000 [06:29<36:35, 1.14it/s]
Running Evaluation
  Num examples = 50
  Batch size = 16
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /usr/local/lib/python3.10/dist-packages/accelerate/utils/ in send_to_device     │
│                                                                                                  │
│   155 │   │   if is_torch_tensor(tensor) and tensor.device.type in ["mlu"] and tensor.dtype in   │
│   156 │   │   │   tensor = tensor.cpu()                                                          │
│   157 │   │   try:                                                                               │
│ ❱ 158 │   │   │   return, non_blocking=non_blocking)                │
│   159 │   │   except TypeError:  # .to() doesn't accept non_blocking as kwarg                    │
│   160 │   │   │   return                                                        │
│   161 │   │   except AssertionError as error:                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: got an unexpected keyword argument 'non_blocking'

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /content/drive/MyDrive/ChatGLM3/finetune_demo/ in main                              │
│                                                                                                  │
│   529 │        )                                                                                 │
│   530 │                                                                                          │
│   531 │        if auto_resume_from_checkpoint.upper() == "" or auto_resume_from_checkpoint is None:  │
│ ❱ 532 │   │        trainer.train()                                                               │
│   533 │        else:                                                                             │
│   534 │   │        output_dir = ft_config.training_args.output_dir                               │
│   535 │   │        dirlist = os.listdir(output_dir)                                              │
[Full traceback continues through multiple function calls...]
│ ❱ 800 │   │   │ = {k: for k, v in}                │
│   801 │   │   else:                                                                              │
│   802 │   │   │   logger.warning(f"Attempting to cast a BatchEncoding to type {str(device)}. T   │
│   803 │   │   return self                                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'to'
17% 500/3000 [06:32<32:41, 1.27it/s]
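
The final AttributeError suggests that at least one value in the evaluation batch was None when the batch was moved to the device. The following is a minimal diagnostic sketch, not the actual fix: it uses a plain dict with lists standing in for tensors, and both the helper name find_none_fields and the None-valued "labels" entry are hypothetical, chosen only to illustrate the check.

```python
from typing import Any, List, Mapping


def find_none_fields(batch: Mapping[str, Any]) -> List[str]:
    """Return the keys of a batch whose value is None."""
    return [key for key, value in batch.items() if value is None]


# Hypothetical evaluation batch; lists stand in for tensors.
# Which field (if any) is None in the real run is an assumption.
example_batch = {
    "input_ids": [[64790, 64792]],
    "output_ids": [[30910, 13]],
    "labels": None,
}

print(find_none_fields(example_batch))
```

Calling such a check on a batch just before evaluation would show which field the collator left empty.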

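Steps to reproduce:
1. Download the chatglm3 weights.
2. Upload the original lora.yaml fine-tuning Jupyter notebook to Colab.
3. Change the training command to: !python data/AdvertiseGen_fix ../../chatglm3-6b configs/ptuning_v2.yaml
4. I have a paid Colab subscription, so hardware resources are not the problem; the GPU exceeds the required configuration.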

ln410 commented 2 months ago


zRzRzRzRzRzRzR commented 2 months ago


shuye-cheung commented 2 months ago


dl-dayup commented 2 months ago

Downgrading transformers from 4.41.1 to 4.40.0 works; training now runs.
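
Before downgrading, it may help to confirm which transformers version is actually installed in the environment. The sketch below uses only the standard library; the helper names version_tuple and installed_version are made up for this example, and 4.40.0 is simply the version reported as working in this thread.

```python
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '4.40.0' into (4, 40, 0); stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


KNOWN_GOOD = "4.40.0"  # version reported to work in this thread

current = installed_version("transformers")
if current is None:
    print("transformers is not installed")
elif version_tuple(current) > version_tuple(KNOWN_GOOD):
    print(f"transformers {current} is newer than {KNOWN_GOOD}; "
          f"consider: pip install transformers=={KNOWN_GOOD}")
else:
    print(f"transformers {current} is at or below {KNOWN_GOOD}")
```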

xudl33 commented 1 week ago

Same problem here. Downgrading transformers to 4.40.0 did not help either; the error changed to TrainerState.__init__() got an unexpected keyword argument 'stateful_callbacks'.

resume checkpoint from checkpoint-500
Loading model from ./output/checkpoint-500.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /llm/ChatGLM3/finetune_demo/ in main                                                             │
│                                                                                                  │
│   549 │   │   │   │   model.enable_input_require_grads()                                         │
│   550 │   │   │   │   checkpoint_directory = os.path.join(output_dir, "checkpoint-" + str(chec   │
│   551 │   │   │   │   print("resume checkpoint from checkpoint-" + str(checkpoint_sn))           │
│ ❱ 552 │   │   │   │   trainer.train(resume_from_checkpoint=checkpoint_directory)                 │
│   553 │   │   │   else:                                                                          │
│   554 │   │   │   │   trainer.train()                                                            │
│   555 │   │   else:                                                                              │
│                                                                                                  │
│ /root/miniconda3/envs/charglm3_finetune_demo/lib/python3.10/site-packages/transformers/trainer.p │
│ y:1833 in train                                                                                  │
│                                                                                                  │
│   1830 │   │   │   if not is_sagemaker_mp_enabled() and not self.is_deepspeed_enabled and not s  │
│   1831 │   │   │   │   self._load_from_checkpoint(resume_from_checkpoint)                        │
│   1832 │   │   │   # In case of repeating the find_executable_batch_size, set `self._train_batc  │
│ ❱ 1833 │   │   │   state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRA  │
│   1834 │   │   │   if state.train_batch_size is not None:                                        │
│   1835 │   │   │   │   self._train_batch_size = state.train_batch_size                           │
│   1836                                                                                           │
│                                                                                                  │
│ /root/miniconda3/envs/charglm3_finetune_demo/lib/python3.10/site-packages/transformers/trainer_c │
│ in load_from_json                                                                                │
│                                                                                                  │
│   120 │   │   """Create an instance from the content of json_path."""                            │
│   121 │   │   with open(json_path, "r", encoding="utf-8") as f:                                  │
│   122 │   │   │   text =                                                                         │
│ ❱ 123 │   │   return cls(**json.loads(text))                                                     │
│   124                                                                                            │
│   125                                                                                            │
│   126 @dataclass                                                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: TrainerState.__init__() got an unexpected keyword argument 'stateful_callbacks'
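
One possible explanation, stated here as an assumption rather than a confirmed diagnosis: checkpoint-500 may have been written while the newer transformers release was still installed, so its saved trainer state carries a stateful_callbacks entry that the TrainerState dataclass in 4.40.0 does not accept. If that is the case, either starting from a fresh output directory or dropping the unknown keys before resuming should avoid the TypeError. The sketch below illustrates the second idea with a hypothetical LegacyState dataclass standing in for the older TrainerState; drop_unknown_keys and the sample JSON are likewise invented for the example.

```python
import dataclasses
import json


@dataclasses.dataclass
class LegacyState:
    """Hypothetical stand-in for an older TrainerState without 'stateful_callbacks'."""
    global_step: int = 0
    epoch: float = 0.0
    train_batch_size: int = 0


def drop_unknown_keys(state: dict, state_cls: type) -> dict:
    """Keep only the keys that are declared as fields on the dataclass."""
    allowed = {field.name for field in dataclasses.fields(state_cls)}
    return {key: value for key, value in state.items() if key in allowed}


# Made-up sample of what a newer trainer state file might contain.
saved = json.loads(
    '{"global_step": 500, "epoch": 0.02, "train_batch_size": 4, "stateful_callbacks": {}}'
)

cleaned = drop_unknown_keys(saved, LegacyState)
state = LegacyState(**cleaned)  # LegacyState(**saved) would raise TypeError
print(state)
```

With the real library, the same filter could presumably be applied to the trainer state JSON inside the checkpoint directory, passing transformers.TrainerState as state_cls; back up the file first, since this discards whatever the newer version stored there.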