THUDM / ChatGLM3

ChatGLM3 series: Open Bilingual Chat LLMs | 开源双语对话语言模型
Apache License 2.0

The following error is reported when running p-tuning-v2 fine-tuning #1237

Closed cskaoyan closed 2 months ago

cskaoyan commented 2 months ago

System Info / 系統信息

2024-05-28 02:13:29.805993: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-28 02:13:29.860367: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-28 02:13:29.860423: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-28 02:13:29.862445: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-28 02:13:29.871211: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-28 02:13:31.034588: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1474: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
  warnings.warn(
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100% 7/7 [06:11<00:00, 53.12s/it]
Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at ../../chatglm3-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
--> Model

--> model has 1.835008M params

/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock. self.pid = os.fork() /usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock. self.pid = os.fork() Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard. Generating train split: 114599 examples [00:02, 53207.11 examples/s] Setting num_proc from 16 back to 1 for the validation split to disable multiprocessing as it only contains one shard. Generating validation split: 1070 examples [00:00, 158672.93 examples/s] Setting num_proc from 16 back to 1 for the test split to disable multiprocessing as it only contains one shard. Generating test split: 1070 examples [00:00, 158521.61 examples/s] /usr/local/lib/python3.10/dist-packages/multiprocess/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock. self.pid = os.fork() Map (num_proc=16): 100% 114599/114599 [00:03<00:00, 32596.84 examples/s] train_dataset: Dataset({ features: ['input_ids', 'labels'], num_rows: 114599 }) Map (num_proc=16): 100% 1070/1070 [00:00<00:00, 1337.98 examples/s] val_dataset: Dataset({ features: ['input_ids', 'output_ids'], num_rows: 1070 }) Map (num_proc=16): 100% 1070/1070 [00:00<00:00, 1350.87 examples/s] test_dataset: Dataset({ features: ['input_ids', 'output_ids'], num_rows: 1070 }) --> Sanity check '[gMASK]': 64790 -> -100 'sop': 64792 -> -100 '<|user|>': 64795 -> -100 '': 30910 -> -100 '\n': 13 -> -100 '': 30910 -> -100 '类型': 33467 -> -100 '#': 31010 -> -100 '裤': 56532 -> -100 '': 30998 -> -100 '版': 55090 -> -100 '型': 54888 -> -100 '#': 31010 -> -100 '宽松': 40833 -> -100 '': 30998 -> -100 '风格': 32799 -> -100 '#': 31010 -> -100 '性感': 40589 -> -100 '': 30998 -> -100 '图案': 37505 -> -100 '#': 31010 -> -100 '线条': 37216 -> -100 '': 30998 -> -100 '裤': 56532 -> -100 '型': 54888 -> -100 '#': 31010 -> -100 '阔': 56529 -> -100 '腿': 56158 -> -100 '裤': 56532 -> -100 '<|assistant|>': 64796 -> -100 '': 30910 -> 30910 '\n': 13 -> 13 '': 30910 -> 30910 '宽松': 40833 -> 40833 '的': 54530 -> 54530 '阔': 56529 -> 56529 '腿': 56158 -> 56158 '裤': 56532 -> 56532 '这': 54551 -> 54551 '两年': 33808 -> 33808 '真的': 32041 -> 32041 '吸': 55360 -> 55360 '粉': 55486 -> 55486 '不少': 32138 -> 32138 ',': 31123 -> 31123 '明星': 32943 -> 32943 '时尚': 33481 -> 33481 '达': 54880 -> 54880 '人的': 31664 -> 31664 '心头': 46565 -> 46565 '爱': 54799 -> 54799 '。': 31155 -> 31155 '毕竟': 33051 -> 33051 '好': 54591 -> 54591 '穿': 55432 -> 55432 '时尚': 33481 -> 33481 ',': 31123 -> 31123 '谁': 55622 -> 55622 '都能': 32904 -> 32904 '穿': 55432 -> 55432 '出': 54557 -> 54557 '腿': 56158 -> 56158 '长': 54625 -> 54625 '2': 30943 -> 30943 '米': 55055 -> 55055 '的效果': 35590 -> 35590 '宽松': 40833 -> 40833 '的': 54530 -> 54530 '裤': 56532 -> 56532 '腿': 56158 -> 56158 ',': 31123 -> 31123 '当然是': 48466 -> 48466 '遮': 57148 -> 57148 '肉': 55343 -> 55343 '小': 54603 -> 54603 '能手': 49355 -> 49355 '啊': 55674 -> 55674 '。': 31155 -> 31155 '上身': 51605 -> 51605 '随': 55119 -> 55119 '性': 54642 -> 54642 '自然': 31799 -> 31799 '不': 54535 -> 54535 '拘': 57036 -> 57036 '束': 55625 -> 55625 ',': 31123 -> 31123 '面料': 46839 -> 46839 '亲': 55113 -> 55113 '肤': 56089 -> 56089 '舒适': 33894 -> 33894 '贴': 55778 -> 55778 
'身体': 31902 -> 31902 '验': 55017 -> 55017 '感': 54706 -> 54706 '棒': 56382 -> 56382 '棒': 56382 -> 56382 '哒': 59230 -> 59230 '。': 31155 -> 31155 '系': 54712 -> 54712 '带': 54882 -> 54882 '部分': 31726 -> 31726 '增加': 31917 -> 31917 '设计': 31735 -> 31735 '看点': 45032 -> 45032 ',': 31123 -> 31123 '还': 54656 -> 54656 '让': 54772 -> 54772 '单品': 46539 -> 46539 '的设计': 34481 -> 34481 '感': 54706 -> 54706 '更强': 43084 -> 43084 '。': 31155 -> 31155 '腿部': 46799 -> 46799 '线条': 37216 -> 37216 '若': 55351 -> 55351 '隐': 55733 -> 55733 '若': 55351 -> 55351 '现': 54600 -> 54600 '的': 54530 -> 54530 ',': 31123 -> 31123 '性感': 40589 -> 40589 '撩': 58521 -> 58521 '人': 54533 -> 54533 '。': 31155 -> 31155 '颜色': 33692 -> 33692 '敲': 57004 -> 57004 '温柔': 34678 -> 34678 '的': 54530 -> 54530 ',': 31123 -> 31123 '与': 54619 -> 54619 '裤子': 44722 -> 44722 '本身': 32754 -> 32754 '所': 54626 -> 54626 '呈现': 33169 -> 33169 '的风格': 48084 -> 48084 '有点': 33149 -> 33149 '反': 54955 -> 54955 '差': 55342 -> 55342 '萌': 56842 -> 56842 '。': 31155 -> 31155 '': 2 -> 2 max_steps is given, it will override any value given in num_train_epochs [2024-05-28 02:20:19,966] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) Running training Num examples = 114,599 Num Epochs = 1 Instantaneous batch size per device = 4 Total train batch size (w. parallel, distributed & accumulation) = 4 Gradient Accumulation steps = 1 Total optimization steps = 3,000 Number of trainable parameters = 1,835,008 0% 0/3000 [00:00<?, ?it/s]/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock. self.pid = os.fork() {'loss': 4.848, 'grad_norm': 0.044002484530210495, 'learning_rate': 4.9833333333333336e-05, 'epoch': 0.0} {'loss': 4.7539, 'grad_norm': 0.04938019812107086, 'learning_rate': 4.966666666666667e-05, 'epoch': 0.0} {'loss': 4.8922, 'grad_norm': 0.03882051259279251, 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.0} {'loss': 4.8078, 'grad_norm': 0.049995895475149155, 'learning_rate': 4.933333333333334e-05, 'epoch': 0.0} {'loss': 4.9402, 'grad_norm': 0.03917162865400314, 'learning_rate': 4.9166666666666665e-05, 'epoch': 0.0} {'loss': 4.8363, 'grad_norm': 0.06000753119587898, 'learning_rate': 4.9e-05, 'epoch': 0.0} {'loss': 4.8016, 'grad_norm': 0.03949465975165367, 'learning_rate': 4.883333333333334e-05, 'epoch': 0.0} {'loss': 4.7992, 'grad_norm': 0.04238777607679367, 'learning_rate': 4.866666666666667e-05, 'epoch': 0.0} {'loss': 4.7391, 'grad_norm': 0.04787901043891907, 'learning_rate': 4.85e-05, 'epoch': 0.0} {'loss': 4.877, 'grad_norm': 0.04371006414294243, 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.0} {'loss': 4.7984, 'grad_norm': 0.048314932733774185, 'learning_rate': 4.8166666666666674e-05, 'epoch': 0.0} {'loss': 4.9926, 'grad_norm': 0.04376620426774025, 'learning_rate': 4.8e-05, 'epoch': 0.0} {'loss': 4.8383, 'grad_norm': 0.043217916041612625, 'learning_rate': 4.7833333333333335e-05, 'epoch': 0.0} {'loss': 4.902, 'grad_norm': 0.05563262477517128, 'learning_rate': 4.766666666666667e-05, 'epoch': 0.0} {'loss': 4.8531, 'grad_norm': 0.040262993425130844, 'learning_rate': 4.75e-05, 'epoch': 0.01} {'loss': 4.9492, 'grad_norm': 0.04491027817130089, 'learning_rate': 4.7333333333333336e-05, 'epoch': 0.01} {'loss': 4.8504, 'grad_norm': 0.04221414402127266, 'learning_rate': 4.716666666666667e-05, 'epoch': 0.01} {'loss': 4.7629, 'grad_norm': 0.04584379121661186, 'learning_rate': 4.7e-05, 
'epoch': 0.01} {'loss': 4.8762, 'grad_norm': 0.04659867659211159, 'learning_rate': 4.683333333333334e-05, 'epoch': 0.01} {'loss': 4.8695, 'grad_norm': 0.044780705124139786, 'learning_rate': 4.666666666666667e-05, 'epoch': 0.01} {'loss': 4.7547, 'grad_norm': 0.04107285290956497, 'learning_rate': 4.6500000000000005e-05, 'epoch': 0.01} {'loss': 4.8383, 'grad_norm': 0.034550223499536514, 'learning_rate': 4.633333333333333e-05, 'epoch': 0.01} {'loss': 4.7816, 'grad_norm': 0.040182407945394516, 'learning_rate': 4.6166666666666666e-05, 'epoch': 0.01} {'loss': 4.7193, 'grad_norm': 0.04085628315806389, 'learning_rate': 4.600000000000001e-05, 'epoch': 0.01} {'loss': 4.6914, 'grad_norm': 0.04454224929213524, 'learning_rate': 4.5833333333333334e-05, 'epoch': 0.01} {'loss': 4.7996, 'grad_norm': 0.048691339790821075, 'learning_rate': 4.566666666666667e-05, 'epoch': 0.01} {'loss': 4.8035, 'grad_norm': 0.049234725534915924, 'learning_rate': 4.55e-05, 'epoch': 0.01} {'loss': 4.8773, 'grad_norm': 0.045386116951704025, 'learning_rate': 4.5333333333333335e-05, 'epoch': 0.01} {'loss': 4.8859, 'grad_norm': 0.03528638556599617, 'learning_rate': 4.516666666666667e-05, 'epoch': 0.01} {'loss': 4.8734, 'grad_norm': 0.04855549335479736, 'learning_rate': 4.5e-05, 'epoch': 0.01} {'loss': 4.6406, 'grad_norm': 0.04950821399688721, 'learning_rate': 4.483333333333333e-05, 'epoch': 0.01} {'loss': 4.902, 'grad_norm': 0.045099250972270966, 'learning_rate': 4.466666666666667e-05, 'epoch': 0.01} {'loss': 4.6484, 'grad_norm': 0.04783552139997482, 'learning_rate': 4.4500000000000004e-05, 'epoch': 0.01} {'loss': 4.7621, 'grad_norm': 0.04738068953156471, 'learning_rate': 4.433333333333334e-05, 'epoch': 0.01} {'loss': 4.7652, 'grad_norm': 0.04288496822118759, 'learning_rate': 4.4166666666666665e-05, 'epoch': 0.01} {'loss': 4.9242, 'grad_norm': 0.0427871011197567, 'learning_rate': 4.4000000000000006e-05, 'epoch': 0.01} {'loss': 4.6539, 'grad_norm': 0.041646551340818405, 'learning_rate': 4.383333333333334e-05, 'epoch': 0.01} {'loss': 4.7441, 'grad_norm': 0.034412700682878494, 'learning_rate': 4.3666666666666666e-05, 'epoch': 0.01} {'loss': 4.7668, 'grad_norm': 0.044905249029397964, 'learning_rate': 4.35e-05, 'epoch': 0.01} {'loss': 4.734, 'grad_norm': 0.041189733892679214, 'learning_rate': 4.3333333333333334e-05, 'epoch': 0.01} {'loss': 5.0141, 'grad_norm': 0.04259607195854187, 'learning_rate': 4.316666666666667e-05, 'epoch': 0.01} {'loss': 4.657, 'grad_norm': 0.03369728848338127, 'learning_rate': 4.3e-05, 'epoch': 0.01} {'loss': 4.8316, 'grad_norm': 0.04443185403943062, 'learning_rate': 4.2833333333333335e-05, 'epoch': 0.02} {'loss': 4.6871, 'grad_norm': 0.05172109231352806, 'learning_rate': 4.266666666666667e-05, 'epoch': 0.02} {'loss': 4.7027, 'grad_norm': 0.043045446276664734, 'learning_rate': 4.25e-05, 'epoch': 0.02} {'loss': 4.648, 'grad_norm': 0.04251503199338913, 'learning_rate': 4.233333333333334e-05, 'epoch': 0.02} {'loss': 4.7918, 'grad_norm': 0.03760859742760658, 'learning_rate': 4.216666666666667e-05, 'epoch': 0.02} {'loss': 4.7535, 'grad_norm': 0.044650498777627945, 'learning_rate': 4.2e-05, 'epoch': 0.02} {'loss': 4.7223, 'grad_norm': 0.04396400973200798, 'learning_rate': 4.183333333333334e-05, 'epoch': 0.02} {'loss': 4.85, 'grad_norm': 0.04238654673099518, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.02} 17% 500/3000 [06:29<36:35, 1.14it/s] Running Evaluation Num examples = 50 Batch size = 16 ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ 
/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py:158 in send_to_device │ │ │ │ 155 │ │ if is_torch_tensor(tensor) and tensor.device.type in ["mlu"] and tensor.dtype in │ │ 156 │ │ │ tensor = tensor.cpu() │ │ 157 │ │ try: │ │ ❱ 158 │ │ │ return tensor.to(device, non_blocking=non_blocking) │ │ 159 │ │ except TypeError: # .to() doesn't accept non_blocking as kwarg │ │ 160 │ │ │ return tensor.to(device) │ │ 161 │ │ except AssertionError as error: │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ TypeError: BatchEncoding.to() got an unexpected keyword argument 'non_blocking'

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /content/drive/MyDrive/ChatGLM3/finetune_demo/finetune_hf.py:532 in main │ │ │ │ 529 │ ) │ │ 530 │ │ │ 531 │ if auto_resume_from_checkpoint.upper() == "" or auto_resume_from_checkpoint is None: │ │ ❱ 532 │ │ trainer.train() │ │ 533 │ else: │ │ 534 │ │ output_dir = ft_config.training_args.output_dir │ │ 535 │ │ dirlist = os.listdir(output_dir) │ │ │ │ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1885 in train │ │ │ │ 1882 │ │ │ finally: │ │ 1883 │ │ │ │ hf_hub_utils.enable_progress_bars() │ │ 1884 │ │ else: │ │ ❱ 1885 │ │ │ return inner_training_loop( │ │ 1886 │ │ │ │ args=args, │ │ 1887 │ │ │ │ resume_from_checkpoint=resume_from_checkpoint, │ │ 1888 │ │ │ │ trial=trial, │ │ │ │ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2291 in _inner_training_loop │ │ │ │ 2288 │ │ │ │ │ self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epo │ │ 2289 │ │ │ │ │ self.control = self.callback_handler.on_step_end(args, self.state, s │ │ 2290 │ │ │ │ │ │ │ ❱ 2291 │ │ │ │ │ self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoc │ │ 2292 │ │ │ │ else: │ │ 2293 │ │ │ │ │ self.control = self.callback_handler.on_substep_end(args, self.state │ │ 2294 │ │ │ │ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2721 in _maybe_log_save_evaluate │ │ │ │ 2718 │ │ │ │ 2719 │ │ metrics = None │ │ 2720 │ │ if self.control.should_evaluate: │ │ ❱ 2721 │ │ │ metrics = self.evaluate(ignore_keys=ignore_keys_for_eval) │ │ 2722 │ │ │ self._report_to_hp_search(trial, self.state.global_step, metrics) │ │ 2723 │ │ │ │ │ 2724 │ │ │ # Run delayed LR scheduler now that metrics are populated │ │ │ │ /usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py:180 in evaluate │ │ │ │ 177 │ │ # We don't want to drop samples in general │ │ 178 │ │ self.gather_function = self.accelerator.gather │ │ 179 │ │ self._gen_kwargs = gen_kwargs │ │ ❱ 180 │ │ return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix │ │ 181 │ │ │ 182 │ def predict( │ │ 183 │ │ self, │ │ │ │ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3572 in evaluate │ │ │ │ 3569 │ │ start_time = time.time() │ │ 3570 │ │ │ │ 3571 │ │ eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else se │ │ ❱ 3572 │ │ output = eval_loop( │ │ 3573 │ │ │ eval_dataloader, │ │ 3574 │ │ │ description="Evaluation", │ │ 3575 │ │ │ # No point gathering the predictions if there are no metrics, otherwise we d │ │ │ │ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3747 in evaluation_loop │ │ │ │ 3744 │ │ observed_num_examples = 0 │ │ 3745 │ │ │ │ 3746 │ │ # Main evaluation loop │ │ ❱ 3747 │ │ for step, inputs in enumerate(dataloader): │ │ 3748 │ │ │ # Update the observed num examples │ │ 3749 │ │ │ observed_batch_size = find_batch_size(inputs) │ │ 3750 │ │ │ if observed_batch_size is not None: │ │ │ │ /usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py:463 in iter │ │ │ │ 460 │ │ │ try: │ │ 461 │ │ │ │ # But we still move it to the device so it is done before StopIteration │ │ 462 │ │ │ │ if self.device is not None: │ │ ❱ 463 │ │ │ │ │ current_batch = send_to_device(current_batch, self.device, non_block │ │ 464 │ │ │ │ next_batch = next(dataloader_iter) │ │ 465 │ │ │ │ if batch_index >= self.skip_batches: │ │ 466 │ │ │ │ │ yield current_batch │ │ │ │ /usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py:160 
in send_to_device │ │ │ │ 157 │ │ try: │ │ 158 │ │ │ return tensor.to(device, non_blocking=non_blocking) │ │ 159 │ │ except TypeError: # .to() doesn't accept non_blocking as kwarg │ │ ❱ 160 │ │ │ return tensor.to(device) │ │ 161 │ │ except AssertionError as error: │ │ 162 │ │ │ #torch.Tensor.to()is not supported bytorch_npu` (see this [is │ │ 163 │ │ │ # This call is inside the try-block since is_npu_available is not supported │ │ │ │ /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:800 in to │ │ │ │ 797 │ │ # Otherwise it passes the casts down and casts the LongTensor containing the tok │ │ 798 │ │ # into a HalfTensor │ │ 799 │ │ if isinstance(device, str) or is_torch_device(device) or isinstance(device, int) │ │ ❱ 800 │ │ │ self.data = {k: v.to(device=device) for k, v in self.data.items()} │ │ 801 │ │ else: │ │ 802 │ │ │ logger.warning(f"Attempting to cast a BatchEncoding to type {str(device)}. T │ │ 803 │ │ return self │ │ │ │ /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:800 in │ │ │ │ │ │ 797 │ │ # Otherwise it passes the casts down and casts the LongTensor containing the tok │ │ 798 │ │ # into a HalfTensor │ │ 799 │ │ if isinstance(device, str) or is_torch_device(device) or isinstance(device, int) │ │ ❱ 800 │ │ │ self.data = {k: v.to(device=device) for k, v in self.data.items()} │ │ 801 │ │ else: │ │ 802 │ │ │ logger.warning(f"Attempting to cast a BatchEncoding to type {str(device)}. T │ │ 803 │ │ return self │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ AttributeError: 'NoneType' object has no attribute 'to' 17% 500/3000 [06:32<32:41, 1.27it/s]
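In short, accelerate's send_to_device() passes non_blocking to BatchEncoding.to(), which the installed transformers release does not accept, and the fallback .to(device) then hits a None value in the collated evaluation batch. A minimal sketch of that chain, run outside the trainer (the version assumption and the "output_ids": None stand-in are inferred from the trace, not taken from the repo):

```python
# Minimal sketch of the two chained errors above, run outside the trainer.
# Assumes a transformers release whose BatchEncoding.to() only takes a device
# (as the installed 4.41.x appears to here); "output_ids": None stands in for
# whatever entry the eval collator left as None -- both are assumptions.
import torch
from transformers import BatchEncoding

batch = BatchEncoding({"input_ids": torch.tensor([[1, 2, 3]]), "output_ids": None})

try:
    # accelerate.utils.send_to_device() first tries .to(device, non_blocking=...)
    batch.to("cpu", non_blocking=True)
except TypeError as err:
    print(err)       # BatchEncoding.to() got an unexpected keyword argument 'non_blocking'
    batch.to("cpu")  # fallback path: AttributeError: 'NoneType' object has no attribute 'to'
```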

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

1. Download the chatglm3 weights.
2. Upload the original lora.yaml fine-tuning Jupyter notebook to Colab.
3. Change the training command to: !python finetune_hf.py data/AdvertiseGen_fix ../../chatglm3-6b configs/ptuning_v2.yaml
4. I bought a Colab subscription, so hardware resources are not the issue; the GPU is above the required spec.

Expected behavior / 期待表现

What needs to be changed to fix this bug?

ln410 commented 2 months ago

I'm training with LoRA and hit exactly the same problem. After searching and analysis I'm fairly sure it's an environment issue, in particular the transformers package. I haven't solved it yet and am hoping the maintainers push a fix.

zRzRzRzRzRzRzR commented 2 months ago

Have you updated to the latest code?

shuye-cheung commented 2 months ago

It's a transformers version problem; try downgrading transformers from 4.41.1 to 4.40.0.

dl-dayup commented 2 months ago

Downgrading transformers from 4.41.1 to 4.40.0 works; training runs now.
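For anyone else landing here, a tiny pre-flight check that could run in the Colab cell before launching finetune_hf.py; this is a hypothetical helper, not part of finetune_demo, and the version bound only encodes what this thread reports (4.40.0 works, 4.41.1 fails):

```python
# Hypothetical pre-flight check, not part of the repo: fail fast if the installed
# transformers falls in the range this thread reports as broken during evaluation.
from packaging import version  # installed as a dependency of transformers
import transformers

v = version.parse(transformers.__version__)
if v >= version.parse("4.41.0"):
    raise SystemExit(
        f"transformers {v} is reported to crash during evaluation in this thread; "
        "try: pip install transformers==4.40.0"
    )
print(f"transformers {v}: no issue reported in this thread")
```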

xudl33 commented 1 week ago

Same problem here, and downgrading transformers to 4.40.0 didn't help either; the error changed to TrainerState.__init__() got an unexpected keyword argument 'stateful_callbacks'.

resume checkpoint from checkpoint-500
Loading model from ./output/checkpoint-500.

Traceback (most recent call last):

/llm/ChatGLM3/finetune_demo/finetune_hf.py:552 in main
    549                 model.enable_input_require_grads()
    550                 checkpoint_directory = os.path.join(output_dir, "checkpoint-" + str(chec
    551                 print("resume checkpoint from checkpoint-" + str(checkpoint_sn))
  ❱ 552                 trainer.train(resume_from_checkpoint=checkpoint_directory)
    553             else:
    554                 trainer.train()
    555         else:

/root/miniconda3/envs/charglm3_finetune_demo/lib/python3.10/site-packages/transformers/trainer.py:1833 in train
    1830             if not is_sagemaker_mp_enabled() and not self.is_deepspeed_enabled and not s
    1831                 self._load_from_checkpoint(resume_from_checkpoint)
    1832             # In case of repeating the find_executable_batch_size, set `self._train_batc
  ❱ 1833             state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRA
    1834             if state.train_batch_size is not None:
    1835                 self._train_batch_size = state.train_batch_size
    1836

/root/miniconda3/envs/charglm3_finetune_demo/lib/python3.10/site-packages/transformers/trainer_callback.py:123 in load_from_json
    120         """Create an instance from the content of json_path."""
    121         with open(json_path, "r", encoding="utf-8") as f:
    122             text = f.read()
  ❱ 123         return cls(**json.loads(text))
    124
    125
    126     @dataclass

TypeError: TrainerState.__init__() got an unexpected keyword argument 'stateful_callbacks'
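That second error looks like a checkpoint/version mismatch rather than the original bug: checkpoint-500 was written by a newer transformers that stores a stateful_callbacks field in trainer_state.json, and the downgraded TrainerState dataclass does not accept it. Starting a fresh run (or deleting the old output/checkpoint-* directories) avoids it; alternatively, a hedged workaround sketch (the path below is an assumption, adjust it to your run):

```python
# Hedged workaround sketch: drop the 'stateful_callbacks' key (written by newer
# transformers releases) from an existing checkpoint's trainer_state.json so an
# older TrainerState can load it again. The path below is an assumption.
import json

path = "./output/checkpoint-500/trainer_state.json"

with open(path, "r", encoding="utf-8") as f:
    state = json.load(f)

if state.pop("stateful_callbacks", None) is not None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2, ensure_ascii=False)
    print(f"removed 'stateful_callbacks' from {path}")
else:
    print("no 'stateful_callbacks' key found; nothing to change")
```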