THUDM / ChatGLM3

ChatGLM3 series: Open Bilingual Chat LLMs | 开源双语对话语言模型
Apache License 2.0

LoRA fine-tuning error #1242

Closed ZhuXuesong7423 closed 1 month ago

ZhuXuesong7423 commented 2 months ago

System Info

CUDA: 12.2, transformers: 4.41.1, Python: 3.11, OS: Ubuntu 20.04.6

Who can help?

No response

Information

Reproduction

Following the official data format and the fine-tuning method in finetune_demo/lora_finetune.ipynb, I get the following error:

Traceback (most recent call last)

/mnt/llm/ChatGLM3/finetune_demo/finetune_hf.py:530 in main

      527             return_tensors='pt',
      528         ),
      529         train_dataset=train_dataset,
    ❱ 530         eval_dataset=val_dataset.select(list(range(50))),
      531         tokenizer=tokenizer if use_tokenizer else None,  # LORA does not need tokenizer
      532         compute_metrics=functools.partial(compute_metrics, tokenizer=tokenizer),
      533     )

/mnt/llm/ChatGLM3/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:567 in wrapper

      564             "output_all_columns": self._output_all_columns,
      565         }
      566         # apply actual function
    ❱ 567         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      568         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou
      569         # re-apply format to the output
      570         for dataset in datasets:

/mnt/llm/ChatGLM3/.venv/lib/python3.11/site-packages/datasets/fingerprint.py:482 in wrapper

      479
      480             # Call actual function
      481
    ❱ 482             out = func(dataset, *args, **kwargs)
      483
      484             # Update fingerprint of in-place transforms + update in-place history of tra
      485

/mnt/llm/ChatGLM3/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:3898 in select

      3895                 counter_from_start = itertools.count(start=start)
      3896                 if all(i == j for i, j in zip(indices, counter_from_start)):
      3897                     length = next(counter_from_start) - start
    ❱ 3898                     return self._select_contiguous(start, length, new_fingerprint=new_fi
      3899
      3900         # If not contiguous, we need to create a new indices mapping
      3901         return self._select_with_indices_mapping(

/mnt/llm/ChatGLM3/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:567 in wrapper

      564             "output_all_columns": self._output_all_columns,
      565         }
      566         # apply actual function
    ❱ 567         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      568         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou
      569         # re-apply format to the output
      570         for dataset in datasets:

/mnt/llm/ChatGLM3/.venv/lib/python3.11/site-packages/datasets/fingerprint.py:482 in wrapper

      479
      480             # Call actual function
      481
    ❱ 482             out = func(dataset, *args, **kwargs)
      483
      484             # Update fingerprint of in-place transforms + update in-place history of tra
      485

/mnt/llm/ChatGLM3/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:3948 in _select_contiguous

      3945             return self
      3946
      3947         _check_valid_indices_value(start, len(self))
    ❱ 3948         _check_valid_indices_value(start + length - 1, len(self))
      3949         if self._indices is None or length == 0:
      3950             return Dataset(
      3951                 self.data.slice(start, length),

/mnt/llm/ChatGLM3/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py:659 in _check_valid_indices_value

      656
      657 def _check_valid_indices_value(index, size):
      658     if (index < 0 and index + size < 0) or (index >= size):
    ❱ 659         raise IndexError(f"Index {index} out of range for dataset of size {size}.")
      660
      661
      662 class NonExistentDatasetError(Exception):

IndexError: Index 49 out of range for dataset of size 5.

Expected behavior

Fine-tuning should run normally.

zzlTim commented 1 month ago

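This error means your eval set is too small: it has fewer than 50 examples (5, according to the error message), while the code selects the first 50. You can either add more eval examples or go into the code and change that 50 to a smaller number.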
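For the second option, rather than hard-coding a smaller number, the selection could be capped at the size of the eval set. Below is a minimal sketch of that idea in plain Python. The helper name `eval_subset_indices` is hypothetical and not part of finetune_hf.py, and the commented-out call at the end is only an assumption about how it might be wired into the `eval_dataset=` argument shown in the traceback.

```python
def eval_subset_indices(dataset_size: int, max_samples: int = 50) -> list[int]:
    """Return indices for at most `max_samples` rows, never more than the dataset holds."""
    if dataset_size < 0 or max_samples < 0:
        raise ValueError("dataset_size and max_samples must be non-negative")
    # min() keeps the largest requested index below dataset_size,
    # which is the condition the IndexError in the traceback enforces.
    return list(range(min(max_samples, dataset_size)))


# With the 5-example eval set from the error message, only indices 0..4 are requested.
print(eval_subset_indices(5))  # [0, 1, 2, 3, 4]

# Hypothetical use in finetune_hf.py (not run here, it needs the real dataset):
# eval_dataset=val_dataset.select(eval_subset_indices(len(val_dataset))),
```

With 50 or more eval examples this behaves like the original `list(range(50))`, so it should only change anything for small eval sets.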