THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs
Apache License 2.0

Fine-tuning fails with IndexError: list index out of range #436

Closed shenkunlovecoding closed 2 months ago

shenkunlovecoding commented 2 months ago

System Info / 系統信息

CUDA 12.6, Windows 10 22H2, RTX 3090 Ti, Threadripper 3990X, miniconda

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

configs.zip

The training data was exported with memotrace and converted to jsonl via https://www.tojsonl.com/. Error log:

```
(DG-C) D:\workspace\DG-C\GLM-4\finetune_demo>python finetune.py ./data ./model ./configs/lora.yaml
Loading checkpoint shards: 100%|████████████████| 10/10 [00:01<00:00, 6.60it/s]
trainable params: 2,785,280 || all params: 9,402,736,640 || trainable%: 0.0296
Generating train split: 1728 examples [00:00, 162768.53 examples/s]
Generating validation split: 1728 examples [00:00, 192100.43 examples/s]
Generating test split: 1728 examples [00:00, 204889.39 examples/s]
Map:   0%|          | 0/1728 [00:00<?, ? examples/s]
╭─────────────── Traceback (most recent call last) ───────────────╮
│ D:\workspace\DG-C\GLM-4\finetune_demo\finetune.py:408 in main
│
│   405     tokenizer, model = load_tokenizer_and_model(model_dir, peft_config=ft_config.peft_co
│   406     data_manager = DataManager(data_dir, ft_config.data_config)
│   407
│ ❱ 408     train_dataset = data_manager.get_dataset(
│   409         Split.TRAIN,
│   410         functools.partial(
│   411             process_batch,
│
│ D:\workspace\DG-C\GLM-4\finetune_demo\finetune.py:229 in get_dataset
│
│   226             remove_columns = orig_dataset.column_names
│   227         else:
│   228             remove_columns = None
│ ❱ 229         return orig_dataset.map(
│   230             process_fn,
│   231             batched=batched,
│   232             remove_columns=remove_columns,
│
│ C:\Users\shen\miniconda3\envs\DG-C\Lib\site-packages\datasets\arrow_dataset.py:602 in wrapper
│
│   599         else:
│   600             self: "Dataset" = kwargs.pop("self")
│   601         # apply actual function
│ ❱ 602         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│   603         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou
│   604         for dataset in datasets:
│   605             # Remove task templates if a column mapping of the template is no longer val
│
│ C:\Users\shen\miniconda3\envs\DG-C\Lib\site-packages\datasets\arrow_dataset.py:567 in wrapper
│
│   564             "output_all_columns": self._output_all_columns,
│   565         }
│   566         # apply actual function
│ ❱ 567         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│   568         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou
│   569         # re-apply format to the output
│   570         for dataset in datasets:
│
│ C:\Users\shen\miniconda3\envs\DG-C\Lib\site-packages\datasets\arrow_dataset.py:3161 in map
│
│   3158                     total=pbar_total,
│   3159                     desc=desc or "Map",
│   3160                 ) as pbar:
│ ❱ 3161                     for rank, done, content in Dataset._map_single(**dataset_kwargs):
│   3162                         if done:
│   3163                             shards_done += 1
│   3164                             logger.debug(f"Finished processing shard number {rank} of {n
│
│ C:\Users\shen\miniconda3\envs\DG-C\Lib\site-packages\datasets\arrow_dataset.py:3552 in _map_single
│
│   3549                             range(*(slice(i, i + batch_size).indices(shard.num_rows)))
│   3550                         )  # Something simpler?
│   3551                         try:
│ ❱ 3552                             batch = apply_function_on_filtered_inputs(
│   3553                                 batch,
│   3554                                 indices,
│   3555                                 check_same_num_examples=len(shard.list_indexes()) > 0,
│
│ C:\Users\shen\miniconda3\envs\DG-C\Lib\site-packages\datasets\arrow_dataset.py:3421 in apply_function_on_filtered_inputs
│
│   3418                 additional_args += (effective_indices,)
│   3419             if with_rank:
│   3420                 additional_args += (rank,)
│ ❱ 3421             processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
│   3422             if isinstance(processed_inputs, LazyDict):
│   3423                 processed_inputs = {
│   3424                     k: v for k, v in processed_inputs.data.items() if k not in processed
│
│ D:\workspace\DG-C\GLM-4\finetune_demo\finetune.py:263 in process_batch
│
│   260         input_ids = [151331, 151333]
│   261         loss_masks = [False, False]
│   262         if combine:
│ ❱ 263             new_input_ids = tokenizer.apply_chat_template(conv, tokenize=True, return_di
│   264             input_ids = new_input_ids
│   265             loss_masks = [False] * len(input_ids)
│   266             last_assistant_index = len(input_ids) - input_ids[::-1].index(151337) - 1
│
│ C:\Users\shen\miniconda3\envs\DG-C\Lib\site-packages\transformers\tokenization_utils_base.py:1786 in apply_chat_template
│
│   1783         compiled_template = self._compile_jinja_template(chat_template)
│   1784
│   1785         if isinstance(conversation, (list, tuple)) and (
│ ❱ 1786             isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "mess
│   1787         ):
│   1788             conversations = conversation
│   1789             is_batched = True
╰──────────────────────────────────────────────────────────────────╯
IndexError: list index out of range
```
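The IndexError is raised at `conversation[0]` inside `apply_chat_template`, i.e. `conv` was an empty list for at least one sample, which points at the converted jsonl rather than the trainer. A minimal sanity check over the data, assuming the demo's `{"messages": [...]}` schema and a hypothetical `data/train.jsonl` path (adjust both to your `data_config`):

```python
# Flag jsonl samples whose conversation would reach apply_chat_template empty.
# The path and the {"messages": [...]} schema are assumptions; adjust to your
# data_config in lora.yaml.
import json

with open("data/train.jsonl", encoding="utf-8") as f:
    for lineno, raw in enumerate(f, 1):
        raw = raw.strip()
        if not raw:
            print(f"line {lineno}: blank line")
            continue
        messages = json.loads(raw).get("messages")
        if not messages:
            # An empty conversation reaches apply_chat_template as [],
            # and conversation[0] then raises IndexError.
            print(f"line {lineno}: empty or missing 'messages'")
```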

Expected behavior / 期待表现

Fine-tuning completes normally.

shenkunlovecoding commented 2 months ago

For some reason, turning off combine in the config file lets fine-tuning run normally.
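For anyone hitting the same thing, the workaround is the `combine` flag in the fine-tuning yaml. A sketch of the change; the exact key location may differ across GLM-4 versions, so check your own `configs/lora.yaml`:

```yaml
# configs/lora.yaml (sketch; verify the key location in your version)
combine: false   # was true; see the maintainer's explanation below
```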

zRzRzRzRzRzRzR commented 2 months ago

With combine turned off, loss is computed normally for every turn; with it on, only the last turn contributes to the loss.
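In other words, roughly the following masking behavior. This is a simplified sketch, not the demo's exact code; token ids 151336/151337 are GLM-4's `<|user|>`/`<|assistant|>` markers as seen in the traceback above, but verify them for your checkpoint:

```python
# Simplified illustration of the combine switch in process_batch
# (not the exact demo code).
ASSISTANT = 151337  # <|assistant|> marker (per the traceback; verify)
USER = 151336       # <|user|> marker (assumed from GLM-4's tokenizer config)

def loss_mask_combine_on(input_ids: list[int]) -> list[bool]:
    """combine: true -- only tokens after the LAST assistant marker get loss."""
    last = len(input_ids) - 1 - input_ids[::-1].index(ASSISTANT)
    return [i > last for i in range(len(input_ids))]

def loss_mask_combine_off(input_ids: list[int]) -> list[bool]:
    """combine: false -- tokens after EVERY assistant marker get loss."""
    mask, in_reply = [], False
    for tok in input_ids:
        if tok == ASSISTANT:
            in_reply = True
            mask.append(False)  # the marker itself carries no loss
        elif tok == USER:
            in_reply = False
            mask.append(False)
        else:
            mask.append(in_reply)
    return mask

# Toy sequence: BOS pair, then user/assistant turns (content ids made up).
ids = [151331, 151333, USER, 11, 12, ASSISTANT, 21, 22, USER, 31, ASSISTANT, 41]
print(loss_mask_combine_on(ids))   # loss only on the final reply (token 41)
print(loss_mask_combine_off(ids))  # loss on both replies (21, 22 and 41)
```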