Fine-tuning error: ValueError: 151337 is not in list

TTXS123OK commented 1 month ago

System Info

CUDA 12.2, transformers 4.40.0, Python 3.10.13, Ubuntu 20.04.6 LTS, single 4090 GPU

Who can help?

No response

Information

[X] The official example scripts
[ ] My own modified scripts

Reproduction

I have pulled the latest repo code: https://github.com/THUDM/GLM-4/commit/e1bc2691d4bac4047e0c335ce58ce9a3ebb3b100

1. Run python finetune.py /data /glm-4-9b-chat configs/ptuning_v2.yaml yes. This fails with:

    TypeError: Seq2SeqTrainingArguments.__init__() got an unexpected keyword argument 'eval_strategy'

   Commenting out the eval_strategy: steps line in ptuning_v2.yaml resolves this error.

2. Run python finetune.py /data /glm-4-9b-chat configs/ptuning_v2.yaml yes again. This time it fails with:

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Loading checkpoint shards: 100%|██████████| 10/10 [00:07<00:00, 1.31it/s]
    trainable params: 10,485,760 || all params: 9,410,437,120 || trainable%: 0.1114
    Generating train split: 140 examples [00:00, 31149.68 examples/s]
    Generating test split: 60 examples [00:00, 29984.30 examples/s]
    Map: 0%| | 0/140 [00:00<?, ? examples/s]
    Traceback (most recent call last):

    /output/GLM-4/finetune_demo/finetune.py:408 in main
        405      tokenizer, model = load_tokenizer_and_model(model_dir, peft_config=ft_config.peft_co
        406      data_manager = DataManager(data_dir, ft_config.data_config)
        407
      ❱ 408      train_dataset = data_manager.get_dataset(
        409          Split.TRAIN,
        410          functools.partial(
        411              process_batch,

    /output/GLM-4/finetune_demo/finetune.py:229 in get_dataset
        226              remove_columns = orig_dataset.column_names
        227          else:
        228              remove_columns = None
      ❱ 229          return orig_dataset.map(
        230              process_fn,
        231              batched=batched,
        232              remove_columns=remove_columns,

    /usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py:602 in wrapper
        599          else:
        600              self: "Dataset" = kwargs.pop("self")
        601          # apply actual function
      ❱ 602          out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
        603          datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou
        604          for dataset in datasets:
        605              # Remove task templates if a column mapping of the template is no longer val

    /usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py:567 in wrapper
        564              "output_all_columns": self._output_all_columns,
        565          }
        566          # apply actual function
      ❱ 567          out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
        568          datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou
        569          # re-apply format to the output
        570          for dataset in datasets:

    /usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py:3161 in map
        3158                      total=pbar_total,
        3159                      desc=desc or "Map",
        3160                  ) as pbar:
      ❱ 3161                      for rank, done, content in Dataset._map_single(**dataset_kwargs):
        3162                          if done:
        3163                              shards_done += 1
        3164                              logger.debug(f"Finished processing shard number {rank} of {n

    /usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py:3552 in _map_single
        3549                              range(*(slice(i, i + batch_size).indices(shard.num_rows)))
        3550                          )  # Something simpler?
        3551                          try:
      ❱ 3552                              batch = apply_function_on_filtered_inputs(
        3553                                  batch,
        3554                                  indices,
        3555                                  check_same_num_examples=len(shard.list_indexes()) > 0,

    /usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py:3421 in apply_function_on_filtered_inputs
        3418                  additional_args += (effective_indices,)
        3419              if with_rank:
        3420                  additional_args += (rank,)
      ❱ 3421              processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
        3422              if isinstance(processed_inputs, LazyDict):
        3423                  processed_inputs = {
        3424                      k: v for k, v in processed_inputs.data.items() if k not in processed

    /output/GLM-4/finetune_demo/finetune.py:266 in process_batch
        263              new_input_ids = tokenizer.apply_chat_template(conv, tokenize=True, return_di
        264              input_ids = new_input_ids
        265              loss_masks = [False] * len(input_ids)
      ❱ 266              last_assistant_index = len(input_ids) - input_ids[::-1].index(151337) - 1
        267              for j in range(last_assistant_index + 1, len(input_ids)):
        268                  loss_masks[j] = True
        269          else:

    ValueError: 151337 is not in list
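The failing statement at finetune.py:266 looks up the last occurrence of token ID 151337 with list.index, which raises ValueError whenever that ID is absent from input_ids. The sketch below reproduces that lookup on made-up token sequences and adds a guard, so a missing marker can be reported per sample instead of aborting the whole map. The helper name last_index_of and the sample IDs are hypothetical and are not part of finetune.py.

```python
from typing import List, Optional


def last_index_of(ids: List[int], token_id: int) -> Optional[int]:
    """Return the last position of token_id in ids, or None if it is absent."""
    try:
        # Same arithmetic as finetune.py:266: search the reversed list,
        # then convert the reversed position back to a forward index.
        return len(ids) - ids[::-1].index(token_id) - 1
    except ValueError:
        return None


MARKER_ID = 151337  # the ID searched for in the traceback above

with_marker = [11, MARKER_ID, 22, MARKER_ID, 33]  # made-up sample
without_marker = [11, 22, 33]                      # made-up sample

print(last_index_of(with_marker, MARKER_ID))
print(last_index_of(without_marker, MARKER_ID))
```

Printing the decoded sample when the helper returns None is one way to check whether the chat template actually emitted the marker for that conversation.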

Expected behavior

I hope this error can be resolved so that the fine-tuning code runs end to end.

Jade0321 commented 1 month ago

I ran into this too, but in my case the error is ValueError: 151339 is not in list.

zRzRzRzRzRzRzR commented 1 month ago

Update to transformers==4.42.4 and to our latest files on Hugging Face.

McRays commented 1 month ago

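> Update to transformers==4.42.4 and to our latest files on Hugging Face.

I hit this problem as well. I updated the code and config files from Hugging Face, and my transformers version is 4.43.3, but the error still occurs when I run it.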

jainelee666666 commented 1 month ago

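> > Update to transformers==4.42.4 and to our latest files on Hugging Face.
>
> I hit this problem as well. I updated the code and config files from Hugging Face, and my transformers version is 4.43.3, but the error still occurs when I run it.

Have you solved this yet? I am running into the same problem.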

jainelee666666 commented 1 month ago

I ran into this problem too. It does not occur when I fine-tune with LoRA.

OpEnD17 commented 1 month ago

The error TypeError: Seq2SeqTrainingArguments.__init__() got an unexpected keyword argument 'eval_strategy' can be fixed by upgrading transformers; version 4.42.4 works. Has the 151337 is not in list problem been solved?
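Besides upgrading, the keyword mismatch can be detected at runtime. Older transformers releases name this option evaluation_strategy, while the thread shows that 4.42.4 accepts eval_strategy. The sketch below inspects a class's __init__ signature to decide which name to use. The function eval_strategy_key and the two stand-in classes are hypothetical, so that the example runs without transformers installed.

```python
import inspect
from dataclasses import dataclass


def eval_strategy_key(args_cls: type) -> str:
    """Return the keyword that args_cls.__init__ accepts for the evaluation schedule."""
    params = inspect.signature(args_cls.__init__).parameters
    return "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"


# Stand-ins for an older and a newer Seq2SeqTrainingArguments (illustrative only).
@dataclass
class OldStyleArgs:
    evaluation_strategy: str = "no"


@dataclass
class NewStyleArgs:
    eval_strategy: str = "no"


print(eval_strategy_key(OldStyleArgs))
print(eval_strategy_key(NewStyleArgs))
```

With transformers installed, passing Seq2SeqTrainingArguments to the same function and renaming the key in the loaded YAML dictionary before constructing the arguments would avoid editing the config file by hand.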

znxd-wh commented 1 month ago

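Setting combine to false skips the code path that raises the error. With combine off, the loss is computed normally for every turn; with it on, only the last turn is counted. The only cost is somewhat lower training efficiency. With combine enabled, I could not get past the 151337 is not in list error no matter what I tried. I switched between several transformers versions and updated the model to the latest release, and none of it helped.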
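The difference described here, loss on every assistant turn versus loss on the last one only, can be illustrated with a toy loss mask. This is a sketch of the idea as stated in the comment, not the logic of finetune.py: the function loss_mask, the Turn type, and the token IDs are all made up for illustration.

```python
from typing import List, Tuple

Turn = Tuple[str, List[int]]  # (role, token IDs); toy values, not real GLM-4 IDs


def loss_mask(turns: List[Turn], last_turn_only: bool) -> List[bool]:
    """Return one flag per token: True if that token contributes to the loss."""
    assistant_turns = [i for i, (role, _) in enumerate(turns) if role == "assistant"]
    last_assistant = assistant_turns[-1] if assistant_turns else -1
    mask: List[bool] = []
    for i, (role, ids) in enumerate(turns):
        if last_turn_only:
            train = i == last_assistant   # only the final assistant reply
        else:
            train = role == "assistant"   # every assistant reply
        mask.extend([train] * len(ids))
    return mask


conversation = [
    ("user", [1, 2]),
    ("assistant", [3]),
    ("user", [4]),
    ("assistant", [5, 6]),
]

print(loss_mask(conversation, last_turn_only=True))
print(loss_mask(conversation, last_turn_only=False))
```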

zRzRzRzRzRzRzR commented 1 month ago

I'll look into this. My guess is that the user header is not inserted when combine is enabled. Thanks for your patience.

z8917749 commented 1 week ago

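In finetune.py, inside the if combine: branch, find these two lines:

    new_input_ids = tokenizer.apply_chat_template(conv, tokenize=True, return_dict=False)
    input_ids = new_input_ids

and change the second line to:

    input_ids = [item for sublist in new_input_ids for item in sublist]

Change only that line! It appears in two places, and both must be changed! Also, use transformers 4.40.0, otherwise eval will raise an error. If 4.40.0 reports the eval_strategy error, comment out the eval_strategy: steps line in lora.yaml.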
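This fix implies that new_input_ids arrives as a list of lists, in which case list.index(151337) compares the integer against inner lists and never matches. The list comprehension above would itself fail with TypeError if the value were already a flat list of integers, which may be why the fix is tied to one transformers version. The sketch below, assuming only that the value is either flat or nested one level deep, flattens in both cases. The helper name flatten_ids and the sample IDs are hypothetical.

```python
from typing import List, Union


def flatten_ids(ids: List[Union[int, List[int]]]) -> List[int]:
    """Flatten one level of nesting; leave an already flat list unchanged."""
    flat: List[int] = []
    for item in ids:
        if isinstance(item, list):
            flat.extend(item)   # nested case: [[a, b, c]] -> [a, b, c]
        else:
            flat.append(item)   # flat case: keep the integer as is
    return flat


nested = [[11, 151337, 22]]  # made-up nested result
flat = [11, 151337, 22]      # made-up flat result

print(flatten_ids(nested))
print(flatten_ids(flat))
print(flatten_ids(nested).index(151337))
```

Using such a helper in place of the comprehension keeps the lookup working regardless of which shape the tokenizer returns.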
