Open · LumenScope opened this issue 6 months ago
Update: I changed the mapping function to

```python
def SYSTEM_map_fn(example):
    return {
        'conversation': [{
            'system': example['instruction'],
            'input': example['input'],
            'output': example['output'],
        }]
    }
```

but training on the custom dataset still fails with:
05/15 16:27:56 - mmengine - INFO - xtuner_dataset_timeout = 1:00:00
Generating train split: 52002 examples [00:00, 96581.19 examples/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/builder.py", line 2011, in _prepare_split_single
[rank0]: writer.write_table(table)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/arrow_writer.py", line 585, in write_table
[rank0]: pa_table = table_cast(pa_table, self._schema)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/table.py", line 2295, in table_cast
[rank0]: return cast_table_to_schema(table, schema)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/table.py", line 2249, in cast_table_to_schema
[rank0]: raise CastError(
[rank0]: datasets.table.CastError: Couldn't cast
[rank0]: submitTime: string
[rank0]: eval-key: list<item: null>
[rank0]: child 0, item: null
[rank0]: content: string
[rank0]: title: string
[rank0]: eval-title: list<item: null>
[rank0]: child 0, item: null
[rank0]: eval-senti: list<item: null>
[rank0]: child 0, item: null
[rank0]: eval: int64
[rank0]: createUser: string
[rank0]: replyContent: string
[rank0]: replyDeptName: string
[rank0]: mainDeptName: string
[rank0]: publicTime: string
[rank0]: id: int64
[rank0]: replyTime: string
[rank0]: to
[rank0]: {'instruction': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}
[rank0]: because column names don't match
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in <module>
[rank0]: main()
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 356, in main
[rank0]: runner.train()
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
[rank0]: self._train_loop = self.build_train_loop(
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
[rank0]: loop = LOOPS.build(
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
[rank0]: dataloader = runner.build_dataloader(
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
[rank0]: dataset = DATASETS.build(dataset_cfg)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 308, in process_hf_dataset
[rank0]: dataset = process(**kwargs)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 167, in process
[rank0]: dataset = build_origin_dataset(dataset, split)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 30, in build_origin_dataset
[rank0]: dataset = BUILDER.build(dataset)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/load.py", line 2609, in load_dataset
[rank0]: builder_instance.download_and_prepare(
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/builder.py", line 1027, in download_and_prepare
[rank0]: self._download_and_prepare(
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
[rank0]: self._prepare_split(split_generator, **prepare_split_kwargs)
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/builder.py", line 1882, in _prepare_split
[rank0]: for job_id, done, content in self._prepare_split_single(
[rank0]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/datasets/builder.py", line 2013, in _prepare_split_single
[rank0]: raise DatasetGenerationCastError.from_cast_error(
[rank0]: datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset
[rank0]: All the data files must have the same columns, but at some point there are 14 new columns (submitTime, eval-key, content, eval-title, eval-senti, title, eval, createUser, replyContent, replyDeptName, mainDeptName, publicTime, id, replyTime) and 4 missing columns (text, output, input, instruction).
[rank0]: This happened while the json dataset builder was generating data using
[rank0]: /work/tzz/xtuner/data/org/qa_train.json
[rank0]: Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
[rank1]:[E ProcessGroupGloo.cpp:144] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]:[E ProcessGroupGloo.cpp:144] Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank3]:[E ProcessGroupGloo.cpp:144] Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank1]: Traceback (most recent call last):
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in <module>
[rank1]: main()
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 356, in main
[rank1]: runner.train()
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
[rank1]: self._train_loop = self.build_train_loop(
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
[rank1]: loop = LOOPS.build(
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank1]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank1]: obj = obj_cls(**args) # type: ignore
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
[rank1]: dataloader = runner.build_dataloader(
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
[rank1]: dataset = DATASETS.build(dataset_cfg)
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank1]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank1]: obj = obj_cls(**args) # type: ignore
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 313, in process_hf_dataset
[rank1]: dist.monitored_barrier(group=group_gloo, timeout=xtuner_dataset_timeout)
[rank1]: File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3763, in monitored_barrier
[rank1]: return group_to_use.monitored_barrier(timeout, wait_all_ranks=wait_all_ranks)
[rank1]: RuntimeError: Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank1]: Original exception:
[rank1]: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.17.0.2]:19629
[rank2] / [rank3]: (tracebacks identical to rank 1's above, differing only in the rank number and peer port)
E0515 16:28:04.559000 140122573240128 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 43164) of binary: /root/miniconda3/envs/xtuner/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/xtuner/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/root/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-05-15_16:28:04
host : a19780ffc442
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 43165)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-15_16:28:04
host : a19780ffc442
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 43166)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-05-15_16:28:04
host : a19780ffc442
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 43167)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-15_16:28:04
host : a19780ffc442
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 43164)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@LumenScope First, about how the map fn is defined: an mmengine config cannot define a new function inside the config file itself; the function has to be imported. See
https://github.com/InternLM/xtuner/tree/main/examples/demo_data/multi_turn_2#config
Second, for a custom dataset you can run `xtuner check-custom-dataset $CONFIG`
to find out where the format is wrong.
Finally, you can run `xtuner log-dataset $CONFIG`
to inspect what the data looks like after conversion.
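A minimal sketch of the import-based approach described above, assuming the alpaca-style `instruction`/`input`/`output` fields seen in the traceback; the module and function names here are placeholders, not xtuner API:

```python
# my_map_fn.py: keep the mapping function in its own importable module,
# since an mmengine config file cannot define new functions inline.

def custom_map_fn(example):
    """Convert one alpaca-style record into xtuner's conversation format."""
    return {
        'conversation': [{
            'system': example['instruction'],
            'input': example['input'],
            'output': example['output'],
        }]
    }
```

The config would then import it (e.g. `from my_map_fn import custom_map_fn`) and pass it as the dataset map fn, following the pattern in the linked demo_data example.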
Code used to fetch the dataset locally:

Sample of the fetched JSONL:

Running the following script:

```shell
NPROC_PER_NODE=4 xtuner train /work/tzz/xtuner/config/qwen1_5_14b_chat_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero3
```

errors out.

After switching to my custom dataset: since I observed that the mapping does not use `text`, I did not add that field. It then errors out, reporting that the column names do not match, even though the mapping never appears to use `text`.

Using the original default dataset also raises an error.
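The cast error itself comes from the raw `qa_train.json` carrying its own 14 columns while the loader expects the alpaca schema. One workaround is to convert the file offline before training. This is only a sketch: the choice of which raw fields map to `instruction`/`input`/`output` (here `title`, `content`, and `replyContent`) is an assumption based purely on the column names in the error message:

```python
import json

# Hypothetical field mapping: 'title' as the instruction, 'content' as the
# user's query, and 'replyContent' as the desired answer. Adjust to the
# actual semantics of your dataset.
def to_alpaca(record):
    return {
        'instruction': record.get('title', ''),
        'input': record.get('content', ''),
        'output': record.get('replyContent', ''),
    }

def convert_file(src_path, dst_path):
    """Read a JSON array of raw records and write alpaca-style records."""
    with open(src_path, encoding='utf-8') as f:
        records = json.load(f)
    converted = [to_alpaca(r) for r in records]
    with open(dst_path, 'w', encoding='utf-8') as f:
        json.dump(converted, f, ensure_ascii=False, indent=2)
```

After conversion, every record shares the same three columns, so the json dataset builder no longer hits the schema-cast mismatch.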