OpenBMB / CPM-Bee

A Chinese-English bilingual foundation model with ten billion parameters
2.68k stars, 211 forks

Pre-training data format #83

Open ScienGU opened 1 year ago

ScienGU commented 1 year ago

I ran the pretrain_cpm_bee.sh script after changing the dataset argument to point to datasets.json:

[
    {
        "dataset_name": "pretrain",
        "task_name": "mlm",
        "weight": 1.0,
        "path": "/home/litao/ScienGU/CPM-Bee/sciengu/zhinan/bin_data",
        "transforms": [
            {
                "answer": "$answer",
                "document": "$source"
            },
            {
                "answer": "$answer",
                "query": "$source"
            },
            {
                "answer": "$answer",
                "input": "$source"
            }
        ]
    }
]

I changed the path in it so that it points to my own data. I don't really understand the transforms field and would appreciate an explanation.

Here is a sample of the data I used:

{"answer": "当前现代医学的主要治疗甲状腺药物", "input": "当前现代医学的主要治疗甲状腺药物"}

Here is the error output:

Traceback (most recent call last):
  File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 932, in _mixed_dataset_process
    batch = packer.add_data(config[ds_id])
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 638, in add_data
    ) = self.build_instance(config)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 439, in build_instance
    inp = ds.read()
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/dataset/distributed_dataset.py", line 554, in read
    next_block_id = self._get_next_block()
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/dataset/distributed_dataset.py", line 394, in _get_next_block
    raise RuntimeError("Empty dataset {}".format(self._path))
RuntimeError: Empty dataset /home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/sciengu/zhinan/bin_data
Process Process-1:
Traceback (most recent call last):
  File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/share/wuhkjdxue30509/home/wust30509/.conda/envs/bmtrain/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 932, in _mixed_dataset_process
    batch = packer.add_data(config[ds_id])
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 638, in add_data
    ) = self.build_instance(config)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 440, in build_instance
    inp = self.apply_transform(inp, transform)
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 344, in apply_transform
    _expand_mapping(data, [], src[1:].split("."), tgt.split("."))
  File "/home/share/wuhkjdxue30509/home/wust30509/CPM-Bee/src/cpm_live/training_tasks/bee/pretrain.py", line 338, in _expand_mapping
    _expand_mapping(data[path[0]], stars, path[1:], target)
KeyError: 'source'

fengcai24 commented 1 year ago

Hi, did you manage to get this running?

ScienGU commented 1 year ago

No, nobody has replied.

gongbaitao commented 1 year ago

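When you run preprocess_dataset.py, you need to set block_size to a smaller value in build_dataset and shuffle_dataset, or else enlarge your dataset.

transforms is used to transform the data: {"document": "$source"} means that the "source" field of the raw data is substituted into the "document" field.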
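To make the transforms explanation concrete, here is a minimal sketch of how a "$field" reference could be resolved against one record. It is a simplified illustration, not the CPM-Bee implementation: the function name apply_simple_transform and the sample values are hypothetical, only flat top-level fields are handled (the traceback shows the real code splitting paths on "."), and the handling of literal values is an assumption.

```python
from typing import Any, Dict


def apply_simple_transform(record: Dict[str, Any], transform: Dict[str, str]) -> Dict[str, Any]:
    """Build a new record from `transform`, resolving "$field" references.

    Hypothetical helper for illustration; flat fields only.
    """
    out: Dict[str, Any] = {}
    for target, source in transform.items():
        if isinstance(source, str) and source.startswith("$"):
            # "$source" means: look up record["source"]; KeyError if absent.
            out[target] = record[source[1:]]
        else:
            # Assumption: anything else is copied as a literal value.
            out[target] = source
    return out


# A record shaped like the sample in this issue: it has "answer" and "input",
# but no "source" field.
record = {"answer": "text A", "input": "text B"}

# This transform references "$source", which the record lacks.
try:
    apply_simple_transform(record, {"answer": "$answer", "document": "$source"})
    missing = None
except KeyError as err:
    missing = err.args[0]
print("missing field:", missing)

# Referencing a field that does exist resolves without error.
fixed = apply_simple_transform(record, {"answer": "$answer", "document": "$input"})
print(fixed)
```

Under these assumptions, the KeyError: 'source' in the traceback would be consistent with the data having an "input" field while every transform references "$source". Either renaming the field in the data or referencing the existing field name in transforms would probably avoid it.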

nasame commented 11 months ago

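> When you run preprocess_dataset.py, you need to set block_size to a smaller value in build_dataset and shuffle_dataset, or else enlarge your dataset. transforms is used to transform the data: {"document": "$source"} means that the "source" field of the raw data is substituted into the "document" field.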

The reply above is correct; I have verified that it works. In cpm_live/dataset/distributed_dataset.py, set DEFAULT_BLOCK_SIZE = 16 << 10.
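As a rough illustration of why a smaller block size can help with a small dataset, the sketch below counts blocks and spreads them over workers. It rests on assumptions not confirmed in this thread: that the dataset is cut into fixed-size blocks, that blocks are handed to workers round-robin, and that a worker with no blocks reports "Empty dataset". The 100 KiB dataset size, the 4 workers, and the 16 << 20 comparison value are hypothetical.

```python
def num_blocks(dataset_bytes: int, block_size: int) -> int:
    """Number of fixed-size blocks needed to hold the dataset (ceiling division)."""
    return (dataset_bytes + block_size - 1) // block_size


def blocks_per_worker(n_blocks: int, world_size: int) -> list:
    """Blocks each worker receives under an assumed round-robin assignment."""
    return [len(range(rank, n_blocks, world_size)) for rank in range(world_size)]


SMALL_BLOCK = 16 << 10   # value suggested in this thread: 16384 bytes (16 KiB)
LARGE_BLOCK = 16 << 20   # hypothetical larger value for comparison (16 MiB)

dataset_bytes = 100 * 1024  # hypothetical small dataset of 100 KiB
world_size = 4              # hypothetical number of workers

for block_size in (LARGE_BLOCK, SMALL_BLOCK):
    n = num_blocks(dataset_bytes, block_size)
    print(block_size, n, blocks_per_worker(n, world_size))
```

With the larger block size the whole dataset fits in a single block, so in this model three of the four workers receive nothing; with 16 << 10 every worker gets at least one block.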