Data preprocessing error - Githubissues

xuguozhi commented 1 year ago

Traceback (most recent call last):
  File "/home/Bloom-Lora/processor/processing.py", line 108, in <module>
    instruction_dataset = instruction_dataset.map(group_text,
  File "/usr/local/python3/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/python3/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/python3/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in iflatmap_unordered(
  File "/usr/local/python3/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
  File "/usr/local/python3/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
    [async_result.get() for async_result in async_results]
  File "/usr/local/python3/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
KeyError: 'target'

The belle dataset has no 'target' column.

Macielyoung commented 1 year ago

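The column names in the belle dataset have changed. Renaming the column with the rename_column method from datasets fixes this. See the documentation for details: https://huggingface.co/docs/datasets/process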

xuguozhi commented 1 year ago

I worked around it with a simple approach:

    instruction_dataset = concatenate_datasets([guanaco_dataset, alpaca_dataset])
    instruction_dataset = instruction_dataset.map(group_text, batched=True, batch_size=50, num_proc=10, remove_columns=instruction_dataset.column_names)
    instruction_dataset = concatenate_datasets([instruction_dataset, belle_dataset1, belle_dataset2])

xuguozhi commented 1 year ago

One more question: the processing here does not use a tokenizer. Compare with https://github.com/yanqiangmiffy/InstructGLM/blob/master/tokenize_dataset_rows.py#L12 What is the difference?

Macielyoung commented 1 year ago

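The preprocessing here mainly merges data from multiple sources and unifies it into a single format with instruction, input, and output fields. The tokenizer and prompter are applied later, in the trainer script.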

xuguozhi commented 1 year ago

OK, I took a look, thanks. Could you share some of the predict results?

Macielyoung commented 1 year ago

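The model is still training; the T4 card is a bit slow. In the intermediate results, long-text generation suffers from repetition, and the model sometimes produces confident-sounding nonsense. It does reasonably well on short-text classification and simple rewriting, though.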

xuguozhi commented 1 year ago

https://github.com/yangjianxin1/Firefly Could the data here be used as well?

Macielyoung commented 1 year ago

Yes, it can.

Macielyoung / Bloom-Lora

Data preprocessing error #3