PaddlePaddle / Knover

Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle
Apache License 2.0

Training with single_gpu fails with TypeError: __new__() got multiple values for argument 'data_id' #171

Closed what-is-perfect closed 1 year ago

what-is-perfect commented 1 year ago

I only have one GPU, so I train with Knover/scripts/single_gpu/train.sh. Training runs normally during epoch 1, but at some point an error is raised while reading data: Thread-1, which reads the data, reports that data_id received multiple values. The data reading code is in "Knover/knover/data/dialog_reader.py". The error output is as follows:

    [train][1] progress: 1/1 step: 6524, time: 0.753, queue size: 64, speed: 1.327 steps/s
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
        self.run()
      File "/usr/lib/python3.7/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/reader.py", line 1442, in __thread_main__
        six.reraise(*sys.exc_info())
      File "/usr/local/lib/python3.7/dist-packages/six.py", line 719, in reraise
        raise value
      File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/reader.py", line 1422, in __thread_main__
        for tensors in self._tensor_reader():
      File "/root/Knover/knover/core/model.py", line 382, in __wrapper__
        for batch in generator():
      File "/root/Knover/knover/data/dialog_reader.py", line 578, in __wrapper__
        for batch in batch_reader():
      File "/root/Knover/knover/data/dialog_reader.py", line 517, in __wrapper__
        for batch in batch_reader():
      File "/root/Knover/knover/data/dialog_reader.py", line 483, in __wrapper__
        for record in reader():
      File "/root/Knover/knover/data/dialog_reader.py", line 397, in __wrapper__
        for record in file_reader():
      File "/root/Knover/knover/data/dialog_reader.py", line 372, in __wrapper__
        for example in gen_examples():
      File "/root/Knover/knover/data/dialog_reader.py", line 339, in __wrapper__
        example = Example(*line, data_id=self.data_id)
    TypeError: __new__() got multiple values for argument 'data_id'

    Traceback (most recent call last):
      File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/root/Knover/knover/scripts/train.py", line 319, in <module>
        train(args)
      File "/root/Knover/knover/scripts/train.py", line 146, in train
        for step, data in enumerate(train_generator(), args.start_step + 1):
      File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/reader.py", line 1398, in __next__
        return self._reader.read_next()
    SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
      [Hint: Expected killed != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)

        current lr: 0.0000392
        lm_loss: 3.1958, ppl: 24.4298, loss: 3.1958
    [train][1] progress: 1/1 step: 6525, time: 0.824, queue size: 64, speed: 1.214 steps/s
        current lr: 0.0000392
        lm_loss: 3.2496, ppl: 25.7806, loss: 3.2496

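For now I have switched to distributed/train.sh for training. I did this because when I fine-tuned earlier, the single_gpu script raised an error while the distributed script completed the fine-tuning. So far no error has appeared with it.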

Does anyone know what causes this problem and how to fix it?

what-is-perfect commented 1 year ago

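I think I have found the cause and am testing it now. My training runs inside Docker, and because I did not set shm-size (shared memory) when creating the container, its shm-size is 64M. Reading data at high speed may lead to the error above. The shared memory size can be checked with df -h (the shm row). I increased it to 100 times the original size, and no error has occurred so far. For how to change it, see: https://blog.csdn.net/weixin_44966641/article/details/123930747. If you hit the same error, please wait until I confirm the verification before making this change.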

what-is-perfect commented 1 year ago

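Unfortunately, the shm-size setting was not the cause. Does anyone know what triggers this problem and how to solve it? I would appreciate any help. Thank you!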

sserdoubleh commented 1 year ago

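You could check the data format: see whether the number of \t separators on any line is abnormal, that is, whether the data itself contains \t characters.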
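To illustrate why an extra \t would produce this particular TypeError, here is a minimal sketch. It assumes the reader builds each record from a namedtuple and unpacks the tab-split columns positionally, as the traceback line `Example(*line, data_id=self.data_id)` suggests. The field names `src` and `tgt` and the helper `parse_line` are hypothetical, not taken from Knover.

```python
from collections import namedtuple

# Hypothetical stand-in for the reader's record type; the field names are assumptions.
Example = namedtuple("Example", ["src", "tgt", "data_id"])


def parse_line(raw_line, data_id):
    """Split one line on tabs and build an Example, mirroring the call in the traceback."""
    cols = raw_line.rstrip("\n").split("\t")
    return Example(*cols, data_id=data_id)


# A well-formed line has exactly two columns, so data_id is set only by keyword.
good = parse_line("hello\tworld\n", 0)

# A line with a stray tab has three columns: the third positional value
# lands in the data_id slot, and the keyword argument then collides with it.
try:
    parse_line("hello\tworld\textra\n", 1)
    message = None
except TypeError as err:
    message = str(err)
```

Under these assumptions, only the malformed line fails, which would match training running for thousands of steps before the reader thread dies.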

what-is-perfect commented 1 year ago

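I simulated the entire data loading process, and the data reading code itself is fine. Then, following @sserdoubleh's suggestion, I checked the data files. The cause is that the raw data includes a Chinese medical dialogue dataset, and some of the fields read from it by column still contain extra tab characters, which triggers the error above during training. The problem is now basically confirmed as solved. Thanks, @sserdoubleh!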
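For anyone who wants to run the same check on their own data, a small sketch along these lines can list the offending lines before training. The function name `find_bad_lines` and the expected field count of 2 are assumptions for illustration. Adjust the count to however many columns your training file is supposed to have.

```python
def find_bad_lines(lines, expected_fields):
    """Return (line_number, field_count) for every line whose number of
    tab-separated fields differs from expected_fields."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        field_count = len(line.rstrip("\n").split("\t"))
        if field_count != expected_fields:
            bad.append((lineno, field_count))
    return bad


# Small in-memory sample; the second line contains a stray tab inside the answer.
sample = [
    "q1\ta1\n",
    "q2\ta2 part one\ta2 part two\n",
    "q3\ta3\n",
]
bad_rows = find_bad_lines(sample, expected_fields=2)
```

The same function accepts an open file object, for example `find_bad_lines(open(path, encoding="utf-8"), 2)` with `path` pointing at your own training file. Replacing stray tabs inside a field with spaces before writing the file is one way to clean the lines it reports.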