bojone / bert4keras

keras implement of transformers for humans
https://kexue.fm/archives/6915
Apache License 2.0
5.37k stars 926 forks

NER_CRF task raises a data shape mismatch error #491

Open Thove opened 2 years ago

Thove commented 2 years ago

When asking a question, please provide as much of the following information as possible:

Basic information

Core code

# Paste your core code here.
# Keep only the key parts; don't mindlessly paste everything.

The train data fed into this generator has the shape [text, labels].

class data_generator(DataGenerator):
    """Data generator
    """
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, d in self.sample(random):
            tokens = tokenizer.tokenize(d[0], maxlen=maxlen)
            tokens_ids = tokenizer.tokens_to_ids(tokens)
            d1 = d[1]
            d1.insert(0, 0)   # label slot for [CLS]; note: d1 aliases d[1], so this mutates the stored sample every epoch
            d1.insert(-1, 0)  # intended slot for [SEP], but insert(-1, ...) lands *before* the last label, not after it
            lab = np.array(d1)
            seg = [0]* len(tokens_ids)
            batch_token_ids.append(tokens_ids)
            batch_segment_ids.append(seg)
            batch_labels.append(lab)
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []
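Before touching the model, a quick alignment audit over the raw samples would pinpoint exactly which ones break. A minimal, self-contained sketch (`find_misaligned` and `toy_len` are hypothetical helpers, not bert4keras API):

```python
# Hypothetical pre-flight check: for each (text, labels) sample, the token
# count (including [CLS]/[SEP]) must equal the label count plus the two
# special-token slots the generator adds.
def find_misaligned(samples, tokenize_len):
    """Return the indices of samples whose label length doesn't line up
    with the tokenized length. `tokenize_len` maps text -> token count
    including [CLS]/[SEP] (e.g. len(tokenizer.tokenize(text)))."""
    return [i for i, (text, labels) in enumerate(samples)
            if tokenize_len(text) != len(labels) + 2]

# Toy stand-in for a tokenizer: one token per character, plus [CLS]/[SEP].
toy_len = lambda text: len(text) + 2
samples = [("abc", [0, 0, 0]),   # 3 chars, 3 labels -> aligned
           ("a b", [0, 0])]      # 3 chars, 2 labels -> misaligned
find_misaligned(samples, toy_len)  # → [1]
```

Running a check like this over the whole dataset before training would have flagged the bad samples long before Keras hit the `Reshape` error.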

Output

# Paste your debug output here.
Epoch 1/10

   1/4358 [..............................] - ETA: 20:45:01 - loss: 191.2775 - sparse_accuracy: 0.0365
   2/4358 [..............................] - ETA: 12:53:31 - loss: 150.3369 - sparse_accuracy: 0.1306
   3/4358 [..............................] - ETA: 10:30:37 - loss: 122.2665 - sparse_accuracy: 0.2740
Traceback (most recent call last):
  File "C:/Users/cypress/Desktop/nlp-master/nlp_induction_training/task4/preprosessing.py", line 256, in <module>
    epochs=epochs,
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1147, in fit
    initial_epoch=initial_epoch)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\keras\backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 850 values, but the requested shape has 840
     [[{{node loss/conditional_random_field_1_loss/sparse_loss/Reshape}}]]

My own attempts

Whatever the problem, please try to solve it yourself first; only ask after "every effort" has still failed. Paste your debugging process here.

I barely changed the original code, so my first suspicion was my preprocessing, and I rewrote the data generator several times. I also reduced batch_size from 32 to 10, truncated the inputs to fit the maximum length of 512, switched the keras import to the one from bert4keras.backend, and changed the size of the Dense layer, but all of these failed. What puzzles me most is why training runs fine for a few batches and only crashes afterwards. I also tried lowering the learning rate to 2e-6, which didn't help either. After debugging, the three arrays my data generator yields each time always have perfectly matching sizes.

i4never commented 2 years ago

It looks like d[0] is the input text and d[1] the labels. Can you guarantee that, after d[0] is tokenized and converted to ids, its length differs from d[1] by only the 2 head/tail tokens?
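The mismatch i4never suspects is easy to reproduce in isolation: BERT-style WordPiece can split one labelled unit into several sub-tokens, so a label sequence built per word or per character stops lining up with the token ids. A toy illustration, not using bert4keras (`toy_wordpiece` only mimics the greedy longest-match-first rule):

```python
# Toy WordPiece: greedily match the longest vocab piece, prefixing
# non-initial pieces with "##", as BERT's tokenizer does.
def toy_wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:          # no piece matched
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"play", "##ing", "field"}
words = ["playing", "field"]                  # 2 words -> 2 labels
labels = ["O", "B-LOC"]
tokens = [p for w in words for p in toy_wordpiece(w, vocab)]
# tokens == ["play", "##ing", "field"]: 3 tokens vs 2 labels,
# so even after adding [CLS]/[SEP] slots the shapes cannot match.
```

The 850-vs-840 error in the traceback is exactly this effect accumulated over a batch: ten extra sub-tokens somewhere in the batch's texts that the label arrays know nothing about.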

Thove commented 2 years ago

It looks like d[0] is the input text and d[1] the labels. Can you guarantee that, after d[0] is tokenized and converted to ids, its length differs from d[1] by only the 2 head/tail tokens?

Thank you very much for your careful answer.

Thove commented 2 years ago

That was exactly the problem.

bojone commented 2 years ago

Your logic is wrong from the start: first you have the tokenizer, then you tokenize the input, and then you build the labels from the tokenization result. Do you really expect the tokenizer to align itself to the labels you hand it?
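bojone's point, sketched: tokenize first, then derive one label per resulting token. The sketch below assumes character-level labels plus a token-to-character-span mapping of the kind bert4keras's tokenizer.rematch produces (empty spans for [CLS]/[SEP]); `labels_from_mapping` is a hypothetical helper, not library API:

```python
# Build labels *from* the tokenization result: one label per token, taking
# the label of each token's first character; special tokens ([CLS]/[SEP])
# have empty character spans and get the 'outside' label.
def labels_from_mapping(char_labels, token_spans, outside=0):
    token_labels = []
    for span in token_spans:
        if not span:                          # [CLS] or [SEP]
            token_labels.append(outside)
        else:
            token_labels.append(char_labels[span[0]])
    return token_labels

# e.g. text "北京大学" labelled B-ORG I-ORG I-ORG I-ORG as 1, 2, 2, 2,
# tokenized into [CLS], 北, 京, 大学, [SEP] (the last word merges two chars):
char_labels = [1, 2, 2, 2]
token_spans = [[], [0], [1], [2, 3], []]
labels_from_mapping(char_labels, token_spans)  # → [0, 1, 2, 2, 0]
```

Because the labels are derived after tokenization, their length matches the token ids by construction, and the CRF loss's reshape can never see mismatched shapes.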