NER_CRF task fails with a data shape mismatch error

Thove commented 2 years ago

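When asking a question, please provide the following information where possible:

Basic information

Operating system: Windows 10
Python version: 3.6
TensorFlow version: 1.14.0
Keras version: 2.3.1
bert4keras version: 0.11.4
Pure keras or tf.keras: pure keras
Pretrained model loaded: Chinese BERT, chinese_L-12_H-768_A-12

Core code

# Please paste your core code here.
# Keep only the key parts; do not paste the entire script.

Each item of the traindata passed to this generator has the form [text, labels].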

class data_generator(DataGenerator):
    """Data generator.
    """
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, d in self.sample(random):
            tokens = tokenizer.tokenize(d[0], maxlen=maxlen)
            tokens_ids = tokenizer.tokens_to_ids(tokens)
            d1 = d[1]
            d1.insert(0, 0)
            d1.insert(-1, 0)
            lab = np.array(d1)
            seg = [0]* len(tokens_ids)
            batch_token_ids.append(tokens_ids)
            batch_segment_ids.append(seg)
            batch_labels.append(lab)
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []
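
Separately from the alignment question raised in the replies, the two `insert` calls in the generator may not do what was intended. In Python, `list.insert(-1, x)` places `x` before the last element rather than at the end, and inserting into `d[1]` modifies the dataset in place, so the labels would grow again on every epoch. A minimal plain-Python sketch with made-up label values, not taken from the original data:

```python
# Illustrative label values only; no bert4keras needed.
labels = [1, 2, 3]

# What the generator does, shown on a copy:
wrong = list(labels)
wrong.insert(0, 0)
wrong.insert(-1, 0)  # inserts BEFORE the last element, not after it
print(wrong)         # [0, 1, 2, 0, 3]

# Building a new list pads both ends and leaves the original untouched.
right = [0] + labels + [0]
print(right)         # [0, 1, 2, 3, 0]
print(labels)        # [1, 2, 3]
```

Using `[0] + d[1] + [0]` inside the generator would avoid both effects, although it does not by itself fix a length mismatch between tokens and labels.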

Output

# Please paste your debug output here
Epoch 1/10

   1/4358 [..............................] - ETA: 20:45:01 - loss: 191.2775 - sparse_accuracy: 0.0365
   2/4358 [..............................] - ETA: 12:53:31 - loss: 150.3369 - sparse_accuracy: 0.1306
   3/4358 [..............................] - ETA: 10:30:37 - loss: 122.2665 - sparse_accuracy: 0.2740
Traceback (most recent call last):
  File "C:/Users/cypress/Desktop/nlp-master/nlp_induction_training/task4/preprosessing.py", line 256, in <module>
    epochs=epochs,
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1147, in fit
    initial_epoch=initial_epoch)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\keras\backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 850 values, but the requested shape has 840
     [[{{node loss/conditional_random_field_1_loss/sparse_loss/Reshape}}]]
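
One possible reading of the numbers in this error, assuming the batch size of 10 mentioned later in the post: 840 = 10 × 84 and 850 = 10 × 85, which would mean the token ids in that batch were padded to length 84 while the labels were padded to length 85. The sketch below reproduces that arithmetic with NumPy only. `pad_batch` is a hypothetical stand-in that pads to the longest sequence in the batch (which is how `sequence_padding` appears to behave by default), and the sequence lengths are invented for illustration.

```python
import numpy as np

def pad_batch(seqs, value=0):
    """Right-pad every sequence to the longest one in the batch (stand-in helper)."""
    length = max(len(s) for s in seqs)
    return np.array([list(s) + [value] * (length - len(s)) for s in seqs])

# Hypothetical batch of 10 samples: one sample has 84 token ids but 85 labels,
# the other nine are shorter and consistent.
token_ids = pad_batch([[1] * 84] + [[1] * 50] * 9)
labels = pad_batch([[0] * 85] + [[0] * 50] * 9)

print(token_ids.shape, token_ids.size)  # (10, 84) 840
print(labels.shape, labels.size)        # (10, 85) 850
```

Because each array is padded to its own longest sequence, a single sample whose label list is one element longer than its token list is enough to make the whole batch inconsistent. That would also explain why some batches train normally before one fails.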

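My own attempts

Whatever the problem, please try to solve it yourself first, and only ask if you still cannot solve it after making every effort. Please paste what you have tried here.

I changed very little of the original code, so my first assumption was that my preprocessing was at fault, and I modified the data generator several times. I also reduced batch_size from 32 to 10, truncated the input data to fit the maximum length of 512, switched from importing keras directly to using the keras in bert4keras.backend, and changed the size of the Dense layer, but none of this worked.

What puzzles me is why training runs for a few batches and only then raises the error. I also tried lowering the learning rate to 2e-6, which did not help either. From debugging, the three arrays my data generator yields for each batch always have exactly the same size.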

i4never commented 2 years ago

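It looks like d[0] is the input text and d[1] is the labels. Can you guarantee that the length of d[0] after tokenizing and converting to ids differs from the length of d[1] only by the 2 tokens at the head and tail?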
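
The check suggested here can be sketched as a small helper that scans the dataset before training. `find_misaligned` and `toy_tokenize` are hypothetical names introduced for illustration; the toy tokenizer only imitates the way a WordPiece tokenizer can merge a run of digits or Latin letters into one token. With bert4keras one would presumably pass something like `lambda t: tokenizer.tokenize(t, maxlen=maxlen)` instead.

```python
import re

def find_misaligned(samples, tokenize):
    """Return (index, n_tokens, n_labels) for samples where n_tokens != n_labels + 2."""
    bad = []
    for i, (text, labels) in enumerate(samples):
        n_tokens = len(tokenize(text))
        # The two extra tokens are the [CLS] and [SEP] markers.
        if n_tokens != len(labels) + 2:
            bad.append((i, n_tokens, len(labels)))
    return bad

def toy_tokenize(text):
    """Toy stand-in: one token per character, except alphanumeric runs stay together."""
    return ['[CLS]'] + re.findall(r'[A-Za-z0-9]+|\S', text) + ['[SEP]']

samples = [
    ("北京大学", [1, 2, 2, 2]),        # 4 characters, 4 labels
    ("在2019年", [0, 1, 2, 2, 2, 2]),  # 6 characters, 6 labels, but "2019" is one token
]
print(find_misaligned(samples, toy_tokenize))  # [(1, 5, 6)]
```

Any sample reported by such a check has per-character labels that no longer line up with the tokens, which is the situation the question above is asking about.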

Thove commented 2 years ago

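It looks like d[0] is the input text and d[1] is the labels. Can you guarantee that the length of d[0] after tokenizing and converting to ids differs from the length of d[1] only by the 2 tokens at the head and tail?

Thank you very much for your careful answer.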

Thove commented 2 years ago

That is exactly where the problem was.

bojone commented 2 years ago

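Your logic is backwards. The tokenizer comes first: you tokenize the input, and then you build the labels from the tokenization result. Are you expecting the tokenizer to align itself to the labels you supplied?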
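
The order described here (tokenize first, then derive the labels from the tokens) can be sketched as follows. The sketch assumes a token-to-character mapping is available in the form of one list of character indices per token, with an empty list for special tokens; as far as I know, `tokenizer.rematch(text, tokens)` in bert4keras returns a mapping of this kind, but that is an assumption here and the mapping below is written by hand. `labels_for_tokens` is a hypothetical helper, and taking the label of the first covered character is just one simple choice.

```python
def labels_for_tokens(mapping, char_labels, outside=0):
    """Give each token the label of the first character it covers.

    Tokens that cover no characters (e.g. [CLS], [SEP]) get the `outside` label.
    """
    return [char_labels[m[0]] if m else outside for m in mapping]

# Hand-written mapping for "在2019年" tokenized as [CLS] 在 2019 年 [SEP].
mapping = [[], [0], [1, 2, 3, 4], [5], []]
char_labels = [0, 1, 2, 2, 2, 2]  # one label per character

token_labels = labels_for_tokens(mapping, char_labels)
print(token_labels)  # [0, 0, 1, 2, 0]
```

Because the label list is built from the mapping, it always has exactly one entry per token, including the special tokens, so the padded token ids and labels of a batch share the same length.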
