bojone / bert4keras

keras implement of transformers for humans
https://kexue.fm/archives/6915
Apache License 2.0

Classification loss is nan #164

Open · xuxuanbo opened this issue 4 years ago

xuxuanbo commented 4 years ago

I've been stuck on this for a long time; hoping 苏神 can enlighten me!

### Basic info

### Core code

```python
# Imports assumed from the standard bert4keras examples; load_data,
# split_dataset, data_generator, Evaluator and evaluate are my own helpers.
from bert4keras.backend import keras
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.optimizers import Adam, extend_with_piecewise_linear_lr
from keras.layers import Dense, Lambda

DataSet = load_data()
train_data, valid_data, test_data = split_dataset(DataSet.data)
train_data = train_data.dropna()  # dropna() is not in-place; keep its result
tokenizer = Tokenizer(dict_path, do_lower_case=True)

bert = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    model=config['BERT_CONFIG']['model'],
    return_keras_model=False,
)

# classify on the [CLS] vector
output = Lambda(lambda x: x[:, 0], name='CLS-token')(bert.model.output)
output = Dense(
    units=num_classes,
    activation='softmax',
    kernel_initializer=bert.initializer
)(output)
output = Lambda(lambda x: x + 1e-8)(output)  # guard against log(0) in the loss
model = keras.models.Model(bert.model.input, output)
model.summary()

# Adam with a piecewise linear learning-rate schedule (warmup + decay)
AdamLR = extend_with_piecewise_linear_lr(Adam, name='AdamLR')

model.compile(
    loss='sparse_categorical_crossentropy',
    # optimizer=Adam(1e-5),  # also tried plain Adam
    optimizer=AdamLR(lr=1e-4, lr_schedule={
        1000: 1,    # ramp up to the full lr over the first 1000 steps
        2000: 0.1   # then anneal to 0.1 * lr by step 2000
    }),
    # optimizer='sgd',       # and plain SGD
    metrics=['accuracy'],
)

train_generator = data_generator(train_data, batch_size)
valid_generator = data_generator(valid_data, batch_size)
test_generator = data_generator(test_data, batch_size)
evaluator = Evaluator()
model.fit_generator(
    train_generator.forfit(),
    steps_per_epoch=len(train_generator),
    epochs=20,
    callbacks=[evaluator]
)

model.load_weights('best_model.weights')
print(u'final test acc: %05f\n' % evaluate(test_generator))
```
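A more numerically stable alternative to the `x + 1e-8` guard (a sketch, reusing the names from the snippet above) is to drop the softmax, output raw logits, and let the loss apply a fused log-softmax via `from_logits=True`:

```python
from keras import backend as K

def sparse_ce_from_logits(y_true, y_pred):
    # fused log-softmax + cross-entropy; avoids an explicit log(softmax(x))
    return K.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)

cls_vec = Lambda(lambda x: x[:, 0], name='CLS-token')(bert.model.output)
logits = Dense(units=num_classes, kernel_initializer=bert.initializer)(cls_vec)
model = keras.models.Model(bert.model.input, logits)
model.compile(
    loss=sparse_ce_from_logits,
    optimizer=AdamLR(lr=1e-4, lr_schedule={1000: 1, 2000: 0.1}),
    metrics=['sparse_categorical_accuracy'],  # argmax works on logits directly
)
```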

### Output
```shell
With around 22,000 samples, batch_size=32, lr=1e-4, the log is as follows (when batch_size is made very small and the learning rate is also very small, the loss of the first few batches is not nan):
__________________________________________________________________________________________________
Epoch 1/20

  1/351 [..............................] - ETA: 35:48 - loss: 3.2497 - acc: 0.0000e+00
  3/351 [..............................] - ETA: 12:05 - loss: 3.2328 - acc: 0.0000e+00
  4/351 [..............................] - ETA: 9:09 - loss: 3.2325 - acc: 0.0000e+00 
  5/351 [..............................] - ETA: 7:22 - loss: 3.2265 - acc: 0.0000e+00
  7/351 [..............................] - ETA: 5:19 - loss: 3.2272 - acc: 0.0179    
  8/351 [..............................] - ETA: 4:41 - loss: 3.2187 - acc: 0.0312
 10/351 [..............................] - ETA: 3:46 - loss: 3.2213 - acc: 0.0281
 11/351 [..............................] - ETA: 3:26 - loss: 3.2241 - acc: 0.0284
With around 37,000 samples, batch_size=32, lr=1e-4:
__________________________________________________________________________________________________
Epoch 1/20

  1/582 [..............................] - ETA: 1:05:03 - loss: nan - acc: 0.0938
  2/582 [..............................] - ETA: 32:52 - loss: nan - acc: 0.0938  
  3/582 [..............................] - ETA: 22:09 - loss: nan - acc: 0.0833
  4/582 [..............................] - ETA: 16:45 - loss: nan - acc: 0.0938
  6/582 [..............................] - ETA: 11:16 - loss: nan - acc: 0.0781
  7/582 [..............................] - ETA: 9:43 - loss: nan - acc: 0.0759
```

### My own attempts

On top of this I ran several groups of experiments to track down the cause. Since my experimental reasoning may not be correct, I'm attaching a description of the steps.

bojone commented 4 years ago

Have you gone through all the data and confirmed that every label falls within [0, num_classes)?
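For example, something along these lines (assuming the labels sit in a pandas column named `label`; adjust to your schema):

```python
import numpy as np

labels = train_data['label']                   # hypothetical column name
print('NaN labels:', labels.isna().sum())      # dropna() only helps if its result is kept
values = labels.dropna().astype(int).to_numpy()
bad = values[(values < 0) | (values >= num_classes)]
print('out-of-range labels:', np.unique(bad))  # must be empty
```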

xuxuanbo commented 4 years ago

I have. The added data was produced purely by duplication with pandas and numpy.

bojone commented 4 years ago

Could you share a reproducible copy of the code and data? If so, send it to my email and I'll debug it.

xuxuanbo commented 4 years ago

Thanks, 苏神! Unfortunately it's company code and data, so I can't share it. I'll keep debugging, and if I find the cause I'll reply under this issue. Thanks!

bojone commented 4 years ago

OK, but I still suspect a data anomaly. If it really is the data volume, you could try duplicating just a few samples tens of thousands of times and see what happens.
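Something like this with pandas (a sketch, assuming `train_data` is a DataFrame):

```python
import pandas as pd

few = train_data.head(7)                                # a handful of known-clean rows
synthetic = pd.concat([few] * 5000, ignore_index=True)  # blow them up to ~35k rows
synthetic = synthetic.sample(frac=1, random_state=0)    # shuffle
big_generator = data_generator(synthetic, batch_size)   # train on this instead
```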

xuxuanbo commented 4 years ago

Right, I suspect the same. I went back and looked at my dataset-construction code: it duplicates the data first and then splits it into train/test/validation sets, with a fixed random seed. So it's quite possible that the dirty data previously happened not to fall into the training set, and once more data was added it did. I'll verify and report back. Thanks again!
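The effect is easy to see in isolation, e.g. with sklearn's `train_test_split` (toy data, not my actual pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

small = np.arange(10)
large = np.arange(12)  # the same rows plus two appended duplicates
_, test_small = train_test_split(small, test_size=0.3, random_state=42)
_, test_large = train_test_split(large, test_size=0.3, random_state=42)
print(sorted(test_small), sorted(test_large))
# the two test sets differ: a fixed seed fixes the shuffle, not which rows land where
```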

xuxuanbo commented 4 years ago

I duplicated 7 samples tens of thousands of times and indeed got no nan, so it does look like a data problem. But when I fed the full dataset into the network with batch_size set to 1, no single batch produced a nan loss either. Does 苏神 have any further suggestions for this situation?
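For reference, the per-batch scan was along these lines (a sketch; assumes `data_generator` follows the bert4keras `DataGenerator` pattern, whose iterator yields `(inputs, labels)` batches):

```python
import numpy as np

scan_generator = data_generator(train_data, batch_size=1)
for step, (x, y) in enumerate(scan_generator):
    loss, acc = model.test_on_batch(x, y)  # forward pass only, no weight update
    if np.isnan(loss):
        print('NaN loss at step %d, labels: %s' % (step, y))
        break
```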

zhouygg commented 4 years ago

This is probably not just a data issue; the model matters too. I first pretrain an ALBERT model on a domain-specific corpus and then run the downstream classification task. For the same downstream task, the more severely the pretraining overfits, the more likely the downstream classification is to produce nan.