Error when converting the NER_CRF task to a multi-GPU version

paddydai commented 2 years ago

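The multi-GPU text classification example runs fine, but converting task_sequence_labeling_ner_crf.py to a single-machine multi-GPU version fails. The only addition is a CRF layer; I have tried many things and none of them work.

```python
import math
import os

os.environ['TF_KERAS'] = '1'  # assumed: the bert4keras multi-GPU examples use tf.keras

import tensorflow as tf
from bert4keras.backend import keras
from bert4keras.layers import ConditionalRandomField
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam
from bert4keras.snippets import DataGenerator
from keras.layers import Dense
from keras.models import Model

# config_path, checkpoint_path, bert_layers, num_labels, batch_size, etc.
# are defined elsewhere in the script.


class data_generator(DataGenerator):
    """Data generator
       (only needs to return one sample at a time)
    """
    def __iter__(self, random=False):
        for is_end, item in self.sample(random):
            # ...... (omitted in the original post)
            yield [token_ids, segment_ids], [labels]


strategy = tf.distribute.MirroredStrategy()  # set up the single-machine multi-GPU strategy
with strategy.scope():  # build the model under this strategy
    bert = build_transformer_model(
        config_path,
        checkpoint_path=None,  # pretrained weights need not be loaded at this point
        return_keras_model=False,  # return the bert4keras object, not the keras model
    )

    model = bert.model  # this is the actual keras model
    output_layer = 'Transformer-%s-FeedForward-Norm' % (bert_layers - 1)
    output = model.get_layer(output_layer).output
    output = Dense(num_labels, name='out')(output)
    CRF = ConditionalRandomField(lr_multiplier=crf_lr_multiplier, name='crf')
    output = CRF(output)

    model = Model(model.input, output)
    model.compile(
        loss=CRF.sparse_loss,
        optimizer=Adam(learing_rate),
        metrics=[CRF.sparse_accuracy],
    )
    model.summary()
    bert.load_weights_from_checkpoint(checkpoint_path)  # pretrained weights must be loaded last

pos_num, neg_num, train_data = load_data('data/brand_sample_yiliao.val')
train_generator = data_generator(train_data, batch_size)
sample_len = math.ceil((pos_num + neg_sample_rate * neg_num) / batch_size)
dataset = train_generator.to_dataset(
    types=[('float32', 'float32'), ('float32',)],
    shapes=[([None], [None]), ([None],)],  # with padded_batch=True below, this enables automatic padding
    names=[('Input-Token', 'Input-Segment'), ('crf',)],
    padded_batch=True
)  # the data must be converted to tf.data.Dataset; names match the input layer names
model.fit(
    dataset,
    steps_per_epoch=int(sample_len / split_num),
    verbose=2,
    epochs=epochs,
)
```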
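The `steps_per_epoch` arithmetic in the script above can be checked in isolation. The following is a minimal sketch with hypothetical numbers (the post does not give the real values of `pos_num`, `neg_num`, `neg_sample_rate`, `batch_size` or `split_num`), and the helper name `steps_per_epoch` is introduced here only for illustration.

```python
import math


def steps_per_epoch(pos_num, neg_num, neg_sample_rate, batch_size, split_num):
    """Mirror the two expressions used in the script above."""
    # Number of batches needed to cover the positives plus the sampled negatives.
    sample_len = math.ceil((pos_num + neg_sample_rate * neg_num) / batch_size)
    # int() truncates, so a trailing partial step is dropped.
    return sample_len, int(sample_len / split_num)


# Hypothetical values, not taken from the original post.
sample_len, steps = steps_per_epoch(
    pos_num=1000, neg_num=4000, neg_sample_rate=0.5, batch_size=32, split_num=4
)
print(sample_len, steps)  # 94 23
```

Because `int()` rounds toward zero, `sample_len / split_num = 23.5` becomes 23 steps here; whether dropping that partial step is intended depends on what `split_num` represents in the full script.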

Error message:

tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable out/bias/replica_2 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/out/bias/replica_2/N10tensorflow3VarE does not exist.
[[{{node replica_2/out/BiasAdd/ReadVariableOp}}]]
(1) Failed precondition: Error while reading resource variable out/bias/replica_2 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/out/bias/replica_2/N10tensorflow3VarE does not exist.
[[{{node replica_2/out/BiasAdd/ReadVariableOp}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_2/Const/_12024]]

Sometimes this error is reported instead:

tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable crf/trans from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/crf/trans/N10tensorflow3VarE does not exist.
[[{{node loss_1/crf_loss/ReadVariableOp_1}}]]
[[replica_2/loss/mul_1/_11943]]
(1) Failed precondition: Error while reading resource variable crf/trans from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/crf/trans/N10tensorflow3VarE does not exist.
[[{{node loss_1/crf_loss/ReadVariableOp_1}}]]
0 successful operations. 3 derived errors ignored.

Could Su (the author) please take a look?

paddydai commented 2 years ago

Following the error message, I added a few lines after the `with strategy.scope()` block and it now works, though I don't know why.

```python
session = keras.backend.get_session()
init = tf.global_variables_initializer()
session.run(init)
```

paddydai commented 2 years ago

This is a TensorFlow 1.14 problem; upgrading to TensorFlow 2+ is required to resolve it.
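For anyone who cannot upgrade right away, the manual initialization from the earlier comment can be guarded by a version check. This is only a sketch: the helper names `is_tf1` and `initialize_variables_if_tf1` are hypothetical, and treating every 1.x release as affected is an assumption, since this thread only reports the problem on TensorFlow 1.14.

```python
def is_tf1(version):
    """Return True for a TensorFlow 1.x version string such as '1.14.0'."""
    return int(version.split(".")[0]) < 2


def initialize_variables_if_tf1():
    """Apply the workaround from this thread only on TensorFlow 1.x.

    Assumption: call this after the `with strategy.scope():` block,
    as described in the comment above.
    """
    import tensorflow as tf  # imported lazily so is_tf1 works without TensorFlow

    if not is_tf1(tf.__version__):
        return False  # TF 2+: per this thread, no manual initialization is needed

    # Same three calls as in the workaround, using the TF 1.x API.
    session = tf.keras.backend.get_session()
    session.run(tf.global_variables_initializer())
    return True


print(is_tf1("1.14.0"), is_tf1("2.4.1"))  # True False
```

Because a global initializer re-runs the initializer of every variable, it is worth verifying that it does not overwrite weights loaded earlier by `load_weights_from_checkpoint`.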

bojone / bert4keras

Error when converting the NER_CRF task to a multi-GPU version #429