bojone / bert4keras

Keras implementation of Transformers for humans
https://kexue.fm/archives/6915
Apache License 2.0

bert4keras 0.5.8 throws an error at runtime, looking for a fix~ #443

Closed cingtiye closed 2 years ago

cingtiye commented 2 years ago

When asking a question, please provide as much of the following information as possible:

Basic information

Core code

class LogRecord(keras.callbacks.Callback):
    """Log generated samples at each epoch end and the training loss every 20 steps."""

    def __init__(self):
        super(LogRecord, self).__init__()
        self._step = 1
        self.lowest = 1e10
        self.test_iter = data_input.get_sample(
            3,
            need_shuffle=False,
            cycle=True
        )
        self.response = Response(model_cls.model,
                                 model_cls.session,
                                 data_input,
                                 start_id=None,
                                 end_id=data_input.tokenizer._token_sep_id,
                                 maxlen=30
                                 )

    def on_epoch_end(self, epoch, logs=None):
        for i in range(2):
            sample = next(self.test_iter)
            res = self.response.generate(sample)
            logger.info('==============')
            logger.info('Context: {}'.format(sample['history']))
            logger.info('Goal: {}'.format(sample['goal']))
            logger.info('Answer: {}\n'.format(res))
            for j in range(7):
                # the next samples are largely duplicates; skip them
                next(self.test_iter)

    def on_batch_end(self, batch, logs=None):
        self._step += 1
        if self._step % 20 == 0:
            logger.info('step: {}  loss: {} '.format(self._step, logs['loss']))

checkpoint_callback = keras.callbacks.ModelCheckpoint(
    save_path, monitor='val_loss', verbose=0, save_best_only=False,
    save_weights_only=True, mode='min', period=3)
tensorboard_callback = keras.callbacks.TensorBoard(
    log_dir=join(save_dir, 'tf_logs'), histogram_freq=0, write_graph=False,
    write_grads=False, update_freq=320)

model_cls.model.fit_generator(
    data_input.generator(
        batch_size=batch_size,
        data_type=train_list,
        need_shuffle=True,
        cycle=True
    ),
    validation_data=data_input.generator(
        batch_size=batch_size,
        data_type=1,
        need_shuffle=True,
        cycle=True
    ),
    validation_steps=10,
    validation_freq=1,
    steps_per_epoch=steps_per_epoch,
    epochs=epoches,
    initial_epoch=init_epoch,
    verbose=2,
    class_weight=None,
    callbacks=[
        checkpoint_callback,
        tensorboard_callback,
        LogRecord()
    ]
)

Output

2022-02-23 14:37:15.198334: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-02-23 14:38:21.092821: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2022-02-23 14:38:21.092866: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "train/train_bert_lm.py", line 111, in <module>
    LogRecord()
  File "/usr/local/miniconda3/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[48,380,64], b.shape=[48,64,380], m=380, n=380, k=64, batch_size=48
         [[{{node Transformer-1-MultiHeadSelfAttention/einsum/MatMul}}]]
         [[Mean/_3291]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[48,380,64], b.shape=[48,64,380], m=380, n=380, k=64, batch_size=48
         [[{{node Transformer-1-MultiHeadSelfAttention/einsum/MatMul}}]]
0 successful operations.
0 derived errors ignored.

What I tried

- nvcc -V reports CUDA 10.0, while nvidia-smi reports CUDA 11.2.
- Tried adding os.environ['TF_KERAS'] = '1'; the error persists.
- Tried changing the TF version (< 2.0); the error persists.
- Tried installing nvidia-tensorflow, but many of its dependencies (one by one) fail to install.
- Tried running this code under CUDA 11.x, but that requires TF 2.x, and then keras.backend reports it has no attribute set_session().
- Downgrading the 11.2 (reported by nvidia-smi) to 10.0 would probably avoid the problem, but I am not allowed to do that.
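One commonly suggested mitigation I found for Blas xGEMMBatched launch failures is to stop TensorFlow from pre-allocating all GPU memory, since this error is often an out-of-memory symptom rather than a code bug. A minimal sketch of that, assuming TF 1.x with standalone Keras (allow_growth and set_session are standard TF 1.x / Keras APIs, nothing specific to bert4keras):

import tensorflow as tf
from keras import backend as K

# Run before building the model: allocate GPU memory on demand
# instead of grabbing the whole card at start-up.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow the allocation as needed
K.set_session(tf.Session(config=config))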

Does anyone have a solution? Much appreciated~

zzuchen commented 2 years ago

Hi, I have the same problem as you. How did you solve it in the end?

cingtiye commented 2 years ago

1. CUDA is 10.0 (screenshot)
2. nvcc -V reports 10.0 (screenshot)

zzuchen commented 2 years ago

Did you downgrade the 11.2 (from nvidia-smi) to 10.0? I have heard the RTX 3090 is incompatible with CUDA 10. Is that the card you have?

cingtiye commented 2 years ago

I am using a 1080 Ti~

zzuchen commented 2 years ago

Okay, got it, thanks!~

nameless0704 commented 1 year ago

> Okay, got it, thanks!~

Hi, did you manage to solve this afterwards? Was it with this method? How did you downgrade the version shown by nvidia-smi? I have reinstalled CUDA many times, and I read this StackOverflow post, which says the mismatch is fine because nvidia-smi only reports the highest CUDA version the driver supports. Everything else I run works, and tf.test.is_gpu_available() checks out, but I still get this error. How did you solve it in the end?
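The checks that pass for me look like this (a minimal sketch, assuming TF 1.x):

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                   # installed TF version
print(tf.test.is_gpu_available())       # True if a GPU is usable
print(device_lib.list_local_devices())  # devices TF actually detects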

cingtiye commented 1 year ago

> Hi, did you manage to solve this afterwards? Was it with this method? How did you downgrade the version shown by nvidia-smi? I have reinstalled CUDA many times, and I read this StackOverflow post, which says the mismatch is fine because nvidia-smi only reports the highest CUDA version the driver supports. Everything else I run works, and tf.test.is_gpu_available() checks out, but I still get this error. How did you solve it in the end?

Reinstalling CUDA did not really help. I ended up swapping in a different graphics card.