[Bug]: question_generation/unimo-text fails during GPU training

YYGe01 commented 1 year ago

Software environment

- paddlepaddle:2.3.2
- paddlepaddle-gpu: 2.3.2.post116
- paddlenlp: paddleNLP/develop
- windows10

Duplicate issues

[X] I have searched the existing issues

Error description

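Problem: running question_generation/unimo-text/train.py raises an error.
Training on CPU works, but training on GPU fails. The same problem is reported in:
https://github.com/PaddlePaddle/PaddleOCR/issues/6936
https://github.com/PaddlePaddle/PaddleDetection/issues/6252
Someone there suggested downgrading paddle. I am on CUDA 11.6 and am not sure whether an older paddle version supports it.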

Steps to reproduce & code

The dureader_qg dataset is fairly large, so I took one sample from it, copied it 10 times, and used the result as both train.json and test.json. The sample is:

{"context": "欠条是永久有效的,未约定还款期限的借款合同纠纷,诉讼时效自债权人主张债权之日起计算,时效为2年。 根据《中华人民共和国民法通则》第一百三十五条:向人民法院请求保护民事权利的诉讼时效期间为二年,法律另有规定的除外。 第一百三十七条:诉讼时效期间从知道或者应当知道权利被侵害时起计算。但是,从权利被侵害之日起超过二十年的,人民法院不予保护。有特殊情况的,人民法院可以延长诉讼时效期间。 第六十二条第(四)项:履行期限不明确的,债务人可以随时履行,债权人也可以随时要求履行,但应当给对方必要的准备时间。","answer": "永久有效", "question": "欠条的有效期是多久"}
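For anyone who wants to rebuild the same tiny dataset, the step above can be sketched roughly as follows. This is only an illustration: it assumes the data files hold one JSON object per line, and the helper name `write_repeated_sample` and its parameters are hypothetical, not part of PaddleNLP.

```python
import json
from pathlib import Path


def write_repeated_sample(sample, out_dir, copies=10,
                          filenames=("train.json", "test.json")):
    """Write `sample` `copies` times to each file, one JSON object per line.

    Assumption: the example reads line-delimited JSON; adjust if the
    actual loader expects a different layout.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps the Chinese text readable in the output files.
    line = json.dumps(sample, ensure_ascii=False)
    paths = []
    for name in filenames:
        path = out_dir / name
        path.write_text("\n".join([line] * copies) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_repeated_sample(sample, "some_dir")`, with `sample` set to the dict shown above, should produce a `train.json` and a `test.json` with 10 identical lines each.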

Eval begin...
Error: ../paddle/phi/kernels/funcs/elementwise_functor.h:545 Assertion `b != 0` failed. InvalidArgumentError: Integer division by zero encountered in (floor) divide. Please check the input value.
...
Error: ../paddle/phi/kernels/funcs/elementwise_functor.h:545 Assertion `b != 0` failed. InvalidArgumentError: Integer division by zero encountered in (floor) divide. Please check the input value.
Traceback (most recent call last):
  File "C:/Users/49476/glory/work/learn/PaddleNLP/examples/question_generation/unimo-text/train.py", line 303, in <module>
    run(args)
  File "C:/Users/49476/glory/work/learn/PaddleNLP/examples/question_generation/unimo-text/train.py", line 222, in run
    bleu4 = evaluation(model_eval, dev_data_loader,
  File "C:\Users\49476\anaconda3\envs\test_paddlenlp\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "C:\Users\49476\anaconda3\envs\test_paddlenlp\lib\site-packages\paddle\fluid\dygraph\base.py", line 354, in _decorate_function
    return func(*args, **kwargs)
  File "C:/Users/49476/glory/work/learn/PaddleNLP/examples/question_generation/unimo-text/train.py", line 251, in evaluation
    ids, scores = model.generate(
  File "C:\Users\49476\anaconda3\envs\test_paddlenlp\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "C:\Users\49476\anaconda3\envs\test_paddlenlp\lib\site-packages\paddle\fluid\dygraph\base.py", line 354, in _decorate_function
    return func(*args, **kwargs)
  File "C:\Users\49476\glory\work\learn\PaddleNLP\paddlenlp\transformers\generation_utils.py", line 942, in generate
    return self.beam_search(input_ids, beam_scorer,
  File "C:\Users\49476\glory\work\learn\PaddleNLP\paddlenlp\transformers\generation_utils.py", line 1164, in beam_search
    beam_outputs = beam_scorer.process(
  File "C:\Users\49476\glory\work\learn\PaddleNLP\paddlenlp\transformers\generation_utils.py", line 152, in process
    if self._done[batch_idx] == 1:
  File "C:\Users\49476\anaconda3\envs\test_paddlenlp\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 669, in __bool__
    return self.__nonzero__()
  File "C:\Users\49476\anaconda3\envs\test_paddlenlp\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 666, in __nonzero__
    return bool(np.all(tensor.__array__() > 0))
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)

YYGe01 commented 1 year ago

CUDA information:

import paddle
paddle.utils.run_check()

Output:
W1121 16:17:49.512596 39064 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 11.6
W1121 16:17:49.517609 39064 gpu_resources.cc:91] device: 0, cuDNN Version: 8.6.
PaddlePaddle works well on 1 GPU.
PaddlePaddle works well on 1 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

Process finished with exit code 0

westfish commented 1 year ago

On Windows, we recommend installing paddle 2.2.2.

PaddlePaddle / PaddleNLP

[Bug]: question_generation/unimo-text fails during GPU training #3840
