Text classification example code fails when run on GPU

fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

https://gitee.com/fastnlp/fastNLP

Apache License 2.0

3.06k stars 450 forks

Text classification example code fails when run on GPU #349

Open 503718696 opened 3 years ago

503718696 commented 3 years ago

The text classification example code runs normally in CPU mode, but running it on GPU fails with the following error: RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

yhcc commented 3 years ago

This is caused by the BiLSTM change in PyTorch 1.5 and later; see https://github.com/pytorch/pytorch/issues/43227. The latest fastNLP has already been adapted to it. Installing with pip install git+https://gitee.com/fastnlp/fastNLP@dev can fix it.
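
For readers who hit the same message in their own code, the constraint named in the error can be sketched as follows. This is a minimal illustration, not fastNLP's actual patch: it assumes the failure comes from passing a CUDA lengths tensor to torch.nn.utils.rnn.pack_padded_sequence, which expects lengths on the CPU. The layer sizes, batch shape, and variable names are arbitrary assumptions chosen for the example.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Use the GPU when one is present; the sketch also runs on CPU-only machines.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Arbitrary sizes, chosen only for illustration.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True).to(device)
batch = torch.randn(3, 5, 8, device=device)       # (batch, max_len, features)
seq_len = torch.tensor([5, 3, 2], device=device)  # true length of each sequence

# The data stays on `device`, but the lengths are moved to the CPU before
# packing. Passing a CUDA `seq_len` here is what raises the reported error.
packed = pack_padded_sequence(batch, seq_len.cpu(), batch_first=True, enforce_sorted=False)
output, _ = lstm(packed)
padded, out_len = pad_packed_sequence(output, batch_first=True)

print(padded.shape)  # torch.Size([3, 5, 32])
```

Only the lengths are copied to the CPU; the inputs and the LSTM weights remain on the GPU, so the extra transfer is small.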

MrRace commented 3 years ago

> This is caused by the BiLSTM change in PyTorch 1.5 and later; see pytorch/pytorch#43227. The latest fastNLP has already been adapted to it. Installing with pip install git+https://gitee.com/fastnlp/fastNLP@dev can fix it.

I installed dev following your suggestion and still get the error: RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor. I am using torch 1.7.1.

yamonc commented 3 years ago

With PyTorch 1.6 training starts, but shortly after it begins I get this error: RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure. Training again then reports cuDNN error: CUDNN_STATUS_MAPPING_ERROR and terminates, so training cannot complete.

yhcc commented 3 years ago

> I installed dev following your suggestion and still get the error: RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor. I am using torch 1.7.1.

Try running pip uninstall fastNLP first, then install again.

yhcc commented 3 years ago

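> With PyTorch 1.6 training starts, but shortly after it begins I get this error: RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure. Training again then reports cuDNN error: CUDNN_STATUS_MAPPING_ERROR and terminates, so training cannot complete.

Does other, non-fastNLP training code run normally? This kind of error can be caused by an incomplete installation of some packages.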

yamonc commented 3 years ago

> Does other, non-fastNLP training code run normally? This kind of error can be caused by an incomplete installation of some packages.

I just tried. At the bottom of the text classification tutorial there is a pretrained BERT example that uses BERT for text classification. BERT works, but the text classification example above it does not.

yamonc commented 3 years ago

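One more small issue: a comment is wrong. In the "Building a Vocabulary" part of the "Vocabulary in fastNLP" section, vocab.to_index('复') should be 2. Index 0 is the pad token, index 1 is the unk token, and index 2 is the first input character, which is 复, so the return value is 2.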
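
The indexing described above can be sketched with a toy class. ToyVocabulary is a hypothetical stand-in written for this illustration and is not fastNLP's Vocabulary implementation; the input characters other than 复 and the token strings `<pad>` and `<unk>` are assumptions. It only shows why the first added character lands at index 2 when two indices are reserved.

```python
class ToyVocabulary:
    """Hypothetical stand-in for illustration; not fastNLP's Vocabulary class."""

    def __init__(self, padding="<pad>", unknown="<unk>"):
        self.unknown = unknown
        # Reserved tokens take the first two indices: 0 for padding, 1 for unknown.
        self.word2idx = {padding: 0, unknown: 1}

    def add_word_lst(self, words):
        for word in words:
            # Each unseen word gets the next free index; repeats keep their index.
            self.word2idx.setdefault(word, len(self.word2idx))

    def to_index(self, word):
        # Words that were never added fall back to the unknown index.
        return self.word2idx.get(word, self.word2idx[self.unknown])


vocab = ToyVocabulary()
vocab.add_word_lst(["复", "旦", "大", "学"])  # assumed example input
print(vocab.to_index("复"))  # 2
print(vocab.to_index("旦"))  # 3
print(vocab.to_index("没"))  # 1, the unknown index
```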

yhcc commented 3 years ago

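> One more small issue: a comment is wrong. In the "Building a Vocabulary" part of the "Vocabulary in fastNLP" section, vocab.to_index('复') should be 2. Index 0 is the pad token, index 1 is the unk token, and index 2 is the first input character, which is 复, so the return value is 2.

OK, thank you for the careful observation!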

yhcc commented 3 years ago

> I just tried. At the bottom of the text classification tutorial there is a pretrained BERT example that uses BERT for text classification. BERT works, but the text classification example above it does not.

OK, we will look into it as soon as possible.

zxjlm commented 3 years ago

Has this issue been resolved yet?

pytorch 1.8.1
FastNLP 0.6.0

I ran into the same problem during training. Do I need to downgrade PyTorch to 1.6 or lower?

yhcc commented 3 years ago

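> Has this issue been resolved yet?
> pytorch 1.8.1
> FastNLP 0.6.0
> I ran into the same problem during training. Do I need to downgrade PyTorch to 1.6 or lower?

Are you using the fastNLP installed with pip install git+https://gitee.com/fastnlp/fastNLP@dev, or the one from a plain pip install fastNLP?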

zxjlm commented 3 years ago

> Are you using the fastNLP installed with pip install git+https://gitee.com/fastnlp/fastNLP@dev, or the one from a plain pip install fastNLP?

I used pip install git+https://gitee.com/fastnlp/fastNLP@dev

yhcc commented 3 years ago

I suggest running pip uninstall fastNLP and then pip install git+https://gitee.com/fastnlp/fastNLP@dev --force-reinstall. I was not able to reproduce this problem on PyTorch 1.7.
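
After reinstalling, it may help to confirm which package versions the active Python environment actually sees, since a stale copy in another environment would explain the error persisting. The following is a small sketch using only the standard library; installed_version is a helper name invented for this example, and the distribution names "fastNLP" and "torch" are assumed to match what pip registered. A dev install may report a version string that does not by itself identify the branch.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them if pip lists different names.
for name in ("fastNLP", "torch"):
    print(name, installed_version(name) or "not installed")
```

Running this with the same interpreter that launches training avoids checking a different environment by mistake.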