fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
https://gitee.com/fastnlp/fastNLP
Apache License 2.0
3.05k stars 451 forks

An error encountered when using Trainer #404

Open warrior-yyyan opened 2 years ago

warrior-yyyan commented 2 years ago

With Python 3.9 and torch 1.11, using Trainer raises this error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32, 50, 711]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True). A custom training loop built on DataSetIter does not raise it. I searched online for fixes, and the cause appears to be an in-place modification. Is this a torch version issue? With a newer torch version, how can I resolve it if I still want to use Trainer directly rather than a custom training loop?
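For reference, this class of error can be reproduced outside fastNLP. The snippet below is a minimal sketch and not the model from this issue: it assumes only that PyTorch is installed, and the tensor shape and the function name `reproduce_inplace_error` are made up for illustration. ReLU keeps its output for the backward pass, so editing that output in place invalidates the saved tensor.

```python
import torch


def reproduce_inplace_error():
    """Trigger the autograd in-place error on CPU and return its message."""
    x = torch.randn(4, 3, requires_grad=True)
    h = torch.relu(x)  # autograd saves this output to compute ReLU's gradient
    h += 1             # in-place edit bumps the saved tensor's version counter
    try:
        h.sum().backward()
    except RuntimeError as err:
        return str(err)
    return None  # reached only if backward unexpectedly succeeds


message = reproduce_inplace_error()
print(message)
```

Replacing `h += 1` with the out-of-place `h = h + 1` avoids the error in this sketch. Calling `torch.autograd.set_detect_anomaly(True)` before the forward pass, as the hint in the error message suggests, additionally reports the forward operation whose gradient could not be computed.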

yhcc commented 2 years ago

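Judging from the error, the network contains a ReLU that has inplace=True set. Can you check whether your network has this problem? Also, does it run with device='cpu'? If not, what error do you get? With CUDA, the place where the error is raised may not be where it actually occurred.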
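One way to run the suggested check is to walk the model and switch off every `inplace` flag before training. This is only a sketch under assumptions: `disable_inplace` is a hypothetical helper, not part of fastNLP or PyTorch, and the `nn.Sequential` model is a stand-in for the real network. It relies on the fact that modules such as `nn.ReLU` and `nn.Dropout` store the option in a plain `inplace` attribute.

```python
import torch.nn as nn


def disable_inplace(model):
    """Set inplace=False on every submodule that has it enabled.

    Returns the names of the submodules that were changed.
    """
    changed = []
    for name, module in model.named_modules():
        # Modules without an `inplace` attribute are skipped via the default.
        if getattr(module, "inplace", False):
            module.inplace = False
            changed.append(name)
    return changed


# Stand-in model for illustration only; not the network from this issue.
model = nn.Sequential(
    nn.Linear(3, 3),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1, inplace=True),
)
changed = disable_inplace(model)
print(changed)
```

If the returned list is empty, the in-place edit is probably written by hand in `forward` (for example `x += y` or a method ending in an underscore such as `x.add_(y)`), and this helper will not find it. Running one batch on CPU with anomaly detection enabled is then likely the quicker way to locate it.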