PyTorch's built-in CTC loss seems to have a bug; switching to a third-party implementation fixes the NaN loss and non-convergence problems

flyindirt commented 4 years ago

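I'm using PyTorch 1.2. I started with PyTorch's built-in CTC loss and ran into several problems during training: 1) after a few batches the loss became NaN, and I could only keep training after reducing the batch size to 1; 2) the training loss only started dropping quickly once I switched to the RMSprop optimizer, reaching below 0.05, but validation accuracy stayed at 0 and validation loss stayed above 20 (to make the effect visible, I had made the validation set identical to the training set). Later I noticed that other open-source implementations do not use PyTorch's built-in CTC loss, so I swapped in a third-party CTC loss, and it worked right away: accuracy rose with every epoch. The PyTorch wrapper I used is https://github.com/jpuigcerver/pytorch-baidu-ctc , and the underlying CTC source comes from Baidu's https://github.com/baidu-research/warp-ctc . Building it directly fails, but following the known fix for compiling Baidu's CTC (delete the file and replace it with a symlink) makes the build succeed. Quoting that fix:

I experienced this same issue and fixed it. The problem is that the ctc_entrypoint.cu file needs to be a symlink. So, go to the src dir and run: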
rm ctc_entrypoint.cu
ln -s ctc_entrypoint.cpp ctc_entrypoint.cu
Then run make. With PyTorch 1.2, it compiled and installed successfully. I hope this helps.
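
For anyone who wants to rule out usage issues before switching implementations, here is a minimal sketch of calling the built-in loss. It is not taken from this repository, and nothing in this thread confirms that it addresses the NaN reported above. It assumes a PyTorch version whose nn.CTCLoss accepts zero_infinity, uses index 0 as the blank, and feeds random tensors in place of real model output; all shapes and names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, N, C = 30, 4, 20   # time steps, batch size, classes (index 0 = blank)
S = 10                # maximum target length in the batch

# Stand-in for raw network outputs of shape (T, N, C); hypothetical data.
logits = torch.randn(T, N, C, requires_grad=True)

# nn.CTCLoss expects log-probabilities, so apply log_softmax over classes.
log_probs = logits.log_softmax(2)

# Targets must not contain the blank index (0 here).
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

# Necessary condition for a valid alignment: the input is at least as long
# as the target (repeated adjacent labels need extra frames on top of this).
assert bool((input_lengths >= target_lengths).all())

# zero_infinity=True zeroes infinite losses and their gradients.
criterion = nn.CTCLoss(blank=0, reduction="mean", zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()

loss_is_finite = bool(torch.isfinite(loss))
grad_is_finite = bool(torch.isfinite(logits.grad).all())
print(loss.item(), loss_is_finite, grad_is_finite)
```

If the loss is finite on a check like this but still turns into NaN in real training, the cause is more likely in the data (for example, targets longer than the feature sequence) or in the optimizer settings than in the way the loss is called.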

Sierkinhane commented 4 years ago

With PyTorch 1.2.0's CTC loss I don't get NaN. I've now updated the repository, and training for one epoch reaches 94% accuracy.

ingale726 commented 3 years ago

My take: https://github.com/Sierkinhane/CRNN_Chinese_Characters_Rec/issues/281

Sierkinhane / CRNN_Chinese_Characters_Rec

PyTorch's built-in CTC loss seems to have a bug; switching to a third-party implementation fixes the NaN loss and non-convergence problems #185