MaybeShewill-CV / CRNN_Tensorflow

Convolutional Recurrent Neural Networks(CRNN) for Scene Text Recognition
MIT License

Multi-GPU training #382

Closed zgsxwsdxg closed 4 years ago

zgsxwsdxg commented 4 years ago

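Hello! I have a question about your training code. In the multi-GPU training in train_shadownet.py, every GPU is given the same batch of training data. I think this may be a problem: in multi-GPU training, shouldn't each GPU get different data?

Looking forward to your reply. Thanks!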

MaybeShewill-CV commented 4 years ago

@zgsxwsdxg Yes. I'll check this when I have time and fix it promptly if it turns out to be wrong :)

MaybeShewill-CV commented 4 years ago

@zgsxwsdxg I've checked this. The code was indeed feeding duplicated data before; it is fixed now :)
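
For readers hitting the same question, the difference between giving every GPU the same batch and giving each GPU its own slice can be sketched in plain Python. This is only an illustration of the idea, not the fix that was committed to the repository; `split_batch` and the file names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_batch(batch: Sequence[T], num_gpus: int) -> List[List[T]]:
    """Split one global batch into disjoint, near-equal shards, one per GPU."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    base, extra = divmod(len(batch), num_gpus)
    shards, start = [], 0
    for gpu_id in range(num_gpus):
        # The first `extra` shards take one additional sample.
        size = base + (1 if gpu_id < extra else 0)
        shards.append(list(batch[start:start + size]))
        start += size
    return shards


# Hypothetical global batch of 8 samples shared between 2 GPUs.
global_batch = [f"img_{i}.jpg" for i in range(8)]
shards = split_batch(global_batch, num_gpus=2)
for gpu_id, shard in enumerate(shards):
    print(gpu_id, shard)
```

With this kind of split no sample appears on two GPUs in the same step, so the averaged gradient reflects the whole global batch rather than one sub-batch counted several times.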

zgsxwsdxg commented 4 years ago

Thanks! The author is really conscientious. Kudos.

MaybeShewill-CV commented 4 years ago

@zgsxwsdxg Give it a try. If there are still problems, you can raise them here. If your tests pass, let me know as well and I'll close the issue :)

zgsxwsdxg commented 4 years ago

OK, I'll give it a try.

zgsxwsdxg commented 4 years ago

020-03-31 13:48:50.165894: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:48:54.024627: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:48:54.409101: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:48:54.506058: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:48:55.083066: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
I0331 13:48:56.153539 21003 train_crnn.py:530] Epoch_Train: 101 total_loss= 131.229706 lr= 0.010000 mean_cost_time= 0.537349s
2020-03-31 13:49:03.355693: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:04.245014: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
I0331 13:49:05.966566 21003 train_crnn.py:530] Epoch_Train: 111 total_loss= 119.608185 lr= 0.010000 mean_cost_time= 0.533212s
2020-03-31 13:49:09.684304: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:11.708855: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:11.710654: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
I0331 13:49:15.651476 21003 train_crnn.py:530] Epoch_Train: 121 total_loss= 131.716919 lr= 0.010000 mean_cost_time= 0.523037s
2020-03-31 13:49:18.441830: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:18.765835: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:19.846209: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:21.377894: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:22.276449: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:23.298391: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
I0331 13:49:25.348873 21003 train_crnn.py:530] Epoch_Train: 131 total_loss= 130.303802 lr= 0.010000 mean_cost_time= 0.527877s
2020-03-31 13:49:26.299247: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:27.589161: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:28.211693: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:28.563087: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:30.087040: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:31.552785: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
I0331 13:49:35.105267 21003 train_crnn.py:530] Epoch_Train: 141 total_loss= 123.283287 lr= 0.010000 mean_cost_time= 0.532893s
2020-03-31 13:49:38.255606: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:39.879553: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:39.897993: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:42.783451: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:44.059701: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:44.244836: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
2020-03-31 13:49:44.745186: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.

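Hello, during training I am seeing "No valid path found" warnings, and the CTC loss sometimes becomes inf. How should I handle this? Is it normal? Looking forward to your reply.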

MaybeShewill-CV commented 4 years ago

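@zgsxwsdxg For "No valid path found", check whether any label in your annotation file is longer than the sequence length. The abnormal CTC loss may also be caused by incorrect labels. I have tested on the Synth90k dataset and on the generated Chinese dataset without any problems, so you could validate on those two datasets first.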
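
A check like the one suggested above could look roughly like this. It is a minimal sketch: the thread does not describe the annotation file format, so the `(image_path, label)` pairs and the helper name `find_too_long_labels` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def find_too_long_labels(
    samples: Iterable[Tuple[str, str]], seq_len: int
) -> List[Tuple[str, str]]:
    """Return the (image_path, label) pairs whose label exceeds seq_len characters."""
    return [(path, label) for path, label in samples if len(label) > seq_len]


# Hypothetical annotations; real ones would be parsed from the label file.
annotations = [
    ("a.jpg", "hello"),
    ("b.jpg", "x" * 30),
]
offenders = find_too_long_labels(annotations, seq_len=25)
for path, label in offenders:
    print(f"{path}: label length {len(label)} exceeds the sequence length")
```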

zgsxwsdxg commented 4 years ago

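I am sure about the length: my labels are shorter than the sequence length, because I added a restriction when generating the tfrecords so that only samples whose labels are shorter than the sequence length get written. One thing to note, though: I am training on my own Chinese data and redefined the input image size as (32, 300), so I make sure no label exceeds 75 characters. That is the right reasoning, isn't it?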

MaybeShewill-CV commented 4 years ago

@zgsxwsdxg Yes, your labels must not be longer than 75 :)
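
The figure of 75 is consistent with the image width being reduced by a factor of 4 before the recurrent layers (300 / 4 = 75). As a small sketch of that arithmetic, where the default factor of 4 is an assumption inferred from the numbers in this thread and should be checked against the actual network configuration:

```python
def sequence_length(image_width: int, downsample: int = 4) -> int:
    """Number of time steps produced for an image of the given width.

    `downsample` is the assumed total horizontal stride of the CNN.
    """
    if image_width <= 0 or downsample <= 0:
        raise ValueError("image_width and downsample must be positive")
    return image_width // downsample


seq_len = sequence_length(300)
print(seq_len)  # 75 under the assumed factor of 4
```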

zgsxwsdxg commented 4 years ago

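In that case, since the label length is guaranteed not to exceed the sequence length, that should not be the cause. Could it be that the CTC loss produces "No valid path found" and an inf loss for longer labels? Or is it caused by having too many samples with multiple GPUs? I am not sure. Do you have a deeper understanding of this? Any guidance would be appreciated, thanks.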

MaybeShewill-CV commented 4 years ago

@zgsxwsdxg Long labels do not cause this problem either. Multi-GPU training on both Synth90k and the Synth Chinese dataset works without problems :)
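
One further thing that may be worth checking, which is a general property of CTC rather than something established in this thread: CTC has to place a blank between two identical adjacent characters, so a label needs at least its own length plus the number of adjacent repeats in time steps. A label within the length limit can therefore still have no valid alignment. A minimal sketch, with the hypothetical helper `min_ctc_time_steps`:

```python
from typing import Sequence


def min_ctc_time_steps(label: Sequence) -> int:
    """Smallest number of time steps for which CTC has a valid alignment.

    Each pair of identical adjacent symbols needs one extra blank step.
    """
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


seq_len = 75  # the value discussed in this thread
label = "a" * 40  # 40 characters, but 39 adjacent repeats
needed = min_ctc_time_steps(label)
print(needed, needed <= seq_len)
```

Filtering samples with `min_ctc_time_steps(label) <= seq_len` instead of `len(label) <= seq_len` would rule this case out; whether it explains the warnings above is not confirmed.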

zgsxwsdxg commented 4 years ago

Thank you. I'll keep looking for the problem myself! This issue can be closed.

MaybeShewill-CV commented 4 years ago

@zgsxwsdxg OK, you can keep looking into it. The newly uploaded model at https://www.dropbox.com/sh/y4eaunamardibnd/AAB4h8NkakASDoc6Ek4knEGIa?dl=0 was trained on two GPUs for 80,000 iterations :)