Closed: kanchangcheng closed this issue 6 years ago.
Could you describe the environment your model was trained in?
I ran it inside the docker.paddlepaddlehub.com/paddle latest-gpu Docker image. @chengduoZH
I pulled it just a couple of days ago, so it should be the latest version.
How did you set the parameters of your Adam optimizer?
lr = 0.000002
Adam_optimizer = paddle.optimizer.Adam(
    learning_rate=lr, beta1=0.9, beta2=0.999, epsilon=0,
    gradient_clipping_threshold=10.0)
@chengduoZH
We previously trained on the same data with Keras and had no problems.
f_para_grad = open("para_grad.txt", 'a+')

# Illustrative handler name; `paddle` and `parameters` come from the surrounding training script.
def event_handler(event):
    # After each forward/backward pass, dump every parameter and its gradient.
    if isinstance(event, paddle.event.EndForwardBackward):
        for p in parameters.keys():
            print("Param %s, Grad %s" % (parameters.get(p), parameters.get_grad(p)))
            f_para_grad.write("Param %s\n" % p)
            for item in parameters.get(p):
                f_para_grad.write(str(item) + ' ')
            f_para_grad.write("\n")
            f_para_grad.write("Grad %s\n" % p)
            for item in parameters.get_grad(p):
                f_para_grad.write(str(item) + ' ')
            f_para_grad.write("\n")
I printed out the gradients, and they don't appear to be wrong.
I removed gradient_clipping_threshold=10.0, but I still get the same error.
Adam_optimizer = paddle.optimizer.Adam(
    learning_rate=lr,
    beta1=0.9, beta2=0.999, epsilon=0, gradient_clipping_threshold=10.0)
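Don't set epsilon to 0. Epsilon is normally a very small value, such as 0.000001; if you leave it unset, Adam uses its default epsilon. Adam divides each update by the square root of the second-moment estimate plus epsilon, so with epsilon = 0 any parameter whose second-moment estimate is still zero causes a division by zero.
Issue resolved.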
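To illustrate the arithmetic behind this advice, here is a minimal scalar sketch of one standard Adam step in plain Python. It is not Paddle's implementation, and `adam_step` is a hypothetical name. When the gradient history is all zeros, the denominator `sqrt(v_hat) + epsilon` is 0 for epsilon = 0. Python reports this as `ZeroDivisionError`; compiled code may instead produce NaN, or abort with a floating-point exception if FP traps are enabled.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.000002, beta1=0.9, beta2=0.999, epsilon=1e-6):
    """One standard Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    # With epsilon = 0 and v_hat = 0, this denominator is exactly zero.
    param = param - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v

# Zero gradient at the first step: a small epsilon leaves the parameter unchanged...
p, m, v = adam_step(1.0, 0.0, 0.0, 0.0, t=1, epsilon=1e-6)
print(p)  # 1.0

# ...while epsilon = 0 divides zero by zero.
try:
    adam_step(1.0, 0.0, 0.0, 0.0, t=1, epsilon=0)
except ZeroDivisionError as exc:
    print("epsilon=0 failed:", exc)
```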
[2023/05/24 20:21:18] ppocr INFO: cur metric, precision: 0, recall: 0, hmean: 0, fps: 7.03872743678866
[2023/05/24 20:21:35] ppocr INFO: save best model is to ./output/re_vi_layoutxlm_xfund_zh/best_accuracy
[2023/05/24 20:21:35] ppocr INFO: best metric, hmean: 0, precision: 0, recall: 0, fps: 7.03872743678866, best_epoch: 1
[2023/05/24 20:21:37] ppocr INFO: epoch: [1/50], global_step: 210, lr: 0.000004, loss: 0.267303, avg_reader_cost: 0.00025 s, avg_batch_cost: 0.19397 s, avg_samples: 1.0, ips: 5.15534 samples/s, eta: 1:44:15
[2023/05/24 20:21:39] ppocr INFO: epoch: [1/50], global_step: 220, lr: 0.000004, loss: 0.204350, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.23311 s, avg_samples: 1.0, ips: 4.28986 samples/s, eta: 1:41:35
[2023/05/24 20:21:42] ppocr INFO: epoch: [1/50], global_step: 230, lr: 0.000005, loss: 0.237258, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.19782 s, avg_samples: 1.0, ips: 5.05522 samples/s, eta: 1:38:50
[2023/05/24 20:21:44] ppocr INFO: epoch: [1/50], global_step: 240, lr: 0.000005, loss: 0.265792, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.18138 s, avg_samples: 1.0, ips: 5.51319 samples/s, eta: 1:36:10
Floating point exception (core dumped)
I get this error when training the RE model.
Issue resolved.
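Hi, it still isn't solved on my end.
I ran into this error during training. I've already looked through similar issues, but the error is still unresolved. Any help would be greatly appreciated!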