PaddlePaddle / Paddle

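PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)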
http://www.paddlepaddle.org/
Apache License 2.0

Floating point exception (core dumped) #8702

Closed kanchangcheng closed 6 years ago

kanchangcheng commented 6 years ago
*** Aborted at 1519973684 (unix time) try "date -d @1519973684" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGFPE (@0x7f89bd8a840a) received by PID 46 (TID 0x7f8793992700) from PID 18446744072594555914; stack trace: ***
    @     0x7f89da7c6390 (unknown)
    @     0x7f89bd8a840a _ZN6paddle17AssignCpuEvaluateIRNS_14TensorAssignOpINS_11BaseMatrixTIfEENS_14TensorBinaryOpIN4hppl6binary3addIfEEKNS_13TensorUnaryOpINS5_5unary9mul_scaleIfEEKS3_fEESF_fEEfEEJRNS1_IS3_NS4_IS8_SF_KNS9_ISC_KNS9_INSA_6squareIfEESD_fEEfEEfEEfEERNS1_IS3_NS4_INS6_3subIfEESD_KNS4_INS6_3divIfEESF_KNS9_INSA_9add_scaleIfEEKNS9_INSA_7sqrt_opIfEESD_fEEfEEfEEfEEfEEEEEviibOT_DpOT0_
    @     0x7f89bd8ab637 _ZN6paddle14AssignEvaluateIRNS_14TensorAssignOpINS_11BaseMatrixTIfEENS_14TensorBinaryOpIN4hppl6binary3addIfEEKNS_13TensorUnaryOpINS5_5unary9mul_scaleIfEEKS3_fEESF_fEEfEEJRNS1_IS3_NS4_IS8_SF_KNS9_ISC_KNS9_INSA_6squareIfEESD_fEEfEEfEEfEERNS1_IS3_NS4_INS6_3subIfEESD_KNS4_INS6_3divIfEESF_KNS9_INSA_9add_scaleIfEEKNS9_INSA_7sqrt_opIfEESD_fEEfEEfEEfEEfEEEEEvOT_DpOT0_
    @     0x7f89bd8a4abb paddle::adamApply()
    @     0x7f89bd894496 paddle::AdamParameterOptimizer::update()
    @     0x7f89bd894956 paddle::OptimizerWithGradientClipping::update()
    @     0x7f89bd88906f paddle::SgdThreadUpdater::threadUpdateDense()
    @     0x7f89bd88a0ef _ZNSt17_Function_handlerIFvimEZN6paddle16SgdThreadUpdater11finishBatchEfEUlimE_E9_M_invokeERKSt9_Any_dataim
    @     0x7f89bd6aa39c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
    @     0x7f89cadebc80 (unknown)
    @     0x7f89da7bc6ba start_thread
    @     0x7f89da4f241d clone
    @                0x0 (unknown)
Floating point exception (core dumped)

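I hit this error during training. I have already looked at similar issues, but the error is still not resolved. Any help would be appreciated!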

chengduoZH commented 6 years ago

Could you describe the environment in which your model was trained?

shboy commented 6 years ago

I ran it inside this Docker image: docker.paddlepaddlehub.com/paddle latest-gpu @chengduoZH

I pulled it just two days ago, so it should be the latest version.

chengduoZH commented 6 years ago

How are the parameters of your Adam optimizer set?

shboy commented 6 years ago

    lr = 0.000002
    Adam_optimizer = paddle.optimizer.Adam(
        learning_rate=lr,
        beta1=0.9, beta2=0.999, epsilon=0, gradient_clipping_threshold=10.0)

@chengduoZH

shboy commented 6 years ago

We previously trained on the same data with Keras and had no problems.

chengduoZH commented 6 years ago

Related issues: https://github.com/PaddlePaddle/Paddle/issues/2262 and https://github.com/PaddlePaddle/Paddle/issues/2563

shboy commented 6 years ago
    # Inside the event handler: dump every parameter and its gradient
    # after each forward/backward pass.
    if isinstance(event, paddle.event.EndForwardBackward):
        with open("para_grad.txt", "a+") as f_para_grad:
            for p in parameters.keys():
                param = parameters.get(p)
                grad = parameters.get_grad(p)
                print("Param %s, Grad %s" % (param, grad))
                f_para_grad.write("Param %s\n" % p)
                for item in param:
                    f_para_grad.write(str(item) + " ")
                f_para_grad.write("\n")
                f_para_grad.write("Grad %s\n" % p)
                for item in grad:
                    f_para_grad.write(str(item) + " ")
                f_para_grad.write("\n")


I printed out the gradients, and they do not seem to be wrong either.

shboy commented 6 years ago

    lr = 0.000002
    Adam_optimizer = paddle.optimizer.Adam(
        learning_rate=lr,
        beta1=0.9, beta2=0.999, epsilon=0, gradient_clipping_threshold=10.0)

I removed gradient_clipping_threshold=10.0, and I still get the same error.

chengduoZH commented 6 years ago
    Adam_optimizer = paddle.optimizer.Adam(
        learning_rate=lr,
        beta1=0.9, beta2=0.999, epsilon=0, gradient_clipping_threshold=10.0)

Do not set epsilon to 0. epsilon is normally a very small value, such as 0.000001. If you do not set it here, Adam uses its default epsilon.
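
To illustrate why epsilon matters, here is a minimal sketch of a single Adam update for one scalar parameter. It assumes the textbook Adam formula, not Paddle's actual adamApply implementation, and the names adam_step and try_step are hypothetical helpers made up for this example. When a gradient element and its second-moment estimate are both zero, the denominator sqrt(v_hat) + epsilon is exactly 0 if epsilon=0. Python reports this as ZeroDivisionError; in native code the same division can surface as SIGFPE if floating-point traps are enabled, which may be what the stack trace above shows.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.000002,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One textbook Adam update for a single scalar parameter (illustrative only)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    # epsilon keeps the denominator away from zero when v_hat == 0.
    param = param - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v


def try_step(epsilon):
    """Run one step with a zero gradient; return None if the division fails."""
    try:
        return adam_step(1.0, 0.0, 0.0, 0.0, t=1, epsilon=epsilon)
    except ZeroDivisionError:
        return None


if __name__ == "__main__":
    print("epsilon=0    ->", try_step(0.0))
    print("epsilon=1e-6 ->", try_step(1e-6))
```

With a zero gradient, the step with epsilon=0 fails while the step with a small positive epsilon leaves the parameter unchanged, which is why either a small positive value or the optimizer's default is the safer choice.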

chengduoZH commented 6 years ago

The problem has been solved.

Littlehead27 commented 1 year ago

[2023/05/24 20:21:18] ppocr INFO: cur metric, precision: 0, recall: 0, hmean: 0, fps: 7.03872743678866
[2023/05/24 20:21:35] ppocr INFO: save best model is to ./output/re_vi_layoutxlm_xfund_zh/best_accuracy
[2023/05/24 20:21:35] ppocr INFO: best metric, hmean: 0, precision: 0, recall: 0, fps: 7.03872743678866, best_epoch: 1
[2023/05/24 20:21:37] ppocr INFO: epoch: [1/50], global_step: 210, lr: 0.000004, loss: 0.267303, avg_reader_cost: 0.00025 s, avg_batch_cost: 0.19397 s, avg_samples: 1.0, ips: 5.15534 samples/s, eta: 1:44:15
[2023/05/24 20:21:39] ppocr INFO: epoch: [1/50], global_step: 220, lr: 0.000004, loss: 0.204350, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.23311 s, avg_samples: 1.0, ips: 4.28986 samples/s, eta: 1:41:35
[2023/05/24 20:21:42] ppocr INFO: epoch: [1/50], global_step: 230, lr: 0.000005, loss: 0.237258, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.19782 s, avg_samples: 1.0, ips: 5.05522 samples/s, eta: 1:38:50
[2023/05/24 20:21:44] ppocr INFO: epoch: [1/50], global_step: 240, lr: 0.000005, loss: 0.265792, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.18138 s, avg_samples: 1.0, ips: 5.51319 samples/s, eta: 1:36:10
Floating point exception (core dumped)

I get this error when training the RE model.

Littlehead27 commented 1 year ago

The problem has been solved.


Hello, my problem is still not solved.