PaddlePaddle / models


ocr_recognition: NaN during recognition model training prevents training #4261

Closed endy-see closed 4 years ago

endy-see commented 4 years ago

The error message is: NaN or Inf found in input tensor. The learning rate has already been reduced to 0.0001 and training is very slow, yet NaN still appears. How should I deal with this problem?

chenwhql commented 4 years ago

A few things to confirm with you:

  1. Is this problem easy to reproduce? We suggest adding FLAGS_check_nan_inf=True to locate which Op the NaN comes from:
    /**
    * Operator related FLAG
    * Name: FLAGS_check_nan_inf
    * Since Version: 0.13.0
    * Value Range: bool, default=false
    * Example:
    * Note: Used to debug. Checking whether operator produce NAN/INF or not.
    */
    DEFINE_bool(check_nan_inf, false,
            "Checking whether operator produce NAN/INF or not. It will be "
            "extremely slow so please use this flag wisely.");
  2. Is this model the original version from the models repo? Could you describe what changes you made, and the exact steps you take that trigger this problem?
endy-see commented 4 years ago
  1. It is easy to reproduce. Is FLAGS_check_nan_inf=True added directly in front of python train.py?
  2. I am using my own dataset, a bit over 10k samples in total, with batch_size=32. The only obvious change is the dictionary size: the dictionary contains just 31 Chinese characters. I also noticed something very strange: if num_classes is set to 95 there is no NaN, but if num_classes is set exactly to the dictionary size, NaN appears (it happens every single time, 100% reproducible).
chenwhql commented 4 years ago
> 1. It is easy to reproduce. Is FLAGS_check_nan_inf=True added directly in front of python train.py?
> 2. I am using my own dataset, a bit over 10k samples in total, with batch_size=32. The only obvious change is the dictionary size: the dictionary contains just 31 Chinese characters. I also noticed something very strange: if num_classes is set to 95 there is no NaN, but if num_classes is set exactly to the dictionary size, NaN appears (it happens every single time, 100% reproducible).

Hello, 1. Yes, we suggest you try it and see which Op fails; you can keep posting the results below (one way to enable the flag is sketched below).

  2. Give me a moment to look into this one; I am not very familiar with the model.
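
For reference, a minimal sketch of one way to enable the check (the flag name and its placement in front of python train.py come from this thread; the environment-variable variant below is an assumption about how Paddle 1.x picks up FLAGS_*):

        # Equivalent to running FLAGS_check_nan_inf=True python train.py ... on the command line.
        # Assumption (not stated in the thread): paddle.fluid reads FLAGS_* environment variables
        # when it is imported, so the variable must be set before the import.
        import os
        os.environ["FLAGS_check_nan_inf"] = "True"

        import paddle.fluid as fluid  # NaN/Inf checking is now active for every operator run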
endy-see commented 4 years ago
> > 1. It is easy to reproduce. Is FLAGS_check_nan_inf=True added directly in front of python train.py?
> > 2. I am using my own dataset, a bit over 10k samples in total, with batch_size=32. The only obvious change is the dictionary size: the dictionary contains just 31 Chinese characters. I also noticed something very strange: if num_classes is set to 95 there is no NaN, but if num_classes is set exactly to the dictionary size, NaN appears (it happens every single time, 100% reproducible).
>
> Hello, 1. Yes, we suggest you try it and see which Op fails; you can keep posting the results below.
>
> 2. Give me a moment to look into this one; I am not very familiar with the model.

After adding FLAGS_check_nan_inf, the process crashes before the iteration logs even start. The output is as follows:

[root@07ba1e8482af ocr_recognition]# sh train_cq_src_fake_without_padding_height48.sh
----------- Configuration Arguments -----------
average_window: 0.15
batch_size: 32
eval_period: 10000
gradient_clip: 10.0
init_model: None
l2decay: 0.0001
log_period: 1000
lr: 0.0001
lr_decay_strategy: None
max_average_window: 12500
min_average_window: 10000
model: crnn_ctc
momentum: 0.9
parallel: 1
profile: False
save_model_dir: output/ChongQiongCourt/
save_model_period: 10000
skip_batch_num: 0
skip_test: False
test_images: dataset/ChongCourt-src+fake-ForServer-without-padding-resize-same-height48/test_images
test_list: dataset/ChongCourt-src+fake-ForServer-without-padding-resize-same-height48/test.list
total_step: 2000000
train_images: dataset/ChongCourt-src+fake-ForServer-without-padding-resize-same-height48/train_images
train_list: dataset/ChongCourt-src+fake-ForServer-without-padding-resize-same-height48/train.list
use_gpu: True

/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/evaluator.py:72: Warning: The EditDistance is deprecated, because maintain a modified program inside evaluator cause bug easily, please use fluid.metrics.EditDistance instead.
  % (self.__class__.__name__, self.__class__.__name__), Warning)
finish batch shuffle
W0212 02:40:36.310628 22970 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0
W0212 02:40:36.317950 22970 device_context.cc:243] device: 0, cuDNN Version: 7.6.
I0212 02:40:40.715414 22970 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 3. And the Program will be copied 3 copies
W0212 02:40:44.676672 22970 fuse_all_reduce_op_pass.cc:72] Find all_reduce operators: 43. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 21.
I0212 02:40:44.682669 22970 build_strategy.cc:363] SeqOnlyAllReduceOps:0, num_trainers:1
I0212 02:40:44.734843 22970 parallel_executor.cc:285] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0212 02:40:44.764928 22970 parallel_executor.cc:368] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py:774: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 251, in <module>
    main()
  File "train.py", line 247, in main
    train(args)
  File "train.py", line 175, in train
    results = train_one_batch(data)
  File "train.py", line 130, in train_one_batch
    feed=get_feeder_data(data, place))
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/parallel_executor.py", line 311, in run
    return_numpy=return_numpy)
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 775, in run
    six.reraise(*sys.exc_info())
  File "/root/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 770, in run
    use_program_cache=use_program_cache)
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 829, in _run_impl
    return_numpy=return_numpy)
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 669, in _run_parallel
    tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::RuntimeContext*) const
3   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const
4   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&)
5   paddle::framework::details::ScopeBufferedSSAGraphExecutor::InitVariables()
6   paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
7   paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)


Error Message Summary:

PaddleCheckError: Operator coalesce_tensor output Tensor @FUSEDVAR@@GRAD@fc_2.b_0@GRAD contains NAN at [/paddle/paddle/fluid/framework/operator.cc:845]

[root@07ba1e8482af ocr_recognition]#

dyning commented 4 years ago

My feeling is that the number of classes in the training program has not been changed. You need to modify the NUM_CLASSES setting in https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/ocr_recognition/data_reader.py. Please try changing that and report back if there are still problems. Thanks for your interest. In addition, we are also refactoring the OCR code and adding more text detection and recognition algorithms, so stay tuned.

endy-see commented 4 years ago

> My feeling is that the number of classes in the training program has not been changed. You need to modify the NUM_CLASSES setting in https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/ocr_recognition/data_reader.py. Please try changing that and report back if there are still problems. Thanks for your interest. In addition, we are also refactoring the OCR code and adding more text detection and recognition algorithms, so stay tuned.

What I modified is exactly the NUM_CLASSES parameter in https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/ocr_recognition/data_reader.py. After several rounds of experiments on the current dataset (the dictionary really does contain just 31 characters), I found that with NUM_CLASSES set to 31, training barely converges and the results on the validation and test sets are extremely poor (almost nothing is predicted correctly), while with NUM_CLASSES manually set to 95, convergence speeds up and the test set accuracy reaches 90%. Can you explain this?

chenwhql commented 4 years ago

Looking at the Op reported in the error, you can try the following change while keeping the FLAG enabled and see whether the problem still occurs:

https://github.com/PaddlePaddle/models/blob/22b8805bb5a6a6ee28fb228fc4ce7fee4791bef1/PaddleCV/ocr_recognition/train.py#L119

Change it to:

        build_strategy = fluid.BuildStrategy()
        build_strategy.fuse_all_reduce_ops = False
        train_exe = fluid.ParallelExecutor(
            use_cuda=True if args.use_gpu else False, loss_name=sum_cost.name,
            build_strategy=build_strategy)
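
(For context: the @FUSEDVAR@ gradient tensor named in the error message is created by the gradient-fusion pass that the log reports as "some all_reduce ops are fused during training", which appears to be why disabling build_strategy.fuse_all_reduce_ops is suggested here; with fusion turned off, the NaN check should point at the operator that actually produces the bad values rather than at the fused buffer.)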
endy-see commented 4 years ago

> Looking at the Op reported in the error, you can try the following change while keeping the FLAG enabled and see whether the problem still occurs:
>
> https://github.com/PaddlePaddle/models/blob/22b8805bb5a6a6ee28fb228fc4ce7fee4791bef1/PaddleCV/ocr_recognition/train.py#L119
>
> Change it to:
>
>         build_strategy = fluid.BuildStrategy()
>         build_strategy.fuse_all_reduce_ops = False
>         train_exe = fluid.ParallelExecutor(
>             use_cuda=True if args.use_gpu else False, loss_name=sum_cost.name,
>             build_strategy=build_strategy)

The main cause of the problem is that the character indices in my dictionary start from 1, while the labels used for training start from 0; setting NUM_CLASSES to the number of characters in the dictionary plus 1 fixes it.
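
A minimal sketch of that counting rule (the dictionary format, one character per line with ids assigned from 1, and the helper name are illustrative assumptions, not code from the repo):

        # Valid class indices must lie in [0, NUM_CLASSES - 1]. With dictionary ids running
        # 1..dict_size, the largest label equals dict_size, so one extra output slot is needed.
        def num_classes_from_dict(dict_path):
            with open(dict_path, encoding="utf-8") as f:
                dict_size = sum(1 for line in f if line.strip())
            return dict_size + 1  # e.g. 31 characters -> NUM_CLASSES = 32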

endy-see commented 4 years ago

> Looking at the Op reported in the error, you can try the following change while keeping the FLAG enabled and see whether the problem still occurs: https://github.com/PaddlePaddle/models/blob/22b8805bb5a6a6ee28fb228fc4ce7fee4791bef1/PaddleCV/ocr_recognition/train.py#L119
>
> Change it to:
>
>         build_strategy = fluid.BuildStrategy()
>         build_strategy.fuse_all_reduce_ops = False
>         train_exe = fluid.ParallelExecutor(
>             use_cuda=True if args.use_gpu else False, loss_name=sum_cost.name,
>             build_strategy=build_strategy)
>
> The main cause of the problem is that the character indices in my dictionary start from 1, while the labels used for training start from 0; setting NUM_CLASSES to the number of characters in the dictionary plus 1 fixes it.

After modifying it this way I have trained twice so far and the problem has not reappeared. If it recurs later I will reopen the issue. Thanks for the help!