PaddlePaddle / models


ocr_recognition: NaN during recognition model training prevents training #4261

Closed endy-see closed 4 years ago

endy-see commented 4 years ago

The error message is: NaN or Inf found in input tensor. The learning rate has already been reduced to 0.0001 and training is very slow, yet NaN still appears. How should I deal with this problem?

chenwhql commented 4 years ago

A few things to confirm with you:

  1. Is this problem easy to reproduce? We suggest adding FLAGS_check_nan_inf=True to locate which Op the NaN comes from:
    /**
    * Operator related FLAG
    * Name: FLAGS_check_nan_inf
    * Since Version: 0.13.0
    * Value Range: bool, default=false
    * Example:
    * Note: Used to debug. Checking whether operator produce NAN/INF or not.
    */
    DEFINE_bool(check_nan_inf, false,
            "Checking whether operator produce NAN/INF or not. It will be "
            "extremely slow so please use this flag wisely.");
  2. Is this model the original version from the models repo? Could you describe what changes you made, and the exact steps you take that trigger this problem?
endy-see commented 4 years ago
  1. It is easy to reproduce. Is FLAGS_check_nan_inf=True added directly in front of python train.py?
  2. I am using my own dataset, a bit over 10k samples in total, with batch_size=32. The only obvious change is the dictionary size: the dictionary contains just 31 Chinese characters. I also noticed something very strange: if num_classes is set to 95 there is no NaN, but if num_classes is set exactly to the dictionary size, NaN appears (it happens every single time, 100% reproducible).
chenwhql commented 4 years ago
> 1. It is easy to reproduce. Is FLAGS_check_nan_inf=True added directly in front of python train.py?
> 2. I am using my own dataset, a bit over 10k samples in total, with batch_size=32. The only obvious change is the dictionary size: the dictionary contains just 31 Chinese characters. I also noticed something very strange: if num_classes is set to 95 there is no NaN, but if num_classes is set exactly to the dictionary size, NaN appears (it happens every single time, 100% reproducible).

Hello, 1. Yes, we suggest you try it and see which Op fails; you can keep posting the results below (one way to enable the flag is sketched below).

  2. Give me a moment to look into this one; I am not very familiar with the model.
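
For reference, a minimal sketch of one way to enable the check (the flag name and its placement in front of python train.py come from this thread; the environment-variable variant below is an assumption about how Paddle 1.x picks up FLAGS_*):

        # Equivalent to running FLAGS_check_nan_inf=True python train.py ... on the command line.
        # Assumption (not stated in the thread): paddle.fluid reads FLAGS_* environment variables
        # when it is imported, so the variable must be set before the import.
        import os
        os.environ["FLAGS_check_nan_inf"] = "True"

        import paddle.fluid as fluid  # NaN/Inf checking is now active for every operator run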
endy-see commented 4 years ago
> > 1. It is easy to reproduce. Is FLAGS_check_nan_inf=True added directly in front of python train.py?
> > 2. I am using my own dataset, a bit over 10k samples in total, with batch_size=32. The only obvious change is the dictionary size: the dictionary contains just 31 Chinese characters. I also noticed something very strange: if num_classes is set to 95 there is no NaN, but if num_classes is set exactly to the dictionary size, NaN appears (it happens every single time, 100% reproducible).
>
> Hello, 1. Yes, we suggest you try it and see which Op fails; you can keep posting the results below.
>
> 2. Give me a moment to look into this one; I am not very familiar with the model.

After adding FLAGS_check_nan_inf, the process crashes before the iteration logs even start. The output is as follows:

[root@07ba1e8482af ocr_recognition]# sh train_cq_src_fake_without_padding_height48.sh
----------- Configuration Arguments -----------
average_window: 0.15
batch_size: 32
eval_period: 10000
gradient_clip: 10.0
init_model: None
l2decay: 0.0001
log_period: 1000
lr: 0.0001
lr_decay_strategy: None
max_average_window: 12500
min_average_window: 10000
model: crnn_ctc
momentum: 0.9
parallel: 1
profile: False
save_model_dir: output/ChongQiongCourt/
save_model_period: 10000
skip_batch_num: 0
skip_test: False
test_images: dataset/ChongCourt-src+fake-ForServer-without-padding-resize-same-height48/test_images
test_list: dataset/ChongCourt-src+fake-ForServer-without-padding-resize-same-height48/test.list
total_step: 2000000
train_images: dataset/ChongCourt-src+fake-ForServer-without-padding-resize-same-height48/train_images
train_list: dataset/ChongCourt-src+fake-ForServer-without-padding-resize-same-height48/train.list
use_gpu: True

/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/evaluator.py:72: Warning: The EditDistance is deprecated, because maintain a modified program inside evaluator cause bug easily, please use fluid.metrics.EditDistance instead.
  % (self.__class__.__name__, self.__class__.__name__), Warning)
finish batch shuffle
W0212 02:40:36.310628 22970 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0
W0212 02:40:36.317950 22970 device_context.cc:243] device: 0, cuDNN Version: 7.6.
I0212 02:40:40.715414 22970 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 3. And the Program will be copied 3 copies
W0212 02:40:44.676672 22970 fuse_all_reduce_op_pass.cc:72] Find all_reduce operators: 43. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 21.
I0212 02:40:44.682669 22970 build_strategy.cc:363] SeqOnlyAllReduceOps:0, num_trainers:1
I0212 02:40:44.734843 22970 parallel_executor.cc:285] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0212 02:40:44.764928 22970 parallel_executor.cc:368] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py:774: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 251, in <module>
    main()
  File "train.py", line 247, in main
    train(args)
  File "train.py", line 175, in train
    results = train_one_batch(data)
  File "train.py", line 130, in train_one_batch
    feed=get_feeder_data(data, place))
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/parallel_executor.py", line 311, in run
    return_numpy=return_numpy)
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 775, in run
    six.reraise(*sys.exc_info())
  File "/root/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 770, in run
    use_program_cache=use_program_cache)
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 829, in _run_impl
    return_numpy=return_numpy)
  File "/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 669, in _run_parallel
    tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::RuntimeContext*) const
3   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const
4   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&)
5   paddle::framework::details::ScopeBufferedSSAGraphExecutor::InitVariables()
6   paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
7   paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)


Error Message Summary:

PaddleCheckError: Operator coalesce_tensor output Tensor @FUSEDVAR@@GRAD@fc_2.b_0@GRAD contains NAN at [/paddle/paddle/fluid/framework/operator.cc:845]

[root@07ba1e8482af ocr_recognition]#

dyning commented 4 years ago

My feeling is that the number of classes in the training program has not been changed. You need to modify the NUM_CLASSES setting in https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/ocr_recognition/data_reader.py. Please try changing that and report back if there are still problems. Thanks for your interest. In addition, we are also refactoring the OCR code and adding more text detection and recognition algorithms, so stay tuned.

endy-see commented 4 years ago

> My feeling is that the number of classes in the training program has not been changed. You need to modify the NUM_CLASSES setting in https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/ocr_recognition/data_reader.py. Please try changing that and report back if there are still problems. Thanks for your interest. In addition, we are also refactoring the OCR code and adding more text detection and recognition algorithms, so stay tuned.

What I modified is exactly the NUM_CLASSES parameter in https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/ocr_recognition/data_reader.py. After several rounds of experiments on the current dataset (the dictionary really does contain just 31 characters), I found that with NUM_CLASSES set to 31, training barely converges and the results on the validation and test sets are extremely poor (almost nothing is predicted correctly), while with NUM_CLASSES manually set to 95, convergence speeds up and the test set accuracy reaches 90%. Can you explain this?

chenwhql commented 4 years ago

Looking at the Op reported in the error, you can try the following change while keeping the FLAG enabled and see whether the problem still occurs:

https://github.com/PaddlePaddle/models/blob/22b8805bb5a6a6ee28fb228fc4ce7fee4791bef1/PaddleCV/ocr_recognition/train.py#L119

Change it to:

        build_strategy = fluid.BuildStrategy()
        build_strategy.fuse_all_reduce_ops = False
        train_exe = fluid.ParallelExecutor(
            use_cuda=True if args.use_gpu else False, loss_name=sum_cost.name,
            build_strategy=build_strategy)
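
(For context: the @FUSEDVAR@ gradient tensor named in the error message is created by the gradient-fusion pass that the log reports as "some all_reduce ops are fused during training", which appears to be why disabling build_strategy.fuse_all_reduce_ops is suggested here; with fusion turned off, the NaN check should point at the operator that actually produces the bad values rather than at the fused buffer.)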
endy-see commented 4 years ago

> Looking at the Op reported in the error, you can try the following change while keeping the FLAG enabled and see whether the problem still occurs:
>
> https://github.com/PaddlePaddle/models/blob/22b8805bb5a6a6ee28fb228fc4ce7fee4791bef1/PaddleCV/ocr_recognition/train.py#L119
>
> Change it to:
>
>         build_strategy = fluid.BuildStrategy()
>         build_strategy.fuse_all_reduce_ops = False
>         train_exe = fluid.ParallelExecutor(
>             use_cuda=True if args.use_gpu else False, loss_name=sum_cost.name,
>             build_strategy=build_strategy)

The main cause of the problem is that the character indices in my dictionary start from 1, while the labels used for training start from 0; setting NUM_CLASSES to the number of characters in the dictionary plus 1 fixes it.
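
A minimal sketch of that counting rule (the dictionary format, one character per line with ids assigned from 1, and the helper name are illustrative assumptions, not code from the repo):

        # Valid class indices must lie in [0, NUM_CLASSES - 1]. With dictionary ids running
        # 1..dict_size, the largest label equals dict_size, so one extra output slot is needed.
        def num_classes_from_dict(dict_path):
            with open(dict_path, encoding="utf-8") as f:
                dict_size = sum(1 for line in f if line.strip())
            return dict_size + 1  # e.g. 31 characters -> NUM_CLASSES = 32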

endy-see commented 4 years ago

> Looking at the Op reported in the error, you can try the following change while keeping the FLAG enabled and see whether the problem still occurs: https://github.com/PaddlePaddle/models/blob/22b8805bb5a6a6ee28fb228fc4ce7fee4791bef1/PaddleCV/ocr_recognition/train.py#L119
>
> Change it to:
>
>         build_strategy = fluid.BuildStrategy()
>         build_strategy.fuse_all_reduce_ops = False
>         train_exe = fluid.ParallelExecutor(
>             use_cuda=True if args.use_gpu else False, loss_name=sum_cost.name,
>             build_strategy=build_strategy)
>
> The main cause of the problem is that the character indices in my dictionary start from 1, while the labels used for training start from 0; setting NUM_CLASSES to the number of characters in the dictionary plus 1 fixes it.

After modifying it this way I have trained twice so far and the problem has not reappeared. If it recurs later I will reopen the issue. Thanks for the help!