A few points I'd like to confirm with you:
/**
* Operator related FLAG
* Name: FLAGS_check_nan_inf
* Since Version: 0.13.0
* Value Range: bool, default=false
* Example:
* Note: Used to debug. Checking whether operator produce NAN/INF or not.
*/
DEFINE_bool(check_nan_inf, false,
"Checking whether operator produce NAN/INF or not. It will be "
"extremely slow so please use this flag wisely.");
- It's easy to reproduce. To confirm: is FLAGS_check_nan_inf=True simply prepended to python train.py?
- I'm using my own dataset, about 10k+ samples in total, with batch_size=32. The only notable change is the dictionary size: the dictionary contains just 31 Chinese characters. I also noticed something very strange: if num_classes is set to 95 there is no NaN, but if num_classes is set exactly to the dictionary size, NaN appears every time (100% reproducible).
Hello. 1. Yes, please give it a try, see which Op produces the error, and post the result back here.
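For reference, a minimal sketch of how the flag can be turned on, assuming (as is usual for Fluid GFlags, e.g. FLAGS_eager_delete_tensor_gb in the log below) that FLAGS_* values are read from the environment before paddle.fluid is first imported:

# bash: prepend the flag to the training command
#   FLAGS_check_nan_inf=1 python train.py

# or set it from Python before the first paddle import
import os
os.environ["FLAGS_check_nan_inf"] = "1"  # must be set before paddle.fluid is imported

import paddle.fluid as fluid  # the flag is picked up when the framework initializes

With the check enabled, the first operator whose output contains NaN/Inf aborts with a PaddleCheckError naming that operator, as in the log further down.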
- Please give me a moment to look into this first; I'm not very familiar with the model.
/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/evaluator.py:72: Warning: The EditDistance is deprecated, because maintain a modified program inside evaluator cause bug easily, please use fluid.metrics.EditDistance instead.
% (self.__class__.__name__, self.__class__.__name__), Warning)
finish batch shuffle
W0212 02:40:36.310628 22970 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0
W0212 02:40:36.317950 22970 device_context.cc:243] device: 0, cuDNN Version: 7.6.
I0212 02:40:40.715414 22970 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 3. And the Program will be copied 3 copies
W0212 02:40:44.676672 22970 fuse_all_reduce_op_pass.cc:72] Find all_reduce operators: 43. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 21.
I0212 02:40:44.682669 22970 build_strategy.cc:363] SeqOnlyAllReduceOps:0, num_trainers:1
I0212 02:40:44.734843 22970 parallel_executor.cc:285] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0212 02:40:44.764928 22970 parallel_executor.cc:368] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
/root/miniconda3/lib/python3.6/site-packages/paddle/fluid/executor.py:774: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "train.py", line 251, in
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void> const&, paddle::framework::RuntimeContext*) const
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void> const&) const
4 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void_> const&)
5 paddle::framework::details::ScopeBufferedSSAGraphExecutor::InitVariables()
6 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator
PaddleCheckError: Operator coalesce_tensor output Tensor @FUSEDVAR@@GRAD@fc_2.b_0@GRAD contains NAN at [/paddle/paddle/fluid/framework/operator.cc:845]
I suspect the number of classes in the training script hasn't been updated. You need to modify the NUM_CLASSES setting in https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/ocr_recognition/data_reader.py. Please try changing that and report back if the problem persists. Thanks for your interest; we are also refactoring the OCR code and adding more text detection and recognition algorithms, so stay tuned.
What I modified is exactly the NUM_CLASSES parameter in https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/ocr_recognition/data_reader.py. After several rounds of experiments on my dataset (the dictionary really does contain 31 characters), I found that with NUM_CLASSES set to 31, training barely converges and the validation/test results are extremely poor (almost no correct predictions), whereas with NUM_CLASSES manually set to 95, convergence speeds up and test accuracy reaches 90%. Can you explain this?
Judging from the Op in the error message, try the following change (keeping the FLAG on) and see whether the problem still occurs: https://github.com/PaddlePaddle/models/blob/22b8805bb5a6a6ee28fb228fc4ce7fee4791bef1/PaddleCV/ocr_recognition/train.py#L119
Change it to:
build_strategy = fluid.BuildStrategy()
build_strategy.fuse_all_reduce_ops = False
train_exe = fluid.ParallelExecutor(
    use_cuda=True if args.use_gpu else False, loss_name=sum_cost.name,
    build_strategy=build_strategy)
The root cause turned out to be that the character indices in my dictionary start from 1, while the labels used for training start from 0. Setting NUM_CLASSES to the number of characters in the dictionary + 1 fixes it.
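In other words, a minimal sketch of the off-by-one (the tiny dictionary below is illustrative, not the real label file):

# Dictionary indices start at 1, so with 31 characters the largest label value is 31.
chars = ["人", "大", "中"]                      # illustrative; the real dictionary has 31 entries
char_dict = {c: i + 1 for i, c in enumerate(chars)}

num_chars = len(char_dict)            # 31 in the real setup
max_label = max(char_dict.values())   # equals num_chars, because indexing starts at 1

# The classifier's outputs are indexed 0 .. NUM_CLASSES - 1, so it must cover max_label.
NUM_CLASSES = num_chars + 1           # 32 in the real setup
assert max_label < NUM_CLASSES        # fails if NUM_CLASSES is set to num_chars (31)

Labels that reach or exceed NUM_CLASSES index past the end of the output layer, which is consistent with the NaN gradients seen above; any sufficiently large value (such as 95) hides the mismatch, which is why that setting happened to train.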
After making this change I've now run training twice and the problem hasn't reappeared. If it shows up again I'll reopen the issue. Thanks for the help!
The error message is: NaN or Inf found in input tensor. The learning rate has already been lowered to 0.0001 and training is very slow, but NaN still appears. How should I handle this problem?