PaddlePaddle / Paddle

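PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 飞桨 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)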
http://www.paddlepaddle.org/
Apache License 2.0

squeeze_op causes an error in inference mode #21828

Closed Meiyim closed 3 years ago

Meiyim commented 4 years ago

Test environment: CUDA 9, cuDNN 7.0.3. Forward inference on an inference_model is run from C++ code. The relevant config is as follows:

  paddle::AnalysisConfig config;
  config.SetModel(FLAGS_model_dir);
  config.EnableUseGpu(100, 0);
  config.SwitchSpecifyInputNames(true);
  config.EnableCUDNN();
  config.SwitchIrOptim(true);
  config.EnableMemoryOptim();

With fluid_inference 1.6.0, the process core-dumps immediately with no error message. With fluid_inference develop, the version information is as follows:

GIT COMMIT ID: 0fe16539ef3651966080d5ae96850da4557751e0
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: ON
CUDA version: 9.0
CUDNN version: v7

The run log is as follows:

--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [cudnn_placement_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass]
--- Running IR pass [fc_fuse_pass]
I1219 10:18:14.103123 63015 graph_pattern_detector.cc:101] ---  detected 12 subgraphs
I1219 10:18:14.138665 63015 graph_pattern_detector.cc:101] ---  detected 62 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
I1219 10:18:14.181339 63015 graph_pattern_detector.cc:101] ---  detected 24 subgraphs
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1219 10:18:14.225345 63015 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I1219 10:18:14.448055 63015 memory_optimize_pass.cc:223] Cluster name : expand_1.tmp_0  size: 1572864
I1219 10:18:14.448089 63015 memory_optimize_pass.cc:223] Cluster name : cast_6.tmp_0  size: 786432
I1219 10:18:14.448096 63015 memory_optimize_pass.cc:223] Cluster name : where_0.tmp_0  size: 16
I1219 10:18:14.448108 63015 memory_optimize_pass.cc:223] Cluster name : fc_25.tmp_1  size: 3072
I1219 10:18:14.448114 63015 memory_optimize_pass.cc:223] Cluster name : layer_norm_4.tmp_2  size: 3072
I1219 10:18:14.448122 63015 memory_optimize_pass.cc:223] Cluster name : scatter_nd_add_22.tmp_0  size: 3072
I1219 10:18:14.448128 63015 memory_optimize_pass.cc:223] Cluster name : scatter_nd_add_23.tmp_0  size: 3072
I1219 10:18:14.448134 63015 memory_optimize_pass.cc:223] Cluster name : layer_norm_14.tmp_2  size: 3072
I1219 10:18:14.448140 63015 memory_optimize_pass.cc:223] Cluster name : shape_1.tmp_0  size: 12
I1219 10:18:14.448146 63015 memory_optimize_pass.cc:223] Cluster name : layer_norm_0.tmp_2  size: 393216
I1219 10:18:14.448158 63015 memory_optimize_pass.cc:223] Cluster name : eval_placeholder_1  size: 1024
--- Running analysis [ir_graph_to_program_pass]
I1219 10:18:14.516865 63015 analysis_predictor.cc:471] ======= optimize end =======
W1219 10:18:15.083250 63015 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.2, Runtime API Version: 9.0
W1219 10:18:15.088192 63015 device_context.cc:244] device: 0, cuDNN Version: 7.3.
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
  what():

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::CUDADeviceContext::Wait() const
3   paddle::framework::TransDataDevice(paddle::framework::Tensor const&, paddle::platform::Place const&, paddle::framework::Tensor*)
4   paddle::framework::TransformData(paddle::framework::OpKernelType const&, paddle::framework::OpKernelType const&, paddle::framework::Tensor const&, paddle::framework::Tensor*)
5   paddle::framework::OperatorWithKernel::PrepareData(paddle::framework::Scope const&, paddle::framework::OpKernelType const&, std::vector<std::string, std::allocator<std::string> >*, paddle::framework::RuntimeContext*) const
6   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
7   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
8   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
9   paddle::framework::NaiveExecutor::Run()
10  paddle::AnalysisPredictor::Run(std::vector<paddle::PaddleTensor, std::allocator<paddle::PaddleTensor> > const&, std::vector<paddle::PaddleTensor, std::allocator<paddle::PaddleTensor> >*, int)

------------------------------------------
Python Call Stacks (More useful to users):
------------------------------------------
  File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/app/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2488, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/app/lib/python3.6/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/app/lib/python3.6/site-packages/paddle/fluid/layers/nn.py", line 9105, in squeeze
    "XShape": x_shape})
  File "/home/work/chenxuyi/gitlab/paddle-models/model/transformer_encoder.py", line 372, in encoder
    pad_idx = L.where(L.cast(L.squeeze(input_mask, axes=[2]), 'bool'))
  File "/home/work/chenxuyi/gitlab/paddle-models/model/ernie.py", line 187, in _build_model
    name='encoder')
  File "/home/work/chenxuyi/gitlab/paddle-models/model/ernie.py", line 124, in __init__
    input_mask)
  File "./ernie/xnli.py", line 57, in forward
    use_fp16=self.hparam['use_fp16']
  File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 147, in _model_fn
    pred = model.forward(fea)
  File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 83, in _build_net
    features=features, mode=mode, params=params, run_config=run_config)
  File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 230, in _build_for_eval
    self.params, self.run_config)
  File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 482, in __init__
    0])  #eval_datasets must have same output shapes
  File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 530, in train_and_eval
    train_hooks.append(_EvalHookOnTrainLoop())
  File "./ernie/xnli.py", line 221, in <module>
    exporters=[best_exporter])

----------------------
Error Message Summary:
----------------------
FatalError: cudaStreamSynchronize raises error: unspecified launch failure, errono: 4: unspecified launch failure at (/work/paddle/fluid/platform/device_context.cc:330)
  [operator < squeeze2 > error]
./gpu.sh: line 13: 63015 Aborted                 (core dumped) ./build/inference --logtostderr --model_dir $2 --data $1 --repeat 1 --output_prediction true --use_gpu true --device 0
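
When triaging a log like the one above, it can help to pull the error type, source location, and operator name out of the "Error Message Summary" block programmatically. The following is a minimal sketch that assumes the summary has exactly the shape shown in this log; `parse_error_summary` and `SAMPLE_LOG` are hypothetical names for illustration and are not part of Paddle.

```python
import re

# Copied from the "Error Message Summary" block in the log above.
SAMPLE_LOG = """\
----------------------
Error Message Summary:
----------------------
FatalError: cudaStreamSynchronize raises error: unspecified launch failure, errono: 4: unspecified launch failure at (/work/paddle/fluid/platform/device_context.cc:330)
  [operator < squeeze2 > error]
"""


def parse_error_summary(log_text):
    """Extract error type, source file/line, and operator name from a log.

    Hypothetical helper: fields that cannot be found are returned as None.
    """
    # e.g. "FatalError: ... at (/path/to/file.cc:330)"
    err = re.search(r"^(\w+Error): .* at \((.+?):(\d+)\)\s*$", log_text, re.MULTILINE)
    # e.g. "[operator < squeeze2 > error]"
    op = re.search(r"\[operator < (\w+) > error\]", log_text)
    return {
        "error_type": err.group(1) if err else None,
        "file": err.group(2) if err else None,
        "line": int(err.group(3)) if err else None,
        "operator": op.group(1) if op else None,
    }
```

Calling `parse_error_summary(SAMPLE_LOG)` yields `squeeze2` as the operator and `device_context.cc:330` as the location. Note that the C++ stack shows the error surfacing in `CUDADeviceContext::Wait()` while input data for `squeeze2` was being prepared; because CUDA kernel launches are asynchronous, the kernel that actually failed may have been launched by an earlier operator on the same stream.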

An excerpt of the network-building code is pasted below:

    d_shape = L.shape(L.cast(enc_input, 'float32'))
    input_hidden_dim = enc_input.shape[-1]
    pad_idx = L.where(L.cast(L.squeeze(input_mask, axes=[2]), 'bool'))  # !!! the squeeze call that the Python call stack points to
    attn_bias = L.matmul(input_mask, input_mask, transpose_y=True) 
    attn_bias = (1. - attn_bias) * -10000.
    attn_bias = L.unsqueeze(attn_bias, axes=[1])
    attn_bias = L.expand(attn_bias, [1, n_head, 1, 1]) 
    if attn_bias.dtype != enc_input.dtype:
        attn_bias = L.cast(attn_bias, enc_input.dtype)
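
To make the excerpt easier to reason about without a GPU, here is a plain-Python sketch of what the mask-related lines compute. It assumes `input_mask` has shape [batch, seq_len, 1] with 1.0 for real tokens and 0.0 for padding; that shape is inferred from `axes=[2]` and is not stated in the issue. The function names are hypothetical, and the code mirrors only the semantics of `squeeze`, `where`, `matmul(..., transpose_y=True)`, `unsqueeze` and `expand`, not Paddle's implementation.

```python
def squeeze_last(mask):
    """[B, T, 1] -> [B, T], like L.squeeze(input_mask, axes=[2])."""
    return [[row[0] for row in seq] for seq in mask]


def where_nonzero(mask2d):
    """Coordinates [b, t] of true entries, like L.where(L.cast(x, 'bool'))."""
    return [
        [b, t]
        for b, seq in enumerate(mask2d)
        for t, value in enumerate(seq)
        if bool(value)
    ]


def attn_bias(mask, n_head):
    """[B, T, 1] -> [B, n_head, T, T].

    Entries are zero where both positions are real tokens and -10000
    where at least one of them is padding.
    """
    out = []
    for seq in mask:
        flat = [row[0] for row in seq]
        # matmul(mask, mask^T) on a [T, 1] matrix is an outer product.
        bias = [[(1.0 - a * b) * -10000.0 for b in flat] for a in flat]
        # unsqueeze(axes=[1]) followed by expand([1, n_head, 1, 1])
        # repeats the same [T, T] matrix once per attention head.
        out.append([[r[:] for r in bias] for _ in range(n_head)])
    return out
```

For a single sequence with two real tokens and one padding position, `input_mask = [[[1.0], [1.0], [0.0]]]`, `where_nonzero(squeeze_last(input_mask))` gives the coordinates of the two real tokens, and `attn_bias(input_mask, 2)` has shape [1, 2, 3, 3].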

danleifeng commented 4 years ago

This has been handed over to colleagues on the inference team to follow up.

jiweibo commented 4 years ago

Could you provide the test environment (code and model)? We would like to try reproducing the problem first.

paddle-bot-old[bot] commented 3 years ago

Since you haven't replied for more than a year, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.