PaddlePaddle / Paddle-Inference-Demo

Apache License 2.0

Building and running ppyoloe_crn_l with Paddle Inference 2.5 on Windows 10 fails with the following error. How can this be solved? #519

Open dict1234 opened 4 months ago

dict1234 commented 4 months ago

ppyoloe_crn_l.exe --model_file ppyoloe_crn_l_300e_coco/model.pdmodel --params_file ppyoloe_crn_l_300e_coco/model.pdiparams

D:\000-AI\paddle\Deploy\2.5\Paddle-Inference-Demo-master\c++\gpu\ppyoloe_crn_l\build\Release>ppyoloe_crn_l.exe --model_file ppyoloe_crn_l_300e_coco/model.pdmodel --params_file ppyoloe_crn_l_300e_coco/model.pdiparams
--- Running analysis [ir_graph_build_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0505 13:45:31.034536 5320 executor.cc:187] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running IR pass [map_op_to_another_pass]
--- Running IR pass [identity_scale_op_clean_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [delete_quant_dequant_linear_op_pass]
--- Running IR pass [delete_weight_dequant_linear_op_pass]
--- Running IR pass [constant_folding_pass]
--- Running IR pass [silu_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
I0505 13:45:31.640825 5320 fuse_pass_base.cc:59] --- detected 78 subgraphs
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [vit_attention_fuse_pass]
--- Running IR pass [fused_multi_transformer_encoder_pass]
--- Running IR pass [fused_multi_transformer_decoder_pass]
--- Running IR pass [fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [fuse_multi_transformer_layer_pass]
--- Running IR pass [gpu_cpu_squeeze2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_reshape2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_flatten2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_mul_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_matmul_pass]
--- Running IR pass [matmul_scale_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v3]
--- Running IR pass [gpu_cpu_map_matmul_to_mul_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
I0505 13:45:34.308128 5320 fuse_pass_base.cc:59] --- detected 9 subgraphs
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
I0505 13:45:34.541483 5320 fuse_pass_base.cc:59] --- detected 118 subgraphs
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [conv2d_fusion_layout_transfer_pass]
--- Running IR pass [transfer_layout_elim_pass]
--- Running IR pass [auto_mixed_precision_pass]
--- Running IR pass [inplace_op_var_pass]
--- Running analysis [save_optimized_model_pass]
W0505 13:45:34.565424 5320 save_optimized_model_pass.cc:28] save_optim_cache_model is turned off, skip save_optimized_model_pass
--- Running analysis [ir_params_sync_among_devices_pass]
I0505 13:45:34.566417 5320 ir_params_sync_among_devices_pass.cc:51] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I0505 13:45:34.726987 5320 memory_optimize_pass.cc:222] Cluster name : tmp_2 size: 26214400
I0505 13:45:34.726987 5320 memory_optimize_pass.cc:222] Cluster name : batch_norm_2.tmp_2 size: 26214400
I0505 13:45:34.726987 5320 memory_optimize_pass.cc:222] Cluster name : image size: 4915200
I0505 13:45:34.727985 5320 memory_optimize_pass.cc:222] Cluster name : sigmoid_2.tmp_0 size: 26214400
I0505 13:45:34.727985 5320 memory_optimize_pass.cc:222] Cluster name : batch_norm_48.tmp_2 size: 1228800
I0505 13:45:34.727985 5320 memory_optimize_pass.cc:222] Cluster name : tmp_0 size: 13107200
I0505 13:45:34.728982 5320 memory_optimize_pass.cc:222] Cluster name : elementwise_add_0 size: 4915200
I0505 13:45:34.728982 5320 memory_optimize_pass.cc:222] Cluster name : tmp_7 size: 4915200
I0505 13:45:34.728982 5320 memory_optimize_pass.cc:222] Cluster name : elementwise_add_16 size: 614400
I0505 13:45:34.728982 5320 memory_optimize_pass.cc:222] Cluster name : pool2d_5.tmp_0 size: 768
I0505 13:45:34.729979 5320 memory_optimize_pass.cc:222] Cluster name : scale_factor size: 8
I0505 13:45:34.729979 5320 memory_optimize_pass.cc:222] Cluster name : shape_2.tmp_0_slice_0 size: 4
--- Running analysis [ir_graph_to_program_pass]
I0505 13:45:34.970336 5320 analysis_predictor.cc:1660] ======= optimize end =======
I0505 13:45:34.971334 5320 naive_executor.cc:164] --- skip [feed], feed -> scale_factor
I0505 13:45:34.973328 5320 naive_executor.cc:164] --- skip [feed], feed -> image
I0505 13:45:34.984299 5320 naive_executor.cc:164] --- skip [gather_nd_0.tmp_0], fetch -> fetch
I0505 13:45:34.984299 5320 naive_executor.cc:164] --- skip [multiclass_nms3_0.tmp_2], fetch -> fetch
W0505 13:45:34.987293 5320 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.2, Runtime API Version: 11.8
W0505 13:45:34.991281 5320 gpu_resources.cc:149] device: 0, cuDNN Version: 8.6.


C++ Traceback (most recent call last):

Not support stack backtrace yet.


Error Message Summary:

InvalidArgumentError: The axis is expected to be in range of [-1, 1), but got 1 [Hint: Expected axis_value >= -rank && axis_value < rank == true, but received axis_value >= -rank && axis_value < rank:0 != true:1.] (at ..\paddle\phi\infermeta\unary.cc:3567)

kangguangli commented 4 months ago

Hi, thanks for the report. The error above does not reveal the cause by itself. You can set the following flag: export FLAGS_call_stack_level=2. This makes Paddle print the C++ call stack on error, which gives us more information.
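For reference, the flag is an environment variable read at process start, so it must be set in the shell before launching the binary (a minimal sketch; on Windows cmd the equivalent is `set`, with no spaces around `=`):

```shell
# Linux / macOS: export the flag before running the demo binary
export FLAGS_call_stack_level=2
echo "$FLAGS_call_stack_level"

# Windows cmd equivalent (shown as a comment here):
#   set FLAGS_call_stack_level=2
```

Note that `set FLAGS_call_stack_level = 2` (with spaces) on Windows would define a variable with a trailing space in its name, so the flag would not take effect.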

You can also follow https://www.paddlepaddle.org.cn/inference/master/guides/performance_tuning/precision_tracing.html to check whether the problem comes from a specific pass.

Finally, you can try switching to version 2.6 and check whether the problem still exists.

If you try any of the above, please make sure to paste the results here; it will be very helpful for our follow-up analysis.

dict1234 commented 4 months ago

On Windows 10, set in cmd: set FLAGS_call_stack_level=2. Configuration: paddle_inference 2.6, CUDA 11.8, cuDNN 8.6, TensorRT 8.5.

Code:

// Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include <chrono>
#include <iostream>
#include <memory>
#include <numeric>
#include <vector>

#include <gflags/gflags.h>
#include <glog/logging.h>

#include "paddle_inference_api.h"

using paddle_infer::Config;
using paddle_infer::CreatePredictor;
using paddle_infer::PrecisionType;
using paddle_infer::Predictor;

DEFINE_string(model_dir, "", "Directory of the inference model.");
DEFINE_string(model_file, "", "Path of the inference model file.");
DEFINE_string(params_file, "", "Path of the inference params file.");
DEFINE_string(run_mode, "paddle_gpu", "run_mode which can be: trt_fp32, trt_fp16, trt_int8 and paddle_gpu");
DEFINE_int32(batch_size, 1, "Batch size.");
DEFINE_int32(gpu_id, 0, "GPU card ID num.");
DEFINE_int32(trt_min_subgraph_size, 3, "tensorrt min_subgraph_size");
DEFINE_int32(warmup, 50, "warmup");
DEFINE_int32(repeats, 1000, "repeats");
DEFINE_bool(use_dynamic_shape, false, "use trt dynamic shape.");
DEFINE_bool(use_calib, true, "use trt int8 calibration.");
DEFINE_bool(use_collect_shape, false, "Collect trt shape information");
DEFINE_string(dynamic_shape_file, "", "trt shape information name");

using Time = decltype(std::chrono::high_resolution_clock::now());
Time time() { return std::chrono::high_resolution_clock::now(); }
double time_diff(Time t1, Time t2) {
  typedef std::chrono::microseconds ms;
  auto diff = t2 - t1;
  ms counter = std::chrono::duration_cast<ms>(diff);
  return counter.count() / 1000.0;
}

std::shared_ptr<Predictor> InitPredictor() {
  Config config;
  if (FLAGS_model_dir != "") {
    config.SetModel(FLAGS_model_dir);
  }
  config.SetModel(FLAGS_model_file, FLAGS_params_file);

config.EnableUseGpu(500, FLAGS_gpu_id);

if (FLAGS_run_mode == "trt_fp32")
{
    config.EnableTensorRtEngine(1 << 30 * FLAGS_batch_size,
        FLAGS_batch_size,
        FLAGS_trt_min_subgraph_size,
        PrecisionType::kFloat32,
        false,
        false);
}
else if (FLAGS_run_mode == "trt_fp16") 
{
    config.EnableTensorRtEngine(1 << 30 * FLAGS_batch_size,
        FLAGS_batch_size,
        FLAGS_trt_min_subgraph_size,
        PrecisionType::kHalf,
        false,
        false);
}
else if (FLAGS_run_mode == "trt_int8")
{
    config.EnableTensorRtEngine(1 << 30 * FLAGS_batch_size,
        FLAGS_batch_size,
        FLAGS_trt_min_subgraph_size,
        PrecisionType::kInt8,
        false,
        FLAGS_use_calib);
}

if (FLAGS_use_dynamic_shape && FLAGS_use_collect_shape)
{
    config.CollectShapeRangeInfo(FLAGS_dynamic_shape_file);
}
else if (FLAGS_use_dynamic_shape && !FLAGS_use_collect_shape)
{
    config.EnableTunedTensorRtDynamicShape(FLAGS_dynamic_shape_file);
}
// Open the memory optim.
config.EnableMemoryOptim();
config.SwitchIrOptim(true);
return CreatePredictor(config);

}

void run(Predictor *predictor, const std::vector<float> &input,
         const std::vector<int> &input_shape, std::vector<float> *out_data) {
int input_num = std::accumulate(input_shape.begin(), input_shape.end(), 1, std::multiplies<int>());

auto input_names = predictor->GetInputNames();
auto output_names = predictor->GetOutputNames();
auto input_t = predictor->GetInputHandle(input_names[0]);
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "[run]获得句柄...";
input_t->Reshape(input_shape);
input_t->CopyFromCpu(input.data());
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "[run]FLAGS_warmup...";
for (size_t i = 0; i < FLAGS_warmup; ++i)
{
    CHECK(predictor->Run());
}
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "[run]开始循环预测...";
auto st = time();
for (size_t i = 0; i < FLAGS_repeats; ++i)
{
    LOG(INFO) << "[run]..." << i;
    CHECK(predictor->Run());
    auto output_t = predictor->GetOutputHandle(output_names[0]);
    std::vector<int> output_shape = output_t->shape();
    int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies<int>());
    out_data->resize(out_num);
    output_t->CopyToCpu(out_data->data());
    LOG(INFO) << "[run] out_data:" << out_data->size();
}
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "run avg time is " << time_diff(st, time()) / FLAGS_repeats << " ms";

}

int main(int argc, char *argv[]) {
  google::ParseCommandLineFlags(&argc, &argv, true);
  LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "初始化预测器...";
  auto predictor = InitPredictor();
  LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "初始化图像数据格式...";
  std::vector<int> input_shape = { FLAGS_batch_size, 3, 640, 640 };
  std::vector<float> input_data(FLAGS_batch_size * 3 * 640 * 640);
  LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "初始化图像数据...";
  for (size_t i = 0; i < input_data.size(); ++i)
    input_data[i] = i % 255 * 0.1;
  std::vector<float> out_data;
  LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "开始检测...";
  run(predictor.get(), input_data, input_shape, &out_data);
  LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "检测完成..." << "out_data:" << out_data.size();

return 0;

}

Problem:

D:\000-AI\paddle\Deploy\2.6\Paddle-Inference-Demo-master\c++\gpu\ppyoloe_crn_l\build\Release>ppyoloe_crn_l.exe --model_file ppyoloe_crn_l_300e_coco/model.pdmodel --params_file ppyoloe_crn_l_300e_coco/model.pdiparams
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0507 08:33:49.911533 8644 ppyoloe_crn_l.cc:142] [main:142]初始化预测器...
--- Running analysis [ir_graph_build_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0507 08:33:51.262887 8644 executor.cc:187] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running IR pass [map_op_to_another_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [delete_quant_dequant_linear_op_pass]
--- Running IR pass [delete_weight_dequant_linear_op_pass]
--- Running IR pass [constant_folding_pass]
I0507 08:33:51.661852 8644 fuse_pass_base.cc:59] --- detected 13 subgraphs
--- Running IR pass [silu_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
I0507 08:33:51.857331 8644 fuse_pass_base.cc:59] --- detected 78 subgraphs
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [vit_attention_fuse_pass]
--- Running IR pass [fused_multi_transformer_encoder_pass]
--- Running IR pass [fused_multi_transformer_decoder_pass]
--- Running IR pass [fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [fuse_multi_transformer_layer_pass]
--- Running IR pass [gpu_cpu_squeeze2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_reshape2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_flatten2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_mul_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_matmul_pass]
--- Running IR pass [matmul_scale_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v3]
--- Running IR pass [gpu_cpu_map_matmul_to_mul_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
I0507 08:33:54.501273 8644 fuse_pass_base.cc:59] --- detected 9 subgraphs
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
I0507 08:33:54.719719 8644 fuse_pass_base.cc:59] --- detected 118 subgraphs
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [fused_conv2d_add_act_layout_transfer_pass]
--- Running IR pass [transfer_layout_elim_pass]
I0507 08:33:54.738648 8644 transfer_layout_elim_pass.cc:346] move down 0 transfer_layout
I0507 08:33:54.738648 8644 transfer_layout_elim_pass.cc:347] eliminate 0 pair of transfer_layout
--- Running IR pass [auto_mixed_precision_pass]
--- Running IR pass [identity_op_clean_pass]
--- Running IR pass [inplace_op_var_pass]
I0507 08:33:54.753597 8644 fuse_pass_base.cc:59] --- detected 11 subgraphs
--- Running analysis [save_optimized_model_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I0507 08:33:54.756589 8644 ir_params_sync_among_devices_pass.cc:53] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I0507 08:33:54.897214 8644 memory_optimize_pass.cc:118] The persistable params in main graph are : 199.204MB
I0507 08:33:54.906189 8644 memory_optimize_pass.cc:246] Cluster name : tmp_2 size: 26214400
I0507 08:33:54.906189 8644 memory_optimize_pass.cc:246] Cluster name : batch_norm_2.tmp_2 size: 26214400
I0507 08:33:54.906189 8644 memory_optimize_pass.cc:246] Cluster name : tmp_68 size: 1228800
I0507 08:33:54.906189 8644 memory_optimize_pass.cc:246] Cluster name : image size: 4915200
I0507 08:33:54.907187 8644 memory_optimize_pass.cc:246] Cluster name : sigmoid_2.tmp_0 size: 26214400
I0507 08:33:54.907187 8644 memory_optimize_pass.cc:246] Cluster name : tmp_0 size: 13107200
I0507 08:33:54.907187 8644 memory_optimize_pass.cc:246] Cluster name : elementwise_add_0 size: 4915200
I0507 08:33:54.908183 8644 memory_optimize_pass.cc:246] Cluster name : tmp_7 size: 4915200
I0507 08:33:54.909183 8644 memory_optimize_pass.cc:246] Cluster name : scale_factor size: 8
I0507 08:33:54.910179 8644 memory_optimize_pass.cc:246] Cluster name : pool2d_1.tmp_0 size: 614400
I0507 08:33:54.918157 8644 memory_optimize_pass.cc:246] Cluster name : pool2d_5.tmp_0 size: 768
I0507 08:33:54.919155 8644 memory_optimize_pass.cc:246] Cluster name : shape_2.tmp_0_slice_0 size: 4
--- Running analysis [ir_graph_to_program_pass]
I0507 08:33:55.138567 8644 analysis_predictor.cc:1838] ======= optimize end =======
I0507 08:33:55.139565 8644 naive_executor.cc:200] --- skip [feed], feed -> scale_factor
I0507 08:33:55.139565 8644 naive_executor.cc:200] --- skip [feed], feed -> image
I0507 08:33:55.150537 8644 naive_executor.cc:200] --- skip [gather_nd_0.tmp_0], fetch -> fetch
I0507 08:33:55.150537 8644 naive_executor.cc:200] --- skip [multiclass_nms3_0.tmp_2], fetch -> fetch
I0507 08:33:55.151533 8644 ppyoloe_crn_l.cc:144] [main:144]初始化图像数据格式...
I0507 08:33:55.152530 8644 ppyoloe_crn_l.cc:147] [main:147]初始化图像数据...
I0507 08:33:55.154525 8644 ppyoloe_crn_l.cc:150] [main:150]开始检测...
I0507 08:33:55.154525 8644 ppyoloe_crn_l.cc:115] [run:115][run]获得句柄...
W0507 08:33:55.154525 8644 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.2, Runtime API Version: 11.8
W0507 08:33:55.157517 8644 gpu_resources.cc:164] device: 0, cuDNN Version: 8.6.
I0507 08:33:55.158514 8644 ppyoloe_crn_l.cc:118] [run:118][run]FLAGS_warmup...


C++ Traceback (most recent call last):

Not support stack backtrace yet.


Error Message Summary:

InvalidArgumentError: The axis is expected to be in range of [-1, 1), but got 1 [Hint: Expected axis_value >= -rank && axis_value < rank == true, but received axis_value >= -rank && axis_value < rank:0 != true:1.] (at ..\paddle\phi\infermeta\unary.cc:3814)

kangguangli commented 4 months ago

Hi, this now looks like an internal problem: one of the model inputs is never initialized (note that the logs show two feeds, image and scale_factor, while the demo code above only feeds the first input). We will submit a PR to fix it as soon as possible and follow up in that PR once it is merged.

kangguangli commented 4 months ago

@dict1234 #520 should fix this; please pull the latest code and try again.

lizexu123 commented 4 months ago

This has been fixed.

dict1234 commented 4 months ago

Hi, I also debugged this yesterday. Going through the settings one by one, I found it is caused by the run_mode setting: the default is paddle_gpu, and changing it to one of the trt_xxx modes makes it work.

dict1234 commented 4 months ago

Another question: with TensorRT enabled, loading the model takes far too long. Is there a way to get it done within seconds, or at most tens of seconds?

lizexu123 commented 4 months ago

After the fix, the native GPU path should also work correctly; it was indeed a bug.


kangguangli commented 4 months ago

Hi, I also debugged this yesterday. Going through the settings one by one, I found it is caused by the run_mode setting: the default is paddle_gpu, and changing it to one of the trt_xxx modes makes it work.

In principle every run_mode should work. We have now fixed the problem on the native GPU path, so please give native GPU a try. The long load time with TRT is probably related to TRT's graph-optimization process; could you share how long loading currently takes for you?

dict1234 commented 4 months ago

W0508 15:06:45.928721 8180 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0508 15:06:45.929718 8180 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
W0508 15:06:45.929718 8180 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0508 15:06:45.929718 8180 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
W0508 15:06:45.929718 8180 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0508 15:06:45.929718 8180 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
I0508 15:06:45.944679 8180 fuse_pass_base.cc:59] --- detected 78 subgraphs
--- Running IR pass [remove_padding_recover_padding_pass]
--- Running IR pass [delete_remove_padding_recover_padding_pass]
--- Running IR pass [dense_fc_to_sparse_pass]
--- Running IR pass [dense_multihead_matmul_to_sparse_pass]
--- Running IR pass [tensorrt_subgraph_pass]
I0508 15:06:45.996539 8180 tensorrt_subgraph_pass.cc:302] --- detect a sub-graph with 387 nodes
I0508 15:06:46.060369 8180 tensorrt_subgraph_pass.cc:846] Prepare TRT engine (Optimize model structure, Select OP kernel etc). This process may cost a lot of time.
W0508 15:06:48.557268 8180 helper.h:127] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
W0508 15:06:48.557268 8180 helper.h:127] The implicit batch dimension mode has been deprecated. Please create the network with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag whenever possible.
W0508 15:06:48.560261 8180 place.cc:161] The paddle::PlaceType::kCPU/kGPU is deprecated since version 2.3, and will be removed in version 2.4! Please use Tensor::is_cpu()/is_gpu() method to determine the type of place.
W0508 15:06:48.563252 8180 helper.h:127] Tensor DataType is determined at build time for tensors not marked as input or output.
W0508 15:06:48.570233 8180 helper.h:127] Tensor DataType is determined at build time for tensors not marked as input or output.
W0508 15:06:48.581204 8180 helper.h:127] Tensor DataType is determined at build time for tensors not marked as input or output.
W0508 15:06:48.591177 8180 helper.h:127] Tensor DataType is determined at build time for tensors not marked as input or output.
I0508 15:06:48.701881 8180 engine.cc:215] Run Paddle-TRT FP16 mode
W0508 15:09:45.615533 8180 helper.h:127] TensorRT encountered issues when converting weights between types and that could affect accuracy.
W0508 15:09:45.615533 8180 helper.h:127] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
W0508 15:09:45.616530 8180 helper.h:127] Check verbose logs for the list of affected weights.
W0508 15:09:45.616530 8180 helper.h:127] - 138 weights are affected by this issue: Detected subnormal FP16 values.
W0508 15:09:45.616530 8180 helper.h:127] - 63 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [auto_mixed_precision_pass]
--- Running analysis [save_optimized_model_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I0508 15:09:45.832952 8180 ir_params_sync_among_devices_pass.cc:53] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I0508 15:09:45.852897 8180 memory_optimize_pass.cc:118] The persistable params in main graph are : 199.14MB
I0508 15:09:45.854892 8180 memory_optimize_pass.cc:246] Cluster name : multiclass_nms3_0.tmp_1 size: 4
--- Running analysis [ir_graph_to_program_pass]

The two key lines in the middle:
I0508 15:06:48.701881 8180 engine.cc:215] Run Paddle-TRT FP16 mode
W0508 15:09:45.615533 8180 helper.h:127] TensorRT encountered issues when converting weights between types and that could affect accuracy.

From 15:06:48 to 15:09:45, it took almost 3 minutes.

Also, about "native GPU": does that mean you have already updated the SDK package? Where can I download it? Could you give a link?

kangguangli commented 4 months ago

"Native GPU" refers to your original configuration, i.e. running without TRT. What was updated is mainly this repository: just pull the latest commit of this repo, or manually update c++/gpu/ppyoloe_crn_l/ppyoloe_crn_l.cc following #520.

As for the TRT load time, I will pass it on to the relevant colleagues; it may not be fixable in the short term.
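A mitigation commonly used for the multi-minute TRT engine build (a sketch under stated assumptions, not advice from this thread: verify EnableTensorRtEngine's use_static parameter and SetOptimCacheDir against your Paddle Inference version; the cache directory name below is hypothetical) is to serialize the engine built on the first run so later runs reload it instead of rebuilding:

```cpp
// Sketch only: configure Paddle Inference to cache the serialized TRT engine
// so that subsequent process starts skip the engine-build step logged above.
#include "paddle_inference_api.h"

paddle_infer::Config MakeCachedTrtConfig() {
  paddle_infer::Config config;
  config.SetModel("ppyoloe_crn_l_300e_coco/model.pdmodel",
                  "ppyoloe_crn_l_300e_coco/model.pdiparams");
  config.EnableUseGpu(500, 0);
  config.EnableTensorRtEngine(1 << 30, /*max_batch_size=*/1,
                              /*min_subgraph_size=*/3,
                              paddle_infer::PrecisionType::kHalf,
                              /*use_static=*/true,   // serialize the engine to disk
                              /*use_calib_mode=*/false);
  // Hypothetical cache directory: the first run writes the serialized
  // engine/shape info here, later runs reuse it.
  config.SetOptimCacheDir("./trt_engine_cache");
  return config;
}
```

With this in place, only the first run pays the ~3-minute build cost observed in the log; later startups typically drop to seconds, at the cost of the cache being tied to the same GPU, TRT version, and model.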