Closed · paulgheorghecristian closed this issue 5 years ago
Hi @paulgheorghecristian
Could you please rebuild with debug=1 and share the callstack throwing the exception?
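For anyone following along, a debug build of ACL is typically produced with scons along these lines. The exact option set below is a sketch from the build documentation and may need adjusting for your checkout (e.g. `opencl=1` if you use the CL backend):

```shell
# Native debug build on an armv7 board (e.g. Raspberry Pi).
# debug=1 keeps symbols and disables optimisation; asserts=1 enables
# internal validation, which often turns a crash into a readable error.
scons -j4 debug=1 asserts=1 neon=1 opencl=0 os=linux arch=armv7a build=native
```

Run the failing binary under gdb afterwards and use `bt` to capture the call stack.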
Hello and thank you for your reply!
I have got past the bad_alloc problem. Now I get a segmentation fault with the following call stack (top frames truncated):

```
bias=0x274638, im2col=0x2766d8, hwcn_weights=0x276708, output=0x274668) at tensorflow/contrib/lite/kernels/conv.cc:515
```
I suspect I do not know how to use your convolution layer (`NEConvolutionLayer`) together with `Tensor::allocator()->import_memory()`.
Forget the last comment; I have now reached a segmentation fault that seems to come from the framework itself. In NEGEMMConvolutionLayer.cpp at line 296: `configure_mm(gemm_input_to_use, &_weights_reshaped, gemm_output_to_use, gemm_3d_depth);`
The tensor `_weights_reshaped` seems to have a NULL buffer after calling `_reshape_weights.configure(weights, biases_to_use, &_weights_reshaped);`
Hi @paulgheorghecristian
Could you please try updating to 19.02? It would be helpful to see your cpp file too.
On 19.02 I have the same issue. This is the full stack trace:
```
at ./src/core/NEON/kernels/assembly/../arm_gemm/transforms/a32_transpose_interleave_8way_32bit.hpp:115
at ./src/core/NEON/kernels/assembly/../arm_gemm/transforms/transpose_interleave_common.hpp:85
at ./src/core/NEON/kernels/assembly/../arm_gemm/transforms/a32_transpose_interleave_8way_32bit.hpp:124
at ./src/core/NEON/kernels/assembly/../arm_gemm/transforms/a32_transpose_interleave_8way_32bit.hpp:38
at ./src/core/NEON/kernels/assembly/../arm_gemm/kernels/../std_transforms_fixed.hpp:59
at ./arm_compute/core/NEON/kernels/assembly/NEGEMMInterleavedPrepareBWrapperKernel.h:213
&)#1}> (w=..., id=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:106
lambda_function=...) at ./arm_compute/core/Helpers.inl:132
at ./arm_compute/core/NEON/kernels/assembly/NEGEMMInterleavedPrepareBWrapperKernel.h:77
at ./arm_compute/core/NEON/kernels/assembly/NEGEMMInterleavedPrepareBWrapperKernel.h:228
```
The code is below (I first want to make it work without importing memory, which is why the import calls are commented out):
```cpp
arm_compute::Tensor input_arm;
arm_compute::Tensor weights_arm;
arm_compute::Tensor bias_arm;
arm_compute::Tensor output_arm;
arm_compute::Allocator allocator{};

// NOTE: the template arguments on the following lines (and the creation of
// mm_transitions, used further down) did not survive the Markdown rendering.
std::unique_ptr
auto lifetime_mgr1 = std::make_shared
auto pool_mgr0 = std::make_shared
auto pool_mgr1 = std::make_shared
auto mm_layers = std::make_shared
conv_arm = arm_compute::support::cpp14::make_unique
memory_group0 = arm_compute::support::cpp14::make_unique<arm_compute::MemoryGroup>(mm_transitions);
memory_group1 = arm_compute::support::cpp14::make_unique<arm_compute::MemoryGroup>(mm_transitions);

arm_compute::TensorInfo output_info(arm_compute::TensorShape(output->dims->data[3],
                                                             output->dims->data[2],
                                                             output->dims->data[1],
                                                             output->dims->data[0]),
                                    1, arm_compute::DataType::F32);
output_info.set_data_layout(arm_compute::DataLayout::NHWC);

arm_compute::TensorInfo input_info(arm_compute::TensorShape(input->dims->data[3],
                                                            input->dims->data[2],
                                                            input->dims->data[1],
                                                            input->dims->data[0]),
                                   1, arm_compute::DataType::F32);
input_info.set_data_layout(arm_compute::DataLayout::NHWC);

arm_compute::TensorInfo weights_info(arm_compute::TensorShape(filter->dims->data[3],
                                                              filter->dims->data[2],
                                                              filter->dims->data[1],
                                                              filter->dims->data[0]),
                                     1, arm_compute::DataType::F32);
weights_info.set_data_layout(arm_compute::DataLayout::NHWC);

arm_compute::TensorInfo bias_info(arm_compute::TensorShape(filter->dims->data[0]),
                                  1, arm_compute::DataType::F32);

output_arm.allocator()->init(arm_compute::TensorInfo(output_info));
input_arm.allocator()->init(arm_compute::TensorInfo(input_info));
weights_arm.allocator()->init(arm_compute::TensorInfo(weights_info));
bias_arm.allocator()->init(arm_compute::TensorInfo(bias_info));

/*
input_arm.allocator()->import_memory(GetTensorData<float>(input), input_info.total_size());
weights_arm.allocator()->import_memory(GetTensorData<float>(filter), weights_info.total_size());
bias_arm.allocator()->import_memory(GetTensorData<float>(bias), bias_info.total_size());
output_arm.allocator()->import_memory(GetTensorData<float>(output), output_info.total_size());
*/

conv_arm->configure(&input_arm, &weights_arm, &bias_arm, &output_arm,
                    arm_compute::PadStrideInfo(params->stride_width, params->stride_height,
                                               data->padding.width, data->padding.height,
                                               arm_compute::DimensionRoundingType::CEIL),
                    arm_compute::WeightsInfo(false, filter->dims->data[2], filter->dims->data[1],
                                             filter->dims->data[0], true),
                    arm_compute::Size2D(params->dilation_width_factor, params->dilation_height_factor),
                    arm_compute::ActivationLayerInfo(arm_compute::ActivationLayerInfo::ActivationFunction::LU_BOUNDED_RELU,
                                                     output_activation_max, output_activation_min));

input_info.set_num_channels(4);
weights_info.set_num_channels(4);
output_info.set_num_channels(4);
bias_info.set_num_channels(1);

memory_group0->manage(&output_arm);
output_arm.allocator()->allocate();
memory_group1->manage(&input_arm);
input_arm.allocator()->allocate();
weights_arm.allocator()->allocate();
bias_arm.allocator()->allocate();

mm_layers->populate(allocator, 1);
mm_transitions->populate(allocator, 2);

memory_group0->acquire();
memory_group1->acquire();
conv_arm->run();
memory_group0->release();
memory_group1->release();
```
Ok, I managed to get past all the problems and accuracy is the same. Unfortunately it is slower than TF Lite's convolution by about 80 ms. Is this expected? (Both on 4 threads, version 19.02: GEMM gives me 300 ms with matching accuracy; DIRECT gives me 179 ms, but accuracy is 0 instead of 75.)
Hi @paulgheorghecristian
Could you please share the shapes you use to configure the convolution layer? For FP32 on armv8a I would expect ACL to give up to a 2x speedup for most networks when comparing with TF on CPU. You are running this on armv7, and that path might not be as optimised as our v8 path.
Yes, I am using armv7. The shapes are those of MobileNetV2, and all of them are in NHWC format.
For example, the first convolution is: output 32 112 112 1, input 3 224 224 1, filter 3 3 3 32 (these are the exact shapes fed into ACL).
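Those numbers follow from TF Lite storing dims as [N, H, W, C] and the configure snippet earlier in this thread passing them to `arm_compute::TensorShape` in reverse order. A standalone sketch of that mapping (the helper names here are made up for illustration, not ACL or TF Lite API):

```cpp
#include <array>
#include <cstddef>

// TF Lite dims are [N, H, W, C]; the snippet above builds the ACL
// TensorShape as (dims[3], dims[2], dims[1], dims[0]), i.e. reversed.
std::array<int, 4> tflite_to_acl_shape(const std::array<int, 4>& nhwc)
{
    return {nhwc[3], nhwc[2], nhwc[1], nhwc[0]};
}

// Element count of a 4-D tensor; multiplied by sizeof(float) this is the
// byte size a call like import_memory() would need to cover.
std::size_t num_elements(const std::array<int, 4>& shape)
{
    std::size_t n = 1;
    for (int d : shape)
        n *= static_cast<std::size_t>(d);
    return n;
}
```

With the MobileNetV2 input above, `{1, 224, 224, 3}` maps to `{3, 224, 224, 1}`, matching "input 3 224 224 1".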
I also ran it on armv8 and it's still slower... maybe I'm doing something wrong?
I have a question, do you optimize 1x1 convs in some way?
Ok, it now runs faster than TF Lite by about 50 ms. Sorry for the spam, and thank you for this wonderful library.
Output of `strings libarm_compute.so | grep arm_compute_version`: arm_compute_version=v18.08
Platform: Raspberry Pi, ARM v7
Operating System: Raspbian GNU/Linux 9.8 (stretch)
Problem description: I tried using `arm_compute::NEConvolutionLayer` to speed up inference time, but `arm_compute::NEConvolutionLayer::configure` throws `std::bad_alloc`.
```cpp
arm_compute::Tensor input_arm;
arm_compute::Tensor weights_arm;
arm_compute::Tensor bias_arm;
arm_compute::Tensor output_arm;
arm_compute::NEConvolutionLayer conv_arm;
```
Valgrind gives me this:

```
==19553== Argument 'size' of function __builtin_vec_new has a fishy (possibly negative) value: -666472
==19553==    at 0x48485F0: operator new[](unsigned int) (vg_replace_malloc.c:417)
==19553==    by 0xDAA79: arm_compute::support::cpp14::_Unique_if::_Single_object arm_compute::support::cpp14::make_unique<arm_compute::MemoryRegion, unsigned int, unsigned int>(unsigned int&&, unsigned int&&) (in /home/pi/EDL/inference_template)
```
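A note on how a "fishy (possibly negative)" size like this typically arises: on a 32-bit armv7 build the allocation size is an `unsigned int`, so if the shape arithmetic produces a value "below zero" (for example because dimensions were fed in the wrong order), it wraps around to a huge unsigned number. Valgrind prints that number reinterpreted as signed, and `operator new[]` then fails with `std::bad_alloc`. A self-contained illustration (not ACL code; the operand values are chosen only to reproduce the -666472 in the log):

```cpp
#include <cstdint>

// Unsigned subtraction wraps modulo 2^32 instead of going negative.
uint32_t wrapped_size(uint32_t a, uint32_t b)
{
    return a - b; // huge value when b > a
}

// Valgrind's fishy-value check displays the size as a signed integer.
int32_t as_valgrind_shows(uint32_t v)
{
    return static_cast<int32_t>(v);
}
```

So a size request of roughly 4 GB on a Raspberry Pi is guaranteed to throw, which is consistent with the `configure` failure described above; double-checking the order of the `TensorShape` arguments is the first thing to verify.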