ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

std::bad_alloc when trying to run a convolution #677

Closed paulgheorghecristian closed 5 years ago

paulgheorghecristian commented 5 years ago

Output of 'strings libarm_compute.so | grep arm_compute_version': arm_compute_version=v18.08
Platform: Raspberry Pi ARM v7
Operating System: Raspbian GNU/Linux 9.8 (stretch)

Problem description: I tried using arm_compute::NEConvolutionLayer to speed up inference time, but arm_compute::NEConvolutionLayer::configure throws std::bad_alloc.

```cpp
arm_compute::Tensor input_arm;
arm_compute::Tensor weights_arm;
arm_compute::Tensor bias_arm;
arm_compute::Tensor output_arm;
arm_compute::NEConvolutionLayer conv_arm;

output_arm.allocator()->init(arm_compute::TensorInfo(
    arm_compute::TensorShape(output->dims->data[0], output->dims->data[1],
                             output->dims->data[2], output->dims->data[3]),
    1, arm_compute::DataType::F32));
input_arm.allocator()->init(arm_compute::TensorInfo(
    arm_compute::TensorShape(input->dims->data[0], input->dims->data[1],
                             input->dims->data[2], input->dims->data[3]),
    1, arm_compute::DataType::F32));
weights_arm.allocator()->init(arm_compute::TensorInfo(
    arm_compute::TensorShape(filter->dims->data[1], filter->dims->data[2],
                             filter->dims->data[3], filter->dims->data[0]),
    1, arm_compute::DataType::F32));
bias_arm.allocator()->init(arm_compute::TensorInfo(
    arm_compute::TensorShape(filter->dims->data[0]), 1, arm_compute::DataType::F32));

input_arm.allocator()->import_memory(GetTensorData<float>(input), input->bytes);
weights_arm.allocator()->import_memory(GetTensorData<float>(hwcn_weights), hwcn_weights->bytes);
bias_arm.allocator()->import_memory(GetTensorData<float>(bias), bias->bytes);
output_arm.allocator()->import_memory(GetTensorData<float>(output), output->bytes);

conv_arm.configure(&input_arm, &weights_arm, &bias_arm, &output_arm,
                   arm_compute::PadStrideInfo(params->stride_width, params->stride_height,
                                              data->padding.width, data->padding.height),
                   arm_compute::WeightsInfo(true, filter->dims->data[1], filter->dims->data[2],
                                            filter->dims->data[0], true),
                   arm_compute::Size2D(1U, 1U),
                   arm_compute::ActivationLayerInfo(
                       arm_compute::ActivationLayerInfo::ActivationFunction::LU_BOUNDED_RELU,
                       output_activation_min, output_activation_max));
conv_arm.run();
```

Valgrind gives me this:

```
==19553== Argument 'size' of function __builtin_vec_new has a fishy (possibly negative) value: -666472
==19553==    at 0x48485F0: operator new[](unsigned int) (vg_replace_malloc.c:417)
==19553==    by 0xDAA79: arm_compute::support::cpp14::_Unique_if::_Single_object arm_compute::support::cpp14::make_unique<arm_compute::MemoryRegion, unsigned int, unsigned int>(unsigned int&&, unsigned int&&) (in /home/pi/EDL/inference_template)
```

morgolock commented 5 years ago

Hi @paulgheorghecristian

Could you please rebuild with debug=1 and share the callstack throwing the exception?

paulgheorghecristian commented 5 years ago

Hello and thank you for your reply!

I have gotten past the bad_alloc problem. Now I get a segmentation fault with the following call stack:

```
#0  0x76e783e2 in std::unique_ptr<arm_compute::IMemoryRegion, std::default_delete<arm_compute::IMemoryRegion>>::get (this=0x0) at /usr/arm-linux-gnueabihf/include/c++/5/bits/unique_ptr.h:305
#1  0x76e77ede in arm_compute::BlobMemoryPool::acquire (this=0x271668, handles=std::map with 1 elements = {...}) at src/runtime/BlobMemoryPool.cpp:55
#2  0x00113dd0 in arm_compute::MemoryGroupBase::acquire (this=0x273470) at arm_compute/runtime/MemoryGroupBase.h:140
#3  0x001117fc in tflite::ops::builtin::conv::EvalFloat<(tflite::ops::builtin::conv::KernelType)2> (context=0x26bd40, node=0x26d5e0, params=0x2726f0, data=0x26a7e0, input=0x2766a8, filter=0x274698, bias=0x274638, im2col=0x2766d8, hwcn_weights=0x276708, output=0x274668) at tensorflow/contrib/lite/kernels/conv.cc:515
#4  0x0010e4ce in tflite::ops::builtin::conv::Eval<(tflite::ops::builtin::conv::KernelType)2> (context=0x26bd40, node=0x26d5e0) at tensorflow/contrib/lite/kernels/conv.cc:576
```

I guess I do not know how to use your convolution (NEConvolutionLayer) together with Tensor::allocator()->import_memory().

paulgheorghecristian commented 5 years ago

Forget the last comment; I have now arrived at a segmentation fault which seems to come from the framework. In NEGEMMConvolutionLayer.cpp at line 296: `configure_mm(gemm_input_to_use, &_weights_reshaped, gemm_output_to_use, gemm_3d_depth);`

The tensor `_weights_reshaped` seems to have a NULL buffer after calling: `_reshape_weights.configure(weights, biases_to_use, &_weights_reshaped);`

morgolock commented 5 years ago

Hi @paulgheorghecristian

Could you please try updating to 19.02? It would be helpful to see your cpp file too.

paulgheorghecristian commented 5 years ago

On 19.02 I have the same issue. This is the full stack trace:

```
#0  0x76f1f168 in TransposeInterleaveCommon<16u, unsigned short, unsigned short>::moveblock_1x4 (in0=@0x7effdcb0: 0x0, in1=@0x7effdcb4: 0x80, in2=@0x7effdcb8: 0x100, in3=@0x7effdcbc: 0x180, out=0x284e80) at ./src/core/NEON/kernels/assembly/../arm_gemm/transforms/a32_transpose_interleave_8way_32bit.hpp:115
#1  0x76f1fcca in TransposeInterleaveCommon<16u, unsigned short, unsigned short>::Transform (out=0x284e80, in=0x0, stride=64, x0=0, xmax=64, k0=0, kmax=28) at ./src/core/NEON/kernels/assembly/../arm_gemm/transforms/transpose_interleave_common.hpp:85
#2  0x76f1f1de in TransformImpl<16u, 1u, true, 2u, 2u, false>::Transform (out=0x284e80, in=0x0, stride=64, x0=0, xmax=64, k0=0, kmax=28) at ./src/core/NEON/kernels/assembly/../arm_gemm/transforms/a32_transpose_interleave_8way_32bit.hpp:124
#3  0x76f27048 in TransformImpl<8u, 1u, true, 4u, 4u, false>::Transform (out=0x284e80, in=0x0, stride=32, x0=0, xmax=32, k0=0, kmax=28) at ./src/core/NEON/kernels/assembly/../arm_gemm/transforms/a32_transpose_interleave_8way_32bit.hpp:38
#4  0x76f2667a in Transform<8u, 1u, true, false, float, float> (out=0x284e80, in=0x0, stride=32, k0=0, kmax=32, x0=0, xmax=28) at ./src/core/NEON/kernels/assembly/../arm_gemm/transform.hpp:110
#5  0x76f25a3c in arm_gemm::StdTransformsFixed<float, float, 6u, 8u, 1u>::PrepareB (this=0x7effddec, out=0x284e80, in=0x0, stride=32, x0=0, xmax=32, k0=0, kmax=28, transposed=false) at ./src/core/NEON/kernels/assembly/../arm_gemm/kernels/../std_transforms_fixed.hpp:59
#6  0x76f251fc in arm_compute::NEGEMMInterleavedPrepareBWrapperKernelTemplate::transform (this=0x282c70, wl=..., info=...) at ./arm_compute/core/NEON/kernels/assembly/NEGEMMInterleavedPrepareBWrapperKernel.h:213
#7  0x76f25084 in arm_compute::NEGEMMInterleavedPrepareBWrapperKernelTemplate::run(arm_compute::Window const&, arm_compute::ThreadInfo const&)::{lambda(arm_compute::PrepareBWorkload&&)#1}::operator()(arm_compute::PrepareBWorkload&&) const (__closure=0x7effe038, wl=...) at ./arm_compute/core/NEON/kernels/assembly/NEGEMMInterleavedPrepareBWrapperKernel.h:230
#8  0x76f258f4 in void arm_compute::detail::for_each_element_in_window<arm_gemm::sgemm_8x6, false, arm_compute::NEGEMMInterleavedPrepareBWrapperKernelTemplate::run(arm_compute::Window const&, arm_compute::ThreadInfo const&)::{lambda(arm_compute::PrepareBWorkload&&)#1}>(arm_compute::Window const&, arm_compute::ITensor const, arm_compute::NEGEMMInterleavedPrepareBWrapperKernelTemplate::run(arm_compute::Window const&, arm_compute::ThreadInfo const&)::{lambda(arm_compute::PrepareBWorkload&&)#1}, unsigned int, unsigned int, arm_compute::NEGEMMInterleavedPrepareBWrapperKernelTemplate::run(arm_compute::Window const&, arm_compute::ThreadInfo const&)::{lambda(arm_compute::PrepareBWorkload&&)#1}&&)::{lambda(arm_compute::Coordinates const&)#1}::operator()(arm_compute::Coordinates const) const (__closure=0x7effdfe8, coordinates=...) at ./arm_compute/core/NEON/kernels/assembly/NEGEMMInterleavedPrepareBWrapperKernel.h:95
#9  0x76f28b90 in arm_compute::ForEachDimension<0u>::unroll<...{lambda(arm_compute::Coordinates const&)#1}&> (w=..., id=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:117
#10 0x76f289ea in arm_compute::ForEachDimension<1u>::unroll<...{lambda(arm_compute::Coordinates const&)#1}&> (w=..., id=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:106
#11 0x76f286d6 in arm_compute::ForEachDimension<2u>::unroll<...{lambda(arm_compute::Coordinates const&)#1}&> (w=..., id=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:106
#12 0x76f28336 in arm_compute::ForEachDimension<3u>::unroll<...{lambda(arm_compute::Coordinates const&)#1}&> (w=..., id=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:106
#13 0x76f27f4e in arm_compute::ForEachDimension<4u>::unroll<...{lambda(arm_compute::Coordinates const&)#1}&> (w=..., id=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:106
#14 0x76f27932 in arm_compute::ForEachDimension<5u>::unroll<...{lambda(arm_compute::Coordinates const&)#1}&> (w=..., id=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:106
#15 0x76f26fca in arm_compute::ForEachDimension<6u>::unroll<...{lambda(arm_compute::Coordinates const&)#1}> (w=..., id=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:106
    (frames #9-#15 instantiate the same for_each_element_in_window<arm_gemm::sgemm_8x6, ...> lambda as frame #8)
#16 0x76f265f6 in arm_compute::execute_window_loop<...same for_each_element_in_window lambda as frame #8...> (w=..., lambda_function=...) at ./arm_compute/core/Helpers.inl:132
#17 0x76f259d4 in arm_compute::detail::for_each_element_in_window<arm_gemm::sgemm_8x6, false, arm_compute::NEGEMMInterleavedPrepareBWrapperKernelTemplate::run(arm_compute::Window const&, arm_compute::ThreadInfo const&)::{lambda(arm_compute::PrepareBWorkload&&)#1}>(arm_compute::Window const&, arm_compute::ITensor const, ..., unsigned int, unsigned int, ...) (window=..., b=0x27fe10, transformed_b=0x2999ac, N=32, K=28, lambda=...) at ./arm_compute/core/NEON/kernels/assembly/NEGEMMInterleavedPrepareBWrapperKernel.h:77
#18 0x76f25104 in arm_compute::NEGEMMInterleavedPrepareBWrapperKernelTemplate::run (this=0x282c70, window=..., info=...) at ./arm_compute/core/NEON/kernels/assembly/NEGEMMInterleavedPrepareBWrapperKernel.h:228
#19 0x76ea3c08 in arm_compute::CPPScheduler::schedule (this=0x76f64860 <arm_compute::CPPScheduler::get()::scheduler>, kernel=0x282c70, hints=...) at src/runtime/CPP/CPPScheduler.cpp:292
#20 0x76f1cab0 in arm_compute::NEGEMMInterleavedWrapper::prepare (this=0x2998e0) at src/runtime/NEON/functions/assembly/NEGEMMInterleavedWrapper.cpp:198
#21 0x76ece53c in arm_compute::NEGEMMAssemblyDispatch::prepare (this=0x27f53c) at src/runtime/NEON/functions/NEGEMMAssemblyDispatch.cpp:362
#22 0x76ecd1ee in arm_compute::NEGEMM::prepare (this=0x27f408) at src/runtime/NEON/functions/NEGEMM.cpp:283
#23 0x76ecd02a in arm_compute::NEGEMM::run (this=0x27f408) at src/runtime/NEON/functions/NEGEMM.cpp:239
#24 0x76ed51b2 in arm_compute::NEGEMMConvolutionLayer::run (this=0x27f2e8) at src/runtime/NEON/functions/NEGEMMConvolutionLayer.cpp:596
#25 0x76eb5bce in arm_compute::NEConvolutionLayer::run (this=0x271458) at src/runtime/NEON/functions/NEConvolutionLayer.cpp:162
```

The code is below (I first want to make it work without importing memory, which is why the import code is commented out):

```cpp
arm_compute::Tensor input_arm;
arm_compute::Tensor weights_arm;
arm_compute::Tensor bias_arm;
arm_compute::Tensor output_arm;
arm_compute::Allocator allocator{};
std::unique_ptr<arm_compute::MemoryGroup> memory_group0{};
std::unique_ptr<arm_compute::MemoryGroup> memory_group1{};
std::unique_ptr<arm_compute::NEConvolutionLayer> conv_arm{};

auto lifetime_mgr0  = std::make_shared<arm_compute::BlobLifetimeManager>();
auto lifetime_mgr1  = std::make_shared<arm_compute::BlobLifetimeManager>();
auto pool_mgr0      = std::make_shared<arm_compute::PoolManager>();
auto pool_mgr1      = std::make_shared<arm_compute::PoolManager>();
auto mm_layers      = std::make_shared<arm_compute::MemoryManagerOnDemand>(lifetime_mgr0, pool_mgr0);
auto mm_transitions = std::make_shared<arm_compute::MemoryManagerOnDemand>(lifetime_mgr1, pool_mgr1);

conv_arm      = arm_compute::support::cpp14::make_unique<arm_compute::NEConvolutionLayer>(mm_layers);
memory_group0 = arm_compute::support::cpp14::make_unique<arm_compute::MemoryGroup>(mm_transitions);
memory_group1 = arm_compute::support::cpp14::make_unique<arm_compute::MemoryGroup>(mm_transitions);

arm_compute::TensorInfo output_info(arm_compute::TensorShape(
    output->dims->data[3], output->dims->data[2],
    output->dims->data[1], output->dims->data[0]), 1, arm_compute::DataType::F32);
output_info.set_data_layout(arm_compute::DataLayout::NHWC);

arm_compute::TensorInfo input_info(arm_compute::TensorShape(
    input->dims->data[3], input->dims->data[2],
    input->dims->data[1], input->dims->data[0]), 1, arm_compute::DataType::F32);
input_info.set_data_layout(arm_compute::DataLayout::NHWC);

arm_compute::TensorInfo weights_info(arm_compute::TensorShape(
    filter->dims->data[3], filter->dims->data[2],
    filter->dims->data[1], filter->dims->data[0]), 1, arm_compute::DataType::F32);
weights_info.set_data_layout(arm_compute::DataLayout::NHWC);

arm_compute::TensorInfo bias_info(arm_compute::TensorShape(filter->dims->data[0]),
                                  1, arm_compute::DataType::F32);

output_arm.allocator()->init(arm_compute::TensorInfo(output_info));
input_arm.allocator()->init(arm_compute::TensorInfo(input_info));
weights_arm.allocator()->init(arm_compute::TensorInfo(weights_info));
bias_arm.allocator()->init(arm_compute::TensorInfo(bias_info));

/*
input_arm.allocator()->import_memory(GetTensorData<float>(input), input_info.total_size());
weights_arm.allocator()->import_memory(GetTensorData<float>(filter), weights_info.total_size());
bias_arm.allocator()->import_memory(GetTensorData<float>(bias), bias_info.total_size());
output_arm.allocator()->import_memory(GetTensorData<float>(output), output_info.total_size());
*/

conv_arm->configure(&input_arm, &weights_arm, &bias_arm, &output_arm,
                    arm_compute::PadStrideInfo(params->stride_width, params->stride_height,
                                               data->padding.width, data->padding.height,
                                               arm_compute::DimensionRoundingType::CEIL),
                    arm_compute::WeightsInfo(false, filter->dims->data[2], filter->dims->data[1],
                                             filter->dims->data[0], true),
                    arm_compute::Size2D(params->dilation_width_factor, params->dilation_height_factor),
                    arm_compute::ActivationLayerInfo(
                        arm_compute::ActivationLayerInfo::ActivationFunction::LU_BOUNDED_RELU,
                        output_activation_max, output_activation_min));

input_info.set_num_channels(4);
weights_info.set_num_channels(4);
output_info.set_num_channels(4);
bias_info.set_num_channels(1);

memory_group0->manage(&output_arm);
output_arm.allocator()->allocate();
memory_group1->manage(&input_arm);
input_arm.allocator()->allocate();

weights_arm.allocator()->allocate();
bias_arm.allocator()->allocate();

mm_layers->populate(allocator, 1);
mm_transitions->populate(allocator, 2);

memory_group0->acquire();
memory_group1->acquire();

conv_arm->run();

memory_group0->release();
memory_group1->release();
```
paulgheorghecristian commented 5 years ago

Ok, I managed to get past all the problems and accuracy is the same. Unfortunately it is slower than TF Lite's conv by about 80 ms. Is this expected? (Both on 4 threads.) (GEMM gives me 300 ms with matching accuracy; DIRECT gives me 179 ms, but accuracy is 0 instead of 75.) (On version 19.02.)

morgolock commented 5 years ago

Hi @paulgheorghecristian

Could you please share the shapes you use to configure the convolution layer? For FP32 on armv8-a I would expect ACL to give up to a 2x speedup for most networks compared with TF on CPU. You are running this on armv7, which may not be as optimised as our v8 path.

paulgheorghecristian commented 5 years ago

Yes, I am using armv7. The shapes are those of MobileNetV2 and all of them are in NHWC format.

For example, the first convolution has: output 32 112 112 1, input 3 224 224 1, filter 3 3 3 32 (these are the exact shapes fed into ACL).

paulgheorghecristian commented 5 years ago

I ran it on armv8 as well and it's still slower. Maybe I'm doing something wrong?

paulgheorghecristian commented 5 years ago

I have a question: do you optimise 1x1 convolutions in some way?

paulgheorghecristian commented 5 years ago

Ok, it now runs faster than TF Lite with about 50 ms. Sorry for the spam and thank you for this wonderful library.
