Closed iam10010 closed 6 years ago
Hello @ymbaek, I am not quite sure why this happens, or how you measure the execution time in each case. If you provide the code of both examples we can have a look.
Hello @GeorgeARM, thank you for helping. Below is my graph example code. I used ACL v18.01.
...
graph << target_hint
<< convolution_hint
<< Tensor(TensorInfo(TensorShape(288U, 288U, 3U, 1U), 1, DataType::F32), DummyAccessor())
<< ConvolutionMethodHint::DIRECT
// Layer 1
<< ConvolutionLayer(
3U, 3U, 8U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 1, 1))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
<< PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
// Layer 2
<< ConvolutionLayer(
3U, 3U, 16U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 1, 1))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
<< PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
// Layer 3
<< ConvolutionLayer(
3U, 3U, 32U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 1, 1))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
<< PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
// Layer 4
<< ConvolutionLayer(
3U, 3U, 64U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 1, 1))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
<< PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
// Layer 5
<< ConvolutionLayer(
3U, 3U, 128U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 1, 1))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
<< PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
// Layer 6
<< ConvolutionLayer(
3U, 3U, 256U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 1, 1))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
// << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(1, 1, 0, 0)))
// Layer 7
<< ConvolutionLayer(
3U, 3U, 512U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 1, 1))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
// Layer 8
<< ConvolutionLayer(
3U, 3U, 256U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 1, 1))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
// Layer 9
<< ConvolutionLayer(
1U, 1U, 30U,
get_weights_accessor(data_path, "whatever.npy"),
get_weights_accessor(data_path, "whatever.npy"),
PadStrideInfo(1, 1, 0, 0))
<< ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU))
<< Tensor(DummyAccessor());
.
.
void do_run() override
{
// Run graph
double s, d;
for(int i = 0; i < 10; i++){
s = now_ms();
graph.run();
d = now_ms() - s;
std::cout << d << "ms\n";
}
}
static double now_ms(void){
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec*1000. + tv.tv_usec/1000.;
}
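A side note on the timing helper above: gettimeofday() reads the wall clock, which can jump if the system time is adjusted during a run. Below is a minimal sketch of a monotonic alternative based on std::chrono::steady_clock. The names now_ms_steady and time_runs are hypothetical helpers made up for this illustration and are not part of ACL.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not part of ACL): milliseconds read from a monotonic clock.
// steady_clock never goes backwards, unlike the wall clock behind gettimeofday().
static double now_ms_steady()
{
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now().time_since_epoch()).count();
}

// Hypothetical helper: call `work` `iterations` times and return each duration in ms.
static std::vector<double> time_runs(const std::function<void()> &work, int iterations)
{
    assert(iterations >= 0);
    std::vector<double> durations_ms;
    durations_ms.reserve(static_cast<std::size_t>(iterations));
    for(int i = 0; i < iterations; ++i)
    {
        const double start = now_ms_steady();
        work();
        durations_ms.push_back(now_ms_steady() - start);
    }
    return durations_ms;
}
```

Inside do_run() this could be called as time_runs([&] { graph.run(); }, 10), assuming graph is the member used in the listing above.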
Here is my neon_ex code.
void do_setup(int argc, char **argv) override
{
ARM_COMPUTE_UNUSED(argc);
ARM_COMPUTE_UNUSED(argv);
// Create memory manager components
// We need 2 memory managers: 1 for handling the tensors within the functions (mm_layers) and 1 for handling the input and output tensors of the functions (mm_transitions)
auto lifetime_mgr0 = std::make_shared<BlobLifetimeManager>(); // Create lifetime manager
auto lifetime_mgr1 = std::make_shared<BlobLifetimeManager>(); // Create lifetime manager
auto pool_mgr0 = std::make_shared<PoolManager>(); // Create pool manager
auto pool_mgr1 = std::make_shared<PoolManager>(); // Create pool manager
auto mm_layers = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr0, pool_mgr0); // Create the memory manager
auto mm_transitions = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr1, pool_mgr1); // Create the memory manager
// The weights and biases tensors should be initialized with the values inferred with the training
// Set memory manager where allowed to manage internal memory requirements
conv0 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
conv1 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
conv2 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
conv3 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
conv4 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
conv5 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
conv6 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
conv7 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
conv8 = arm_compute::support::cpp14::make_unique<NEConvolutionLayer>(mm_layers);
/* [Initialize tensors] */
// Initialize src tensor
constexpr unsigned int width_src_image = 288;
constexpr unsigned int height_src_image = 288;
constexpr unsigned int ifm_src_img = 3;
const TensorShape src_shape(width_src_image, height_src_image, ifm_src_img);
src.allocator()->init(TensorInfo(src_shape, 1, DataType::F32));
// Initialize tensors of conv0
constexpr unsigned int kernel_x_conv0 = 3;
constexpr unsigned int kernel_y_conv0 = 3;
constexpr unsigned int ofm_conv0 = 8;
const TensorShape weights_shape_conv0(kernel_x_conv0, kernel_y_conv0, src_shape.z(), ofm_conv0);
const TensorShape biases_shape_conv0(weights_shape_conv0[3]);
const TensorShape out_shape_conv0(src_shape.x(), src_shape.y(), weights_shape_conv0[3]);
weights0.allocator()->init(TensorInfo(weights_shape_conv0, 1, DataType::F32));
biases0.allocator()->init(TensorInfo(biases_shape_conv0, 1, DataType::F32));
out_conv0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));
// Initialize tensors of batch0
const TensorShape fm_shape_batch0(out_shape_conv0.z());
mean0.allocator()->init(TensorInfo(fm_shape_batch0, 1, DataType::F32));
var0.allocator()->init(TensorInfo(fm_shape_batch0, 1, DataType::F32));
gamma0.allocator()->init(TensorInfo(fm_shape_batch0, 1, DataType::F32));
beta0.allocator()->init(TensorInfo(fm_shape_batch0, 1, DataType::F32));
out_batch0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));
// Initialize tensor of act0
out_act0.allocator()->init(TensorInfo(out_shape_conv0, 1, DataType::F32));
// Initialize tensor of pool0
TensorShape out_shape_pool0 = out_shape_conv0;
out_shape_pool0.set(0, out_shape_pool0.x() / 2);
out_shape_pool0.set(1, out_shape_pool0.y() / 2);
out_pool0.allocator()->init(TensorInfo(out_shape_pool0, 1, DataType::F32));
// Initialize tensors of conv1
constexpr unsigned int kernel_x_conv1 = 3;
constexpr unsigned int kernel_y_conv1 = 3;
constexpr unsigned int ofm_conv1 = 16;
const TensorShape weights_shape_conv1(kernel_x_conv1, kernel_y_conv1, out_shape_pool0.z(), ofm_conv1);
const TensorShape biases_shape_conv1(weights_shape_conv1[3]);
const TensorShape out_shape_conv1(out_shape_pool0.x(), out_shape_pool0.y(), weights_shape_conv1[3]);
weights1.allocator()->init(TensorInfo(weights_shape_conv1, 1, DataType::F32));
biases1.allocator()->init(TensorInfo(biases_shape_conv1, 1, DataType::F32));
out_conv1.allocator()->init(TensorInfo(out_shape_conv1, 1, DataType::F32));
// Initialize tensors of batch1
const TensorShape fm_shape_batch1(out_shape_conv1.z());
mean1.allocator()->init(TensorInfo(fm_shape_batch1, 1, DataType::F32));
var1.allocator()->init(TensorInfo(fm_shape_batch1, 1, DataType::F32));
gamma1.allocator()->init(TensorInfo(fm_shape_batch1, 1, DataType::F32));
beta1.allocator()->init(TensorInfo(fm_shape_batch1, 1, DataType::F32));
out_batch1.allocator()->init(TensorInfo(out_shape_conv1, 1, DataType::F32));
// Initialize tensor of act1
out_act1.allocator()->init(TensorInfo(out_shape_conv1, 1, DataType::F32));
// Initialize tensor of pool1
TensorShape out_shape_pool1 = out_shape_conv1;
out_shape_pool1.set(0, out_shape_pool1.x() / 2);
out_shape_pool1.set(1, out_shape_pool1.y() / 2);
out_pool1.allocator()->init(TensorInfo(out_shape_pool1, 1, DataType::F32));
.
.
.
// Initialize tensors of conv8
constexpr unsigned int kernel_x_conv8 = 1;
constexpr unsigned int kernel_y_conv8 = 1;
constexpr unsigned int ofm_conv8 = 30;
const TensorShape weights_shape_conv8(kernel_x_conv8, kernel_y_conv8, out_shape_conv7.z(), ofm_conv8);
const TensorShape biases_shape_conv8(weights_shape_conv8[3]);
const TensorShape out_shape_conv8(out_shape_conv7.x(), out_shape_conv7.y(), weights_shape_conv8[3]);
weights8.allocator()->init(TensorInfo(weights_shape_conv8, 1, DataType::F32));
biases8.allocator()->init(TensorInfo(biases_shape_conv8, 1, DataType::F32));
out_conv8.allocator()->init(TensorInfo(out_shape_conv8, 1, DataType::F32));
// Initialize tensor of act8
out_act8.allocator()->init(TensorInfo(out_shape_conv8, 1, DataType::F32));
/* -----------------------End: [Initialize tensors] */
/* [Configure functions] */
// in:288x288x3: 3x3 convolution, 8 output features maps (OFM)
conv0->configure(&src, &weights0, &biases0, &out_conv0, PadStrideInfo(1 /* stride_x */, 1 /* stride_y */, 1 /* pad_x */, 1 /* pad_y */));
// in:288x288x8, out:288x288x8, Batch Normalization
batch0.configure(&out_conv0, &out_batch0, &mean0, &var0, &beta0, &gamma0, 0.0001f);
// in:288x288x8, out:288x288x8, Activation function: leaky relu
act0.configure(&out_batch0, &out_act0, ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU)); //need to check Leaky relu speed
// in:288x288x8, out:144x144x8 (2x2 pooling), Pool type function: Max
pool0.configure(&out_act0, &out_pool0, PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2 /* stride_x */, 2 /* stride_y */)));
.
.
.
// in:9x9x256: 1x1 convolution, 30 output features maps (OFM)
conv8->configure(&out_act7, &weights8, &biases8, &out_conv8, PadStrideInfo(1, 1, 0, 0));
// in:9x9x30, out:9x9x30, Activation function: leaky relu
act8.configure(&out_conv8, &out_act8, ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::LEAKY_RELU));
/* -----------------------End: [Configure functions] */
/*[ Add tensors to memory manager ]*/
// We need 2 memory groups for handling the input and output
// We call explicitly allocate after manage() in order to avoid overlapping lifetimes
memory_group0 = arm_compute::support::cpp14::make_unique<MemoryGroup>(mm_transitions);
memory_group1 = arm_compute::support::cpp14::make_unique<MemoryGroup>(mm_transitions);
memory_group0->manage(&out_conv0);
out_conv0.allocator()->allocate();
memory_group1->manage(&out_batch0);
out_batch0.allocator()->allocate();
memory_group0->manage(&out_act0);
out_act0.allocator()->allocate();
memory_group1->manage(&out_pool0);
out_pool0.allocator()->allocate();
.
.
.
memory_group1->manage(&out_conv8);
out_conv8.allocator()->allocate();
memory_group0->manage(&out_act8);
out_act8.allocator()->allocate();
/* -----------------------End: [ Add tensors to memory manager ] */
/* [Allocate tensors] */
// Now that the padding requirements are known we can allocate all tensors
src.allocator()->allocate();
weights0.allocator()->allocate(); biases0.allocator()->allocate();
.
.
.
weights8.allocator()->allocate(); biases8.allocator()->allocate();
mean0.allocator()->allocate(); var0.allocator()->allocate(); beta0.allocator()->allocate(); gamma0.allocator()->allocate();
.
.
.
mean7.allocator()->allocate(); var7.allocator()->allocate(); beta7.allocator()->allocate(); gamma7.allocator()->allocate();
/* -----------------------End: [Allocate tensors] */
// Finalize layers memory manager
// Set allocator that the memory manager will use
mm_layers->set_allocator(&allocator);
// Number of pools that the manager will create. This specifies how many layers you want to run in parallel
mm_layers->set_num_pools(1);
// Finalize the manager. (Validity checks, memory allocations etc)
mm_layers->finalize();
// Finalize transitions memory manager
// Set allocator that the memory manager will use
mm_transitions->set_allocator(&allocator);
// Number of pools that the manager will create. This specifies how many models we can run in parallel.
// Setting to 2 as we need one for the input and one for the output at any given time
mm_transitions->set_num_pools(2);
// Finalize the manager. (Validity checks, memory allocations etc)
mm_transitions->finalize();
}
void do_run() override
{
// Acquire memory for the memory groups
memory_group0->acquire();
memory_group1->acquire();
for(int i = 0; i < 10; i++){
double start = now_ms();
conv0->run();
batch0.run();
act0.run();
pool0.run();
conv1->run();
batch1.run();
act1.run();
pool1.run();
conv2->run();
batch2.run();
act2.run();
pool2.run();
conv3->run();
batch3.run();
act3.run();
pool3.run();
conv4->run();
batch4.run();
act4.run();
pool4.run();
conv5->run();
batch5.run();
act5.run();
conv6->run();
batch6.run();
act6.run();
conv7->run();
batch7.run();
act7.run();
conv8->run();
act8.run();
double duration = now_ms() - start;
std::cout << duration << "ms\n";
}
// Release memory
memory_group0->release();
memory_group1->release();
}
The outputs on my device are below. Graph example output:
186.173ms
166.912ms
0.00317383ms
170.273ms
166.14ms
0.00292969ms
169.4ms
167.487ms
0.00219727ms
172.424ms
About the very small computation times, I read #311. In v17.12 I modified the code following here, but now I use v18.01 and have not modified anything.
Neon_ex output:
312.414ms
203.059ms
201.16ms
200.998ms
231.294ms
215.6ms
207.837ms
208.793ms
210.99ms
208.257ms
Thanks.
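To compare the two listings of timings on equal terms, one option is to skip the first (warm-up) iteration and ignore the sub-millisecond entries. The sketch below does that, assuming those near-zero entries are measurement artifacts of the kind discussed in #311 rather than real inferences. The helper name mean_valid_ms and the 1 ms cutoff are hypothetical choices made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: mean of the samples after skipping the first `skip` runs
// and ignoring anything below `min_valid_ms` (assumed here to be an artifact).
static double mean_valid_ms(const std::vector<double> &samples_ms, std::size_t skip, double min_valid_ms)
{
    double      sum   = 0.0;
    std::size_t count = 0;
    for(std::size_t i = skip; i < samples_ms.size(); ++i)
    {
        if(samples_ms[i] >= min_valid_ms)
        {
            sum += samples_ms[i];
            ++count;
        }
    }
    return count == 0 ? 0.0 : sum / static_cast<double>(count);
}

// Timings copied from the two outputs above.
static const std::vector<double> graph_ms = { 186.173, 166.912, 0.00317383, 170.273, 166.14,
                                              0.00292969, 169.4, 167.487, 0.00219727, 172.424 };
static const std::vector<double> neon_ms = { 312.414, 203.059, 201.16, 200.998, 231.294,
                                             215.6, 207.837, 208.793, 210.99, 208.257 };
```

With skip = 1 and a 1 ms cutoff, mean_valid_ms gives roughly 169 ms for the graph example and 210 ms for the NEON example. A plain average over all ten graph entries is instead about 120 ms, because the three near-zero values pull it down.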
Hello @ymbaek,
In the graph API you specify << ConvolutionMethodHint::DIRECT. This essentially uses the DirectConvolution function where possible (it supports 1x1, 3x3 and 5x5 convolutions) to execute the deep convolution instead of the GEMM approach.
In your second example, on the other hand, you explicitly use the ConvolutionLayer, which is GEMM based.
Can you align the two programs to use the same convolution functions?
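For readers unfamiliar with the two approaches, here is a toy sketch of a 3x3 convolution computed both ways: directly, and by unrolling the input (im2col) followed by a single matrix multiply. It is not ACL's implementation of either function. The Shape struct, the memory layout and the function names are assumptions made for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy layout (an assumption for this sketch, not ACL's layout):
// input [C][H][W], weights [K][C][3][3], output [K][H][W]; stride 1, zero padding 1.
struct Shape { int C, H, W, K; };

// Read one input element, returning zero outside the image (zero padding).
static float at(const std::vector<float> &in, const Shape &s, int c, int y, int x)
{
    if(y < 0 || y >= s.H || x < 0 || x >= s.W) return 0.f;
    return in[(c * s.H + y) * s.W + x];
}

// Direct approach: accumulate the 3x3 window for every output element in place.
static std::vector<float> conv_direct(const std::vector<float> &in, const std::vector<float> &w, const Shape &s)
{
    std::vector<float> out(static_cast<std::size_t>(s.K) * s.H * s.W, 0.f);
    for(int k = 0; k < s.K; ++k)
        for(int y = 0; y < s.H; ++y)
            for(int x = 0; x < s.W; ++x)
            {
                float acc = 0.f;
                for(int c = 0; c < s.C; ++c)
                    for(int ky = 0; ky < 3; ++ky)
                        for(int kx = 0; kx < 3; ++kx)
                            acc += w[((k * s.C + c) * 3 + ky) * 3 + kx] * at(in, s, c, y + ky - 1, x + kx - 1);
                out[(k * s.H + y) * s.W + x] = acc;
            }
    return out;
}

// GEMM approach: im2col copies every window into a column of a temporary matrix,
// then one multiply [K x (C*9)] * [(C*9) x (H*W)] produces the whole output.
static std::vector<float> conv_gemm(const std::vector<float> &in, const std::vector<float> &w, const Shape &s)
{
    const int rows = s.C * 9, cols = s.H * s.W;
    std::vector<float> col(static_cast<std::size_t>(rows) * cols);
    for(int c = 0; c < s.C; ++c)
        for(int ky = 0; ky < 3; ++ky)
            for(int kx = 0; kx < 3; ++kx)
                for(int y = 0; y < s.H; ++y)
                    for(int x = 0; x < s.W; ++x)
                        col[((c * 3 + ky) * 3 + kx) * cols + y * s.W + x] = at(in, s, c, y + ky - 1, x + kx - 1);

    std::vector<float> out(static_cast<std::size_t>(s.K) * cols, 0.f);
    for(int k = 0; k < s.K; ++k)
        for(int r = 0; r < rows; ++r)
            for(int j = 0; j < cols; ++j)
                out[k * cols + j] += w[k * rows + r] * col[r * cols + j];
    return out;
}
```

Both functions return the same values; they differ in memory traffic and loop structure, which is why timings of the two paths are not directly comparable. In ACL terms, aligning the two programs presumably means either switching the graph hint to ConvolutionMethodHint::GEMM or using NEDirectConvolutionLayer in the NEON example.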
Hello @GeorgeARM,
You're right. Thank you very much for your help. I've aligned my programs to use the same convolution functions as you advised, and now I get similar performance.
Using the DirectConvolution function, I get about 180 ms average elapsed time (10 iterations) for both of my examples, GraphEx and Non-GraphEx. Using the GEMM approach, I get about 205 ms for both examples.
Many thanks.
I implemented a simple network in two versions, one following the graph example and one following the neon_cnn example. When I checked the computation time, I found that the version based on the neon_cnn example was about two times slower than the graph API version. I guess that threading could be the reason, but I am not sure. What am I missing?
Many Thanks.