OAID / Caffe-HRT

Heterogeneous Run Time version of Caffe. It adds heterogeneous capabilities to Caffe, using a heterogeneous computing infrastructure framework to speed up deep learning on Arm-based heterogeneous embedded platforms, while retaining all the features of the original Caffe architecture so that users can deploy their applications seamlessly.

performance gain with ACL #2

Open kaishijeng opened 7 years ago

kaishijeng commented 7 years ago

I did performance profiling on classification with the BVLC model, comparing original Caffe and CaffeOnACL, and saw some gain, but not as big as I was hoping. Is this also what you observe on your platform? I use the following command on a Firefly RK3399:

./build/examples/cpp_classification/classification.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg

and measure the time spent as below:

std::vector<Prediction> Classifier::Classify(const cv::Mat& img, int N) {
  // Warm-up pass: run the network once so model setup is excluded from the timing.
  std::vector<float> output = Predict(img);

  std::clock_t begin = std::clock();
  output = Predict(img);

  N = std::min<int>(labels_.size(), N);
  std::vector<int> maxN = Argmax(output, N);
  std::vector<Prediction> predictions;
  for (int i = 0; i < N; ++i) {
    int idx = maxN[i];
    predictions.push_back(std::make_pair(labels_[idx], output[idx]));
  }
  std::clock_t end = std::clock();
  double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
  std::cout << "Time spent: " << elapsed_secs << std::endl;

  return predictions;
}
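
A side note on the measurement itself: std::clock() reports CPU time, which on Linux accumulates across all threads, so a multi-threaded ACL or OpenBLAS run can report more seconds than actually elapse on the wall clock. A minimal wall-clock variant of the same timing (a sketch, not from the thread; time_once is a hypothetical helper) could look like:

    #include <chrono>
    #include <iostream>

    // Time one call on the wall clock (steady_clock is monotonic and unaffected by thread count).
    template <typename F>
    double time_once(F&& forward) {
      auto t0 = std::chrono::steady_clock::now();
      forward();
      auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double>(t1 - t0).count();
    }

    // Usage inside Classify(), after the warm-up Predict():
    //   double elapsed_secs = time_once([&] { output = Predict(img); });
    //   std::cout << "Wall time spent: " << elapsed_secs << " s" << std::endl;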

The time measurements for Caffe and CaffeOnACL are below:

CaffeOnACL
Time spent: 4.53536
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

Original Caffe
Time spent: 5.5306
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

honggui commented 7 years ago

Yes, kaishijeng. The performance gain percentage we got is just like what you got. Because of the time spent loading the model's parameters, a real classification application will be much faster (it only needs to load the parameters once).

kaishijeng commented 7 years ago

If you look at how I measure the time, I actually measure starting from the 2nd prediction, so loading parameters should not affect my time measurement.

Thanks,

honggui commented 7 years ago

Kaishijeng, the time you measured is much longer than what I measured. In the Arm Compute Library, there's a line "force_number_of_threads(0)" in the file src\runtime\CPP\CPPScheduler.cpp. You may change the line to "force_number_of_threads(1)" and try again.

kaishijeng commented 7 years ago

honggui

I can't find a force_number_of_threads function in src/runtime/CPP/CPPScheduler.cpp in the ComputeLibrary. Can you check it? I also have a couple of questions about your measurements:

1) What platform do you use, and what time spent do you get on it?
2) Which portion of the code do you measure?
3) How do I know the GPU has been used? I modified the arm_gpu_mode function in include/caffe/common.hpp as below, but I'm not sure whether it is correct for forcing GPU mode:

//inline static bool arm_gpu_mode() {return Get().use_maligpu;}
inline static bool arm_gpu_mode() {return true;}

Thanks,

honggui commented 7 years ago

Hi Kaishijeng, I made a mistake, that line is not in ACL 17.06. You may use CPPScheduler::set_num_threads(1) and try again. To enable GPU mode, use Caffe::set_mode(Caffe::GPU) (see examples/cpp_classification/classification_gpu.cpp as an example). Best regards, Honggui
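
For reference, a minimal sketch of applying both settings before running inference (assuming ACL exposes the scheduler singleton as arm_compute::CPPScheduler::get(); the exact call site in CaffeOnACL may differ):

    #include <caffe/caffe.hpp>
    #include <arm_compute/runtime/CPP/CPPScheduler.h>

    void configure_runtime() {
      // Pin ACL's CPU scheduler to a single worker thread, as suggested above.
      arm_compute::CPPScheduler::get().set_num_threads(1);

      // Route supported layers to the Mali GPU (ACL's OpenCL backend),
      // as done in examples/cpp_classification/classification_gpu.cpp.
      caffe::Caffe::set_mode(caffe::Caffe::GPU);
    }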

kaishijeng commented 7 years ago

That reduces the time by 0.3 s, from 4.5 to 4.2, with set_num_threads(1). What numbers do you get in your test?

Thanks,

honggui commented 7 years ago

kaishijeng, the log is listed below. (Including setup time it is 1.794151 s; excluding setup time it is 0.62415 s per forward.) Regards, Honggui

firefly@firefly:~/caffeOnACL$ ./build/examples/cpp_classification/classification_profiling.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg
LOGACL<0> LOGACL: 0
---------- Prediction for examples/images/cat.jpg ----------
used time: 1795
Input/output shape for each layer ... total: 24

LAYER IDX: 23 name: prob type: Softmax bottom fc8: 1 1000 top prob: 1 1000

LAYER IDX: 22 name: fc8 type: InnerProduct bottom fc7: 1 4096 top fc8: 1 1000

LAYER IDX: 21 name: drop7 type: Dropout bottom fc7: 1 4096 top fc7: 1 4096

LAYER IDX: 20 name: relu7 type: ReLU bottom fc7: 1 4096 top fc7: 1 4096

LAYER IDX: 19 name: fc7 type: InnerProduct bottom fc6: 1 4096 top fc7: 1 4096

LAYER IDX: 18 name: drop6 type: Dropout bottom fc6: 1 4096 top fc6: 1 4096

LAYER IDX: 17 name: relu6 type: ReLU bottom fc6: 1 4096 top fc6: 1 4096

LAYER IDX: 16 name: fc6 type: InnerProduct bottom pool5: 1 256 6 6 top fc6: 1 4096

LAYER IDX: 15 name: pool5 type: Pooling bottom conv5: 1 256 13 13 top pool5: 1 256 6 6

LAYER IDX: 14 name: relu5 type: ReLU bottom conv5: 1 256 13 13 top conv5: 1 256 13 13

LAYER IDX: 13 name: conv5 type: Convolution bottom conv4: 1 384 13 13 top conv5: 1 256 13 13

LAYER IDX: 12 name: relu4 type: ReLU bottom conv4: 1 384 13 13 top conv4: 1 384 13 13

LAYER IDX: 11 name: conv4 type: Convolution bottom conv3: 1 384 13 13 top conv4: 1 384 13 13

LAYER IDX: 10 name: relu3 type: ReLU bottom conv3: 1 384 13 13 top conv3: 1 384 13 13

LAYER IDX: 9 name: conv3 type: Convolution bottom norm2: 1 256 13 13 top conv3: 1 384 13 13

LAYER IDX: 8 name: norm2 type: LRN bottom pool2: 1 256 13 13 top norm2: 1 256 13 13

LAYER IDX: 7 name: pool2 type: Pooling bottom conv2: 1 256 27 27 top pool2: 1 256 13 13

LAYER IDX: 6 name: relu2 type: ReLU bottom conv2: 1 256 27 27 top conv2: 1 256 27 27

LAYER IDX: 5 name: conv2 type: Convolution bottom norm1: 1 96 27 27 top conv2: 1 256 27 27

LAYER IDX: 4 name: norm1 type: LRN bottom pool1: 1 96 27 27 top norm1: 1 96 27 27

LAYER IDX: 3 name: pool1 type: Pooling bottom conv1: 1 96 55 55 top pool1: 1 96 27 27

LAYER IDX: 2 name: relu1 type: ReLU bottom conv1: 1 96 55 55 top conv1: 1 96 55 55

LAYER IDX: 1 name: conv1 type: Convolution bottom data: 1 3 227 227 top conv1: 1 96 55 55

LAYER IDX: 0 name: data type: Input top data: 1 3 227 227

Time for each layer ... sum of all layers is : 1794151

LAYER IDX: 23 name: prob type: Softmax ratio: 0 time stat: total: 0 count: 1 average: 0 start: 597045632 end: 597045632

LAYER IDX: 22 name: fc8 type: InnerProduct ratio: 4.23632 time stat: total: 76006 count: 1 average: 76006 start: 596969626 end: 597045632

LAYER IDX: 21 name: drop7 type: Dropout ratio: 0 time stat: total: 0 count: 1 average: 0 start: 596969626 end: 596969626

LAYER IDX: 20 name: relu7 type: ReLU ratio: 0 time stat: total: 0 count: 1 average: 0 start: 596969626 end: 596969626

LAYER IDX: 19 name: fc7 type: InnerProduct ratio: 20.903 time stat: total: 375031 count: 1 average: 375031 start: 596594595 end: 596969626

LAYER IDX: 18 name: drop6 type: Dropout ratio: 0 time stat: total: 0 count: 1 average: 0 start: 596594595 end: 596594595

LAYER IDX: 17 name: relu6 type: ReLU ratio: 0 time stat: total: 0 count: 1 average: 0 start: 596594595 end: 596594595

LAYER IDX: 16 name: fc6 type: InnerProduct ratio: 42.5307 time stat: total: 763065 count: 1 average: 763065 start: 595831530 end: 596594595

LAYER IDX: 15 name: pool5 type: Pooling ratio: 1.05905 time stat: total: 19001 count: 1 average: 19001 start: 595811528 end: 595830529

LAYER IDX: 14 name: relu5 type: ReLU ratio: 0 time stat: total: 0 count: 1 average: 0 start: 595811528 end: 595811528

LAYER IDX: 13 name: conv5 type: Convolution ratio: 1.61653 time stat: total: 29003 count: 1 average: 29003 start: 595782525 end: 595811528

LAYER IDX: 12 name: relu4 type: ReLU ratio: 0.0557367 time stat: total: 1000 count: 1 average: 1000 start: 595781525 end: 595782525

LAYER IDX: 11 name: conv4 type: Convolution ratio: 2.73132 time stat: total: 49004 count: 1 average: 49004 start: 595732521 end: 595781525

LAYER IDX: 10 name: relu3 type: ReLU ratio: 0.0557367 time stat: total: 1000 count: 1 average: 1000 start: 595731521 end: 595732521

LAYER IDX: 9 name: conv3 type: Convolution ratio: 10.7581 time stat: total: 193016 count: 1 average: 193016 start: 595538505 end: 595731521

LAYER IDX: 8 name: norm2 type: LRN ratio: 0.334476 time stat: total: 6001 count: 1 average: 6001 start: 595532504 end: 595538505

LAYER IDX: 7 name: pool2 type: Pooling ratio: 1.95095 time stat: total: 35003 count: 1 average: 35003 start: 595497501 end: 595532504

LAYER IDX: 6 name: relu2 type: ReLU ratio: 0.222947 time stat: total: 4000 count: 1 average: 4000 start: 595493501 end: 595497501

LAYER IDX: 5 name: conv2 type: Convolution ratio: 8.97438 time stat: total: 161014 count: 1 average: 161014 start: 595332487 end: 595493501

LAYER IDX: 4 name: norm1 type: LRN ratio: 0.390212 time stat: total: 7001 count: 1 average: 7001 start: 595325486 end: 595332487

LAYER IDX: 3 name: pool1 type: Pooling ratio: 1.11484 time stat: total: 20002 count: 1 average: 20002 start: 595305484 end: 595325486

LAYER IDX: 2 name: relu1 type: ReLU ratio: 0.33442 time stat: total: 6000 count: 1 average: 6000 start: 595299484 end: 595305484

LAYER IDX: 1 name: conv1 type: Convolution ratio: 2.73132 time stat: total: 49004 count: 1 average: 49004 start: 595250480 end: 595299484

LAYER IDX: 0 name: data type: Input ratio: 0 time stat: total: 0 count: 1 average: 0 start: 595250480 end: 595250480


STATS for 10 reptitions: ...
Total time: 624150 per forward
Each layer stats: ...
23: used time: 100 ratio: 0.0160218 enter count: 1
22: used time: 18001 ratio: 2.88416 enter count: 1
21: used time: 0 ratio: 0 enter count: 1
20: used time: 0 ratio: 0 enter count: 1
19: used time: 68005 ratio: 10.8957 enter count: 1
18: used time: 0 ratio: 0 enter count: 1
17: used time: 0 ratio: 0 enter count: 1
16: used time: 181514 ratio: 29.0819 enter count: 1
15: used time: 23601 ratio: 3.78145 enter count: 1
14: used time: 200 ratio: 0.0320596 enter count: 1
13: used time: 22701 ratio: 3.63722 enter count: 1
12: used time: 200 ratio: 0.0320436 enter count: 1
11: used time: 42503 ratio: 6.80979 enter count: 1
10: used time: 400 ratio: 0.0640872 enter count: 1
9: used time: 67305 ratio: 10.7835 enter count: 1
8: used time: 4200 ratio: 0.672979 enter count: 1
7: used time: 26802 ratio: 4.29418 enter count: 1
6: used time: 1100 ratio: 0.17624 enter count: 1
5: used time: 109508 ratio: 17.5453 enter count: 1
4: used time: 5100 ratio: 0.817159 enter count: 1
3: used time: 15501 ratio: 2.48357 enter count: 1
2: used time: 2400 ratio: 0.384587 enter count: 1
1: used time: 35002 ratio: 5.60806 enter count: 1
0: used time: 0 ratio: 0 enter count: 1

time cost top 10 layers are: ...
16: used time: 181514 ratio: 29.0819 enter count: 1
5: used time: 109508 ratio: 17.5453 enter count: 1
19: used time: 68005 ratio: 10.8957 enter count: 1
9: used time: 67305 ratio: 10.7835 enter count: 1
11: used time: 42503 ratio: 6.80979 enter count: 1
1: used time: 35002 ratio: 5.60806 enter count: 1
7: used time: 26802 ratio: 4.29418 enter count: 1
15: used time: 23601 ratio: 3.78145 enter count: 1
13: used time: 22701 ratio: 3.63722 enter count: 1
22: used time: 18001 ratio: 2.88416 enter count: 1
Top cost layers occupied: 95.3213

0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

kaishijeng commented 7 years ago

How do you get the log? I ran the same command below and got only the classification result, no profiling log.

./build/examples/cpp_classification/classification_profiling.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg

---------- Prediction for examples/images/cat.jpg ----------
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

honggui commented 7 years ago

Hi kaishijeng, we can set "USE_PROFILING" in Makefile.config to enable the profiling messages:

CPU_ONLY := 1
USE_PROFILING := 1
USE_ACL := 1

Regards, Honggui
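
(After changing those flags the tree needs to be rebuilt; with the standard Caffe Makefile build that is typically just

make clean && make all -j4

before re-running the profiling binary.)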

kaishijeng commented 7 years ago

Below is my profiling result, which is very similar to what you get:

STATS for 10 reptitions: ...
Total time: 607204 per forward
Each layer stats: ...
23: used time: 4400 ratio: 0.724633 enter count: 1
22: used time: 4800 ratio: 0.790509 enter count: 1
21: used time: 0 ratio: 0 enter count: 1
20: used time: 400 ratio: 0.0658757 enter count: 1
19: used time: 18400 ratio: 3.03028 enter count: 1
18: used time: 0 ratio: 0 enter count: 1
17: used time: 800 ratio: 0.131751 enter count: 1
16: used time: 53200 ratio: 8.76154 enter count: 1
15: used time: 114400 ratio: 18.8406 enter count: 1
14: used time: 2000 ratio: 0.329379 enter count: 1
13: used time: 13600 ratio: 2.23979 enter count: 1
12: used time: 2800 ratio: 0.461147 enter count: 1
11: used time: 16800 ratio: 2.7668 enter count: 1
10: used time: 1200 ratio: 0.197644 enter count: 1
9: used time: 46400 ratio: 7.64165 enter count: 1
8: used time: 34400 ratio: 5.66533 enter count: 1
7: used time: 126800 ratio: 20.8828 enter count: 1
6: used time: 3600 ratio: 0.592882 enter count: 1
5: used time: 55600 ratio: 9.15677 enter count: 1
4: used time: 15200 ratio: 2.50329 enter count: 1
3: used time: 53600 ratio: 8.82741 enter count: 1
2: used time: 4800 ratio: 0.790509 enter count: 1
1: used time: 34000 ratio: 5.59949 enter count: 1
0: used time: 0 ratio: 0 enter count: 1

time cost top 10 layers are: ...
7: used time: 126800 ratio: 20.8828 enter count: 1
15: used time: 114400 ratio: 18.8406 enter count: 1
5: used time: 55600 ratio: 9.15677 enter count: 1
3: used time: 53600 ratio: 8.82741 enter count: 1
16: used time: 53200 ratio: 8.76154 enter count: 1
9: used time: 46400 ratio: 7.64165 enter count: 1
8: used time: 34400 ratio: 5.66533 enter count: 1
1: used time: 34000 ratio: 5.59949 enter count: 1
19: used time: 18400 ratio: 3.03028 enter count: 1
11: used time: 16800 ratio: 2.7668 enter count: 1
Top cost layers occupied: 91.1726

kaishijeng commented 7 years ago

STATS for 10 reptitions: ... Total time: 607204 per forward

Does it mean time per forward is 607msec?

Thanks,

honggui commented 7 years ago

Hi Kaishijeng, yes, you are right. Regards, Honggui
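
(In other words, the profiler's totals are reported in microseconds: 607204 µs ≈ 0.607 s per forward pass, and likewise honggui's 624150 ≈ 0.624 s quoted earlier.)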

kaishijeng commented 7 years ago

Honggui

How do you profile with the original Caffe so that I can compare its performance with CaffeOnACL?

Thanks

austingg commented 7 years ago

@kaishijeng you may use

your/caffe/binary/caffe time -model alexnet.prototxt 

@kaishijeng @honggui By the way, did you test the performance on a desktop processor? Are there any statistics from mobile devices? Also, in the doc, ACL_NEON seems slower than official Caffe with OpenBLAS. Which devices were tested? There seems to be a long way to go if testing on a 32-bit platform, since 32-bit OpenBLAS doesn't use NEON to speed things up.
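
For the model used earlier in this thread that would be, for example (the path depends on how Caffe was built, and the -iterations flag is optional; stock caffe time defaults to 50 iterations):

./build/tools/caffe time -model models/bvlc_reference_caffenet/deploy.prototxt -iterations 50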

kaishijeng commented 7 years ago

Not sure the performance numbers from the following command are a fair comparison to CaffeOnACL:

I0707 06:58:46.069133 9441 caffe.cpp:417] Average Forward pass: 1580.33 ms.

firefly@firefly:~/2TB/src/caffe$ ./.build_release/tools/caffe time -model models/bvlc_reference_caffenet/deploy.prototxt

I0707 06:56:26.991888 9441 caffe.cpp:352] Use CPU. I0707 06:56:27.031245 9441 net.cpp:51] Initializing net from parameters: name: "CaffeNet" state { phase: TRAIN level: 0 stage: "" } layer { name: "data" type: "Input" top: "data" input_param { shape { dim: 10 dim: 3 dim: 227 dim: 227 } } } layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1" convolution_param { num_output: 96 kernel_size: 11 stride: 4 } } layer { name: "relu1" type: "ReLU" bottom: "conv1" top: "conv1" } layer { name: "pool1" type: "Pooling" bottom: "conv1" top: "pool1" pooling_param { pool: MAX kernel_size: 3 stride: 2 } } layer { name: "norm1" type: "LRN" bottom: "pool1" top: "norm1" lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } } layer { name: "conv2" type: "Convolution" bottom: "norm1" top: "conv2" convolution_param { num_output: 256 pad: 2 kernel_size: 5 group: 2 } } layer { name: "relu2" type: "ReLU" bottom: "conv2" top: "conv2" } layer { name: "pool2" type: "Pooling" bottom: "conv2" top: "pool2" pooling_param { pool: MAX kernel_size: 3 stride: 2 } } layer { name: "norm2" type: "LRN" bottom: "pool2" top: "norm2" lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } } layer { name: "conv3" type: "Convolution" bottom: "norm2" top: "conv3" convolution_param { num_output: 384 pad: 1 kernel_size: 3 } } layer { name: "relu3" type: "ReLU" bottom: "conv3" top: "conv3" } layer { name: "conv4" type: "Convolution" bottom: "conv3" top: "conv4" convolution_param { num_output: 384 pad: 1 kernel_size: 3 group: 2 } } layer { name: "relu4" type: "ReLU" bottom: "conv4" top: "conv4" } layer { name: "conv5" type: "Convolution" bottom: "conv4" top: "conv5" convolution_param { num_output: 256 pad: 1 kernel_size: 3 group: 2 } } layer { name: "relu5" type: "ReLU" bottom: "conv5" top: "conv5" } layer { name: "pool5" type: "Pooling" bottom: "conv5" top: "pool5" pooling_param { pool: MAX kernel_size: 3 stride: 2 } } layer { name: "fc6" type: "InnerProduct" bottom: "pool5" top: "fc6" inner_product_param { num_output: 4096 } } layer { name: "relu6" type: "ReLU" bottom: "fc6" top: "fc6" } layer { name: "drop6" type: "Dropout" bottom: "fc6" top: "fc6" dropout_param { dropout_ratio: 0.5 } } layer { name: "fc7" type: "InnerProduct" bottom: "fc6" top: "fc7" inner_product_param { num_output: 4096 } } layer { name: "relu7" type: "ReLU" bottom: "fc7" top: "fc7" } layer { name: "drop7" type: "Dropout" bottom: "fc7" top: "fc7" dropout_param { dropout_ratio: 0.5 } } layer { name: "fc8" type: "InnerProduct" bottom: "fc7" top: "fc8" inner_product_param { num_output: 1000 } } layer { name: "prob" type: "Softmax" bottom: "fc8" top: "prob" }

I0707 06:56:27.254829 9441 caffe.cpp:360] Performing Forward I0707 06:56:28.812577 9441 caffe.cpp:365] Initial loss: 0 I0707 06:56:28.812671 9441 caffe.cpp:366] Performing Backward I0707 06:56:28.812688 9441 caffe.cpp:374] Benchmark begins I0707 06:56:28.812697 9441 caffe.cpp:375] Testing for 50 iterations. I0707 06:56:31.571130 9441 caffe.cpp:403] Iteration: 1 forward-backward time: 2758 ms. I0707 06:56:34.219300 9441 caffe.cpp:403] Iteration: 2 forward-backward time: 2647 ms. I0707 06:56:36.851164 9441 caffe.cpp:403] Iteration: 3 forward-backward time: 2631 ms. I0707 06:56:39.500258 9441 caffe.cpp:403] Iteration: 4 forward-backward time: 2648 ms. I0707 06:56:42.151398 9441 caffe.cpp:403] Iteration: 5 forward-backward time: 2650 ms. I0707 06:56:44.799932 9441 caffe.cpp:403] Iteration: 6 forward-backward time: 2648 ms. I0707 06:56:47.448256 9441 caffe.cpp:403] Iteration: 7 forward-backward time: 2648 ms. I0707 06:56:50.095988 9441 caffe.cpp:403] Iteration: 8 forward-backward time: 2647 ms. I0707 06:56:52.744285 9441 caffe.cpp:403] Iteration: 9 forward-backward time: 2648 ms. I0707 06:56:55.396378 9441 caffe.cpp:403] Iteration: 10 forward-backward time: 2651 ms. I0707 06:56:58.047657 9441 caffe.cpp:403] Iteration: 11 forward-backward time: 2651 ms. I0707 06:57:00.724208 9441 caffe.cpp:403] Iteration: 12 forward-backward time: 2676 ms. I0707 06:57:03.415966 9441 caffe.cpp:403] Iteration: 13 forward-backward time: 2691 ms. I0707 06:57:06.115960 9441 caffe.cpp:403] Iteration: 14 forward-backward time: 2699 ms. I0707 06:57:08.835702 9441 caffe.cpp:403] Iteration: 15 forward-backward time: 2719 ms. I0707 06:57:11.555269 9441 caffe.cpp:403] Iteration: 16 forward-backward time: 2719 ms. I0707 06:57:14.274786 9441 caffe.cpp:403] Iteration: 17 forward-backward time: 2719 ms. I0707 06:57:17.010529 9441 caffe.cpp:403] Iteration: 18 forward-backward time: 2735 ms. I0707 06:57:19.747344 9441 caffe.cpp:403] Iteration: 19 forward-backward time: 2736 ms. I0707 06:57:22.479828 9441 caffe.cpp:403] Iteration: 20 forward-backward time: 2732 ms. I0707 06:57:25.228466 9441 caffe.cpp:403] Iteration: 21 forward-backward time: 2748 ms. I0707 06:57:27.979506 9441 caffe.cpp:403] Iteration: 22 forward-backward time: 2750 ms. I0707 06:57:30.732939 9441 caffe.cpp:403] Iteration: 23 forward-backward time: 2753 ms. I0707 06:57:33.488718 9441 caffe.cpp:403] Iteration: 24 forward-backward time: 2755 ms. I0707 06:57:36.250659 9441 caffe.cpp:403] Iteration: 25 forward-backward time: 2761 ms. I0707 06:57:38.991574 9441 caffe.cpp:403] Iteration: 26 forward-backward time: 2740 ms. I0707 06:57:41.754909 9441 caffe.cpp:403] Iteration: 27 forward-backward time: 2763 ms. I0707 06:57:44.510370 9441 caffe.cpp:403] Iteration: 28 forward-backward time: 2755 ms. I0707 06:57:47.282030 9441 caffe.cpp:403] Iteration: 29 forward-backward time: 2771 ms. I0707 06:57:50.053514 9441 caffe.cpp:403] Iteration: 30 forward-backward time: 2771 ms. I0707 06:57:53.114980 9441 caffe.cpp:403] Iteration: 31 forward-backward time: 3061 ms. I0707 06:57:56.100261 9441 caffe.cpp:403] Iteration: 32 forward-backward time: 2985 ms. I0707 06:57:58.875066 9441 caffe.cpp:403] Iteration: 33 forward-backward time: 2774 ms. I0707 06:58:01.651820 9441 caffe.cpp:403] Iteration: 34 forward-backward time: 2776 ms. I0707 06:58:04.404618 9441 caffe.cpp:403] Iteration: 35 forward-backward time: 2752 ms. I0707 06:58:07.187002 9441 caffe.cpp:403] Iteration: 36 forward-backward time: 2782 ms. 
I0707 06:58:09.971091 9441 caffe.cpp:403] Iteration: 37 forward-backward time: 2783 ms. I0707 06:58:12.750619 9441 caffe.cpp:403] Iteration: 38 forward-backward time: 2779 ms. I0707 06:58:15.513088 9441 caffe.cpp:403] Iteration: 39 forward-backward time: 2762 ms. I0707 06:58:18.293782 9441 caffe.cpp:403] Iteration: 40 forward-backward time: 2780 ms. I0707 06:58:21.070822 9441 caffe.cpp:403] Iteration: 41 forward-backward time: 2776 ms. I0707 06:58:23.830873 9441 caffe.cpp:403] Iteration: 42 forward-backward time: 2759 ms. I0707 06:58:26.594636 9441 caffe.cpp:403] Iteration: 43 forward-backward time: 2763 ms. I0707 06:58:29.376324 9441 caffe.cpp:403] Iteration: 44 forward-backward time: 2781 ms. I0707 06:58:32.151278 9441 caffe.cpp:403] Iteration: 45 forward-backward time: 2774 ms. I0707 06:58:34.932479 9441 caffe.cpp:403] Iteration: 46 forward-backward time: 2780 ms. I0707 06:58:37.702002 9441 caffe.cpp:403] Iteration: 47 forward-backward time: 2769 ms. I0707 06:58:40.484354 9441 caffe.cpp:403] Iteration: 48 forward-backward time: 2782 ms. I0707 06:58:43.274502 9441 caffe.cpp:403] Iteration: 49 forward-backward time: 2789 ms. I0707 06:58:46.065948 9441 caffe.cpp:403] Iteration: 50 forward-backward time: 2791 ms. I0707 06:58:46.066244 9441 caffe.cpp:406] Average time per layer: I0707 06:58:46.066313 9441 caffe.cpp:409] data forward: 0.00226 ms. I0707 06:58:46.066375 9441 caffe.cpp:412] data backward: 0.0033 ms. I0707 06:58:46.066432 9441 caffe.cpp:409] conv1 forward: 151.357 ms. I0707 06:58:46.066490 9441 caffe.cpp:412] conv1 backward: 134.551 ms. I0707 06:58:46.066547 9441 caffe.cpp:409] relu1 forward: 7.30002 ms. I0707 06:58:46.066602 9441 caffe.cpp:412] relu1 backward: 0.00226 ms. I0707 06:58:46.066658 9441 caffe.cpp:409] pool1 forward: 36.679 ms. I0707 06:58:46.066712 9441 caffe.cpp:412] pool1 backward: 0.0037 ms. I0707 06:58:46.066767 9441 caffe.cpp:409] norm1 forward: 67.7754 ms. I0707 06:58:46.066823 9441 caffe.cpp:412] norm1 backward: 69.7601 ms. I0707 06:58:46.066876 9441 caffe.cpp:409] conv2 forward: 354.68 ms. I0707 06:58:46.066968 9441 caffe.cpp:412] conv2 backward: 339.333 ms. I0707 06:58:46.067028 9441 caffe.cpp:409] relu2 forward: 4.3349 ms. I0707 06:58:46.067081 9441 caffe.cpp:412] relu2 backward: 0.00196 ms. I0707 06:58:46.067137 9441 caffe.cpp:409] pool2 forward: 23.469 ms. I0707 06:58:46.067190 9441 caffe.cpp:412] pool2 backward: 0.00356 ms. I0707 06:58:46.067245 9441 caffe.cpp:409] norm2 forward: 44.1165 ms. I0707 06:58:46.067299 9441 caffe.cpp:412] norm2 backward: 45.2355 ms. I0707 06:58:46.067378 9441 caffe.cpp:409] conv3 forward: 182.216 ms. I0707 06:58:46.067433 9441 caffe.cpp:412] conv3 backward: 146.802 ms. I0707 06:58:46.067489 9441 caffe.cpp:409] relu3 forward: 1.48994 ms. I0707 06:58:46.067543 9441 caffe.cpp:412] relu3 backward: 0.0036 ms. I0707 06:58:46.067597 9441 caffe.cpp:409] conv4 forward: 145.296 ms. I0707 06:58:46.067652 9441 caffe.cpp:412] conv4 backward: 121.937 ms. I0707 06:58:46.067708 9441 caffe.cpp:409] relu4 forward: 1.4964 ms. I0707 06:58:46.067761 9441 caffe.cpp:412] relu4 backward: 0.00316 ms. I0707 06:58:46.067816 9441 caffe.cpp:409] conv5 forward: 122.753 ms. I0707 06:58:46.067870 9441 caffe.cpp:412] conv5 backward: 111.253 ms. I0707 06:58:46.067925 9441 caffe.cpp:409] relu5 forward: 0.9969 ms. I0707 06:58:46.067980 9441 caffe.cpp:412] relu5 backward: 0.00196 ms. I0707 06:58:46.068033 9441 caffe.cpp:409] pool5 forward: 6.49218 ms. I0707 06:58:46.068087 9441 caffe.cpp:412] pool5 backward: 0.00324 ms. 
I0707 06:58:46.068141 9441 caffe.cpp:409] fc6 forward: 256.357 ms. I0707 06:58:46.068197 9441 caffe.cpp:412] fc6 backward: 117.352 ms. I0707 06:58:46.068250 9441 caffe.cpp:409] relu6 forward: 0.10042 ms. I0707 06:58:46.068305 9441 caffe.cpp:412] relu6 backward: 0.00174 ms. I0707 06:58:46.068358 9441 caffe.cpp:409] drop6 forward: 0.42372 ms. I0707 06:58:46.068413 9441 caffe.cpp:412] drop6 backward: 0.00324 ms. I0707 06:58:46.068469 9441 caffe.cpp:409] fc7 forward: 136.134 ms. I0707 06:58:46.068522 9441 caffe.cpp:412] fc7 backward: 57.1792 ms. I0707 06:58:46.068577 9441 caffe.cpp:409] relu7 forward: 0.09016 ms. I0707 06:58:46.068631 9441 caffe.cpp:412] relu7 backward: 0.00196 ms. I0707 06:58:46.068686 9441 caffe.cpp:409] drop7 forward: 0.37678 ms. I0707 06:58:46.068739 9441 caffe.cpp:412] drop7 backward: 0.0037 ms. I0707 06:58:46.068794 9441 caffe.cpp:409] fc8 forward: 35.62 ms. I0707 06:58:46.068850 9441 caffe.cpp:412] fc8 backward: 20.6572 ms. I0707 06:58:46.068903 9441 caffe.cpp:409] prob forward: 0.48392 ms. I0707 06:58:46.068958 9441 caffe.cpp:412] prob backward: 0.13076 ms. I0707 06:58:46.069133 9441 caffe.cpp:417] Average Forward pass: 1580.33 ms. I0707 06:58:46.069190 9441 caffe.cpp:419] Average Backward pass: 1164.42 ms. I0707 06:58:46.069244 9441 caffe.cpp:421] Average Forward-Backward: 2745.12 ms. I0707 06:58:46.069344 9441 caffe.cpp:423] Total Time: 137256 ms. I0707 06:58:46.069401 9441 caffe.cpp:424] Benchmark ends

kaishijeng commented 7 years ago

If the above numbers are a fair comparison, then ACL gives about a 2.5x speedup over pure-CPU Caffe on the Firefly platform.
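
(That figure comes from 1580.33 ms / 624.15 ms ≈ 2.5 per forward pass. One caveat: the caffe time run forwards the deploy file's batch of 10 images, while the CaffeOnACL profiling above forwards a single image, so the two averages are not directly comparable per image.)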

I saw there is a MxnetonACL on GitHub. Not sure whether there is a plan for a TensorflowOnACL, because I use the TensorFlow framework for most of my ML projects.

xhbdahai commented 7 years ago

Hi Kaishijeng: there is no clear plan for TensorFlow so far.

kaishijeng commented 7 years ago

Thanks for an update.

psyhtest commented 7 years ago

@xhbdahai, @honggui, @openailab-sh You will invariably end up with more questions about benchmarking Caffe-on-ACL against Caffe (or indeed other frameworks). Have you considered using / contributing to CK-Caffe? It's part of a growing suite of AI benchmarking tools based on Collective Knowledge, also including e.g. CK-Caffe2, CK-TensorFlow, CK-TensorRT, CK-KaNN.

For example, we have released benchmarking data for the Firefly-RK3399 platform that @kaishijeng uses.

For example, for the batch size of 2 (the smallest we have measured) on AlexNet (the closest to CaffeNet we have measured), we have obtained the following data for forward propagation (inference):

(I can easily benchmark CaffeNet with the batch size of 1 if you are interested.)

Would you be interested in collaborating on adding Caffe-on-ACL to CK-Caffe?

psyhtest commented 7 years ago

As an added bonus, we already support an ACL package and crowd-benchmarking across mobile devices.

OAIL commented 7 years ago

@psyhtest Adding caffeOnACL to CK-Caffe is a good idea. Will give you feedback after checking the effort.

psyhtest commented 7 years ago

@OAIL How is the effort looking to you? :)

baynaa7 commented 6 years ago

Hello @honggui, I am testing CaffeACL vs Caffe on a TX2 board. However, the classification example on AlexNet gives the following result (arguments are exactly the same as kaishijeng's):

caffeACL: elapsed time: [2.28925] seconds
caffe: elapsed time: [1.2105] seconds

Note: both are running the CPU version.

Any possible hypothesis for these results? Thanks in advance.

honggui commented 6 years ago

Hi pcub, performance varies per layer: some layers may be faster with ACL and others with OpenBLAS. Referring to https://github.com/OAID/Caffe-HRT/blob/master/acl_openailab/user_manual.pdf, you can see the proper library for each operator. BTW, OpenBLAS's threads seem to interfere with ACL's threads quite a bit. Sometimes we can use "export OPENBLAS_NUM_THREADS=1" to reduce that side effect. Best Regards, Honggui
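
For example (reusing the classification command from earlier in the thread), set the variable in the same shell before launching the binary:

export OPENBLAS_NUM_THREADS=1
./build/examples/cpp_classification/classification.bin models/bvlc_reference_caffenet/deploy.prototxt models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel data/ilsvrc12/imagenet_mean.binaryproto data/ilsvrc12/synset_words.txt examples/images/cat.jpg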

baynaa7 commented 6 years ago

Thanks @honggui export OPENBLAS_NUM_THREADS=1 works.

Steven9402 commented 6 years ago

Hi honggui, I want to test the performance of a face recognition application with multiple threads. Where can I add "CPPScheduler::set_num_threads(x)" to enable a multi-thread test for ACL? The code I use is OAID/FaceRecognition/bin/face-recognition.cpp. Also, I want to know if there is any interface to modify the number of threads used. Thanks a lot!