AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

yolov3-tiny_xnor.cfg running on ARM #2382

Closed joaomiguelvieira closed 3 years ago

joaomiguelvieira commented 5 years ago

Hi @AlexeyAB,

I am trying to run yolov3-tiny_xnor.cfg for detection in a raspberry pi. I have trained the network, tested it on an Intel-based system and it just works fine. However, when I run it on the RPi, nothing is detected! I am using the very same command and the very same version of the framework on both sides. Can you help me figure out what is going on?

I am using the command ./darknet detector test data/coco.data cfg/yolov3-tiny_xnor.cfg yolov3-tiny_xnor_last.weights data/person.jpg

The content of coco.data is

classes = 80
names   = data/coco/coco.names
backup  = backup/

The content of yolov3-tiny_xnor.cfg

[net]
# Testing
batch=1
subdivisions=1
# Training
# batch=64
# subdivisions=2
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

###########

[convolutional]
xnor=1
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
mask = 3,4,5
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes=80
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
xnor=1
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 8

[convolutional]
xnor=1
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
mask = 0,1,2
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes=80
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

The .weights file can be found here: http://www.mediafire.com/file/vahpux9xefw1tci/yolov3-tiny_xnor_122000.weights

Finally, the person.jpg image is the one already present in the data folder.

AlexeyAB commented 5 years ago

@joaomiguelvieira Hi,

joaomiguelvieira commented 5 years ago

Hi @AlexeyAB,

I didn’t set any parameters in the Makefile; all accelerations are disabled.

The command I am using is simply ./darknet detector test coco.data yolov3-tiny_xnor.cfg yolov3-tiny_xnor_122000.weights data/person.jpg

Every other model I tried on the RPi works just fine. The XNOR version of yolov3 is the one that doesn’t work.

I will give you feedback about commenting the line in a moment.

joaomiguelvieira commented 5 years ago

Hi @AlexeyAB,

Sorry for the late answer. After commenting out the line, the process does not seem to finish; it has been running for a long time now. Either way, I will let it run to see whether it finishes.

AlexeyAB commented 5 years ago

@joaomiguelvieira Hi,

If the detection process does not end, try commenting out this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1302

Also try setting OPENMP=1 in the Makefile; it will run faster.

I haven't tested this on the RPi, so I don't know whether it works there.

joaomiguelvieira commented 5 years ago

Hi @AlexeyAB,

It did finish running after all.

If I comment out this line, it works just fine and detects the objects. Should I run it with this line commented out, then? https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1246

AlexeyAB commented 5 years ago

Should I run it with this line commented, then?

No. That disables XNOR acceleration, so it runs the XNOR layers in floats; it is only meant to confirm that everything else in the code is OK.

Do you use 32-bit or 64-bit OS on RPi?

joaomiguelvieira commented 5 years ago

Hi again @AlexeyAB,

I tried both (an RPi 2, armv7, and an RPi 3, armv8). Neither works. The RPi 3 runs a 64-bit OS and the RPi 2 a 32-bit one.

AlexeyAB commented 5 years ago

@joaomiguelvieira

Un-comment this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1246

Try to change this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2098 to these:

int tmp_count = 0; // popcnt_64(c_bit64);
int z;
for (z = 0; z < 64; ++z) {
    tmp_count += (c_bit64 & 1);
    c_bit64 = c_bit64 >> 1;
}

Does it work for XNOR?


Also, does lscpu show your RPi 3 as little endian or big endian?

The ARM architecture can run in both little- and big-endian modes, but the Raspberry Pi normally uses little-endian mode, the same as x86_64, so that shouldn't be the problem.
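As a sanity check outside darknet, the bit-counting loop suggested above can be compared against the compiler builtin in isolation; a minimal sketch (function names here are illustrative, not darknet's):

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits one at a time, exactly as the suggested replacement loop does. */
static int popcnt64_loop(uint64_t v) {
    int count = 0;
    for (int z = 0; z < 64; ++z) {
        count += (int)(v & 1u);
        v >>= 1;
    }
    return count;
}

/* GCC/Clang builtin; on a correct toolchain this must agree with the loop. */
static int popcnt64_builtin(uint64_t v) {
    return __builtin_popcountll(v);
}
```

If the two functions disagree on some inputs, the problem is in the optimized popcount path rather than in the surrounding XNOR code.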

joaomiguelvieira commented 5 years ago

Hi @AlexeyAB,

That solved the problem. So it seems that popcnt_64(c_bit64); is not doing what it should.

However, I am guessing that this pop function should be much faster than the for loop, am I right?

AlexeyAB commented 5 years ago

@joaomiguelvieira

That solved the problem. So it seems that popcnt_64(c_bit64); is not doing what it should.

However, I am guessing that this pop function should be much faster than the for loop, am I right?

Yes. So we just localized the problem.


It looks like a bug.

Roll back all previous changes.

And try to change this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2072 to this: tmp_count += __builtin_popcount(val64 >> 32);

Do make and try to detect.


If it helps, then make one more change:

joaomiguelvieira commented 5 years ago

@AlexeyAB, this works for ARM 32-bit. I will update you on 64-bit ARM in a moment.

AlexeyAB commented 5 years ago

@joaomiguelvieira On 64-bit ARM you can try the same changes as before, and temporarily comment out this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2072

to make sure the code uses https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2069 instead of https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2071-L2072
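For 32-bit targets, the eventual fix can avoid a 64-bit popcount entirely by counting each 32-bit half separately. A hedged sketch of that shape (the helper name popcnt_64 matches darknet's; the body here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Portable 64-bit popcount built from two 32-bit popcounts, so that 32-bit
   ARM (armv7) never needs a 64-bit popcount operation. */
static inline int popcnt_64(uint64_t val64) {
    return __builtin_popcount((uint32_t)(val64 & 0xFFFFFFFFu))
         + __builtin_popcount((uint32_t)(val64 >> 32));
}
```

This keeps a single code path correct on both 32-bit and 64-bit builds, at the cost of two builtin calls instead of one.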

joaomiguelvieira commented 5 years ago

Hi @AlexeyAB,

This also works for 64-bit.

AlexeyAB commented 5 years ago

@joaomiguelvieira Hi,

Thanks for your tests! I added this fix: https://github.com/AlexeyAB/darknet/commit/449fcfed7547a9203a7f44afd37835d373268201

spinoza1791 commented 5 years ago

What are your image and video response times on the Pi 3 with this working configuration? I assume it was compiled with OpenCV 4.x?

joaomiguelvieira commented 5 years ago

Hi @spinoza1791,

My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems. Therefore, I did not measure response times. Nevertheless, should you be interested in measuring them, I can give you all the support files to do so (I cannot do it myself since I no longer have access to the hardware).

AlexeyAB commented 5 years ago

@joaomiguelvieira

My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems.

Do you plan to implement XNORnet using ARM SIMD?

joaomiguelvieira commented 5 years ago

Hi @AlexeyAB,

I plan to do something a bit different from that. I plan to modify the ARM architecture (using gem5) to include a binary dot-product unit. That will accelerate the binary convolution significantly.

spinoza1791 commented 5 years ago

Hi @spinoza1791,

My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems. Therefore, I did not measure response times. Nevertheless, should you be interested in measuring them, I can give you all the support files to do so (I cannot do it myself since I no longer have access to the hardware).

Yes, any support files you have would be great! andrew.craton@gmail.com

joaomiguelvieira commented 5 years ago

Hi @spinoza1791,

You can find the .weights file I used (trained myself) for coco dataset at https://www.mediafire.com/file/dm45vmfedz6e73z/yolov3-tiny_xnor_last.weights/file

The configuration file that you should use is located in cfg/yolov3-tiny_xnor.cfg. You should, however, edit it before you start detecting (set batch=64 and subdivisions=1).

The coco.data file is located under cfg/coco.data. Edit this file and put your own directories.

The coco.names is located under cfg/coco.names.

You can use the images at data/ to detect objects.

Should you need anything else, let me know.

AlexeyAB commented 5 years ago

@joaomiguelvieira

I plan to do something a bit different from that. I plan to modify the ARM architecture (using gem5) to include a binary dot-product unit. That will accelerate the binary convolution significantly.

Do you want to implement just fused XOR+POPCNT as SIMD instructions for the ARM architecture? Or do you want to implement a full SIMD GEMM, like WMMA-GEMM on nVidia GPU Tensor Cores (wmma::bmma_sync(), which does a dot product on 8x8x128-bit matrices)?

joaomiguelvieira commented 5 years ago

@AlexeyAB,

I implemented XOR+POPCNT as a single instruction for the ARMv8 architecture. However, it goes a little further than that: some filters are stored in the execution path of the CPU, in a special memory. When those filters are used, the program does not have to load them from main memory. This can accelerate execution by on the order of 20%. Furthermore, memory accesses are reduced, so this solution also has benefits in terms of energy efficiency.

This is a research project, so I would not expect such architecture to become available soon. Nevertheless, it is an interesting topic.
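For context, the fused XOR+POPCNT operation being discussed computes the core of a binary dot product: with +1/-1 values packed as bits, XNOR marks the positions where two vectors agree, and a popcount converts that into the integer dot product. A minimal sketch (illustrative only; not darknet's actual kernel or the hardware unit described above):

```c
#include <assert.h>
#include <stdint.h>

/* Dot product of two binarized vectors packed 64 bits per word.
   Bit 1 encodes +1, bit 0 encodes -1. XNOR (~(a ^ b)) sets a bit where the
   signs agree, so:
       dot = (+1 per agreement) + (-1 per disagreement)
           = 2 * matches - total_bits */
static int binary_dot(const uint64_t *a, const uint64_t *b, int n_words) {
    int matches = 0;
    for (int i = 0; i < n_words; ++i)
        matches += __builtin_popcountll(~(a[i] ^ b[i]));
    return 2 * matches - 64 * n_words;
}
```

A hardware unit that fuses the XNOR and the popcount collapses this inner loop body into a single operation, which is where the speedup comes from.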

spinoza1791 commented 5 years ago

Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile)

Loading weights from yolov3-tiny_xnor_last.weights... seen 64 Done! data/person.jpg: Predicted in 1778.256000 milli-seconds. person: 37%

VS.

Loading weights from yolov3-tiny.weights... seen 64 Done! data/person.jpg: Predicted in 5112.976000 milli-seconds. dog: 89% dog: 82% person: 98% sheep: 83%

If we combined the xnor work here with the NNPack work of https://github.com/shizukachan/darknet-nnpack (it is currently not updated with working xnor functions), we would have close-to-real-time FPS on the Pi 3. For example, I can get 1.1 FPS (on the Pi 3) with yolov3-tiny.weights using nnpack (better optimization). If the nnpack were made compatible with xnor, we could potentially see yolov3-tiny_xnor running at ~3-4 FPS!

AlexeyAB commented 5 years ago

@spinoza1791

Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile)

Loading weights from yolov3-tiny_xnor_last.weights... data/person.jpg: Predicted in 1778.256000 milli-seconds. ... Loading weights from yolov3-tiny.weights... data/person.jpg: Predicted in 5112.976000 milli-seconds.

Did you optimize this repo https://github.com/AlexeyAB/darknet by using ARM_NEON?

spinoza1791 commented 5 years ago

@spinoza1791

Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile) Loading weights from yolov3-tiny_xnor_last.weights... data/person.jpg: Predicted in 1778.256000 milli-seconds. ... Loading weights from yolov3-tiny.weights... data/person.jpg: Predicted in 5112.976000 milli-seconds.

Did you optimize this repo https://github.com/AlexeyAB/darknet by using ARM_NEON?

Yes, I used the ARM_NEON opt in my test above. ARM_NEON + OpenMP in my test above is still much slower than ARM_NEON + NNPack (1.1 FPS on yolov3_tiny) on the Pi 3. The NNPack multithreading is far better optimized for the Pi than OpenMP. This is why we should add an NNPack opt to this repo, so that we can combine Xnor + ARM_NEON + NNPack for best results on the Pi.

AlexeyAB commented 5 years ago

@spinoza1791

Did you try Yolo v3 on Pi3 that is implemented inside OpenCV-dnn module? https://github.com/opencv/opencv/blob/8bde6aea4ba19454554aa008922d967b552e79cc/samples/dnn/object_detection.cpp#L192-L222

What time and FPS can you get?

spinoza1791 commented 5 years ago

@spinoza1791

Did you try Yolo v3 on Pi3 that is implemented inside OpenCV-dnn module? https://github.com/opencv/opencv/blob/8bde6aea4ba19454554aa008922d967b552e79cc/samples/dnn/object_detection.cpp#L192-L222

What time and FPS can you get?

Here are my benchmarks for the Pi 3B+, all using OpenCV with NEON/VFPv4/TBB support. Image detector test results, in seconds, are best-of-five trials using "person.jpg":

yolov2-tiny - ocv3.4.0 - DNN module openmp 1.21s

yolov3-tiny - ocv4.0.1 - DNN module openmp 1.32s

yolov2-tiny - ocv4.0.1 - DNN module openmp 1.41s

yolov3-tiny - ocv3.4.0 - darknet(alexeyAB) openmp 8.46s openmp+neon 4.90s

yolov3-tiny_xnor - ocv3.4.0 - darknet(alexeyAB) openmp 1.95s openmp+neon 1.56s

yolov3-tiny - ocv3.4.0 - darknet-nnpack(shizukachan) openmp 9.66s openmp+neon 5.37s nnpack 0.81s nnpack+neon 0.75s (best result)

AlexeyAB commented 5 years ago

@spinoza1791 Maybe I will try to optimize XNOR for ARM CPU.

andeyeluguo commented 5 years ago

I used his weights and cfg file with the darknet-no-gpu version, but got this error: "Done! not used FMA & AVX2 used AVX error: is no non-optimized version".

My CPU supports the AVX instruction set.

AlexeyAB commented 5 years ago

@andeyeluguo

andeyeluguo commented 5 years ago
  1. I used an Intel Xeon E5 2697 v2.
  2. I generated for x64 & Release.
andeyeluguo commented 5 years ago

Maybe the reason is that your code supports AVX2 but not AVX, I think.

andeyeluguo commented 5 years ago

The code in gemm.c, in the function im2col_cpu_custom_bin(), shows that you only implemented the AVX2 version, maybe.

AlexeyAB commented 5 years ago

@andeyeluguo

Try to change this line: https://github.com/AlexeyAB/darknet/blob/cce34712f6928495f1fbc5d69332162fc23491b9/src/gemm.c#L515 to this #if (defined(__AVX__) && defined(__x86_64__)) || defined(_WIN64_DISABLED)
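The #if being edited is a compile-time gate: the AVX kernel is built only when the compiler actually provides AVX, and a plain C path is used otherwise. A small sketch of the same pattern (the function and its body are illustrative, not gemm.c's actual code):

```c
#include <assert.h>
#include <stddef.h>

/* Compile-time dispatch between an AVX kernel and a portable fallback,
   mirroring the guard used in gemm.c. */
#if defined(__AVX__) && defined(__x86_64__)
#include <immintrin.h>
static float sum_floats(const float *x, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; ++k) s += lanes[k];
    for (; i < n; ++i) s += x[i];   /* scalar tail */
    return s;
}
#else
/* Non-optimized version: always available, so a build on a non-AVX target
   never fails for lack of a fallback. */
static float sum_floats(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += x[i];
    return s;
}
#endif
```

Widening the guard's condition (as suggested above) changes which of the two branches a given toolchain compiles; it does not change the result, only the speed.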

andeyeluguo commented 5 years ago

It can process an image, but it says it was not compiled with OpenCV.

When I try to run it on a video, darknet_no_gpu.exe detector demo data/coco.data joao_xnor/yolov3-tiny_xnor.cfg joao_xnor/yolov3-tiny_xnor_122000.weights -i 1 ./video/room.mp4, it says 'Demo needs OpenCV for webcam images'.

AlexeyAB commented 5 years ago

@andeyeluguo Yes, you should compile Darknet with OpenCV to process video from files/cameras: https://github.com/AlexeyAB/darknet#requirements

AlexeyAB commented 5 years ago

@joaomiguelvieira Hi,

I plan to do something a bit different from that. I plan to modify the ARM architecture (using gem5) to include a binary dot-product unit. That will accelerate the binary convolution significantly.

Do you have any interesting results?

joaomiguelvieira commented 5 years ago

Hi, @AlexeyAB,

Indeed I have some interesting results. I was able to improve performance by 10% and energy efficiency by 8% on the Cortex-A53. I will be able to share the artifact very soon, as this work is in the process of being published.

AlexeyAB commented 5 years ago

@joaomiguelvieira Thanks, that will be interesting. Can you share the C source code of the XNOR_GEMM that you tested on your modified ARM architecture?

joaomiguelvieira commented 5 years ago

Hi @AlexeyAB,

Certainly. Please find attached the files that I modified: darknet_mod.zip

As soon as I get permission, I will also send you the paper so you can see in detail what I changed.

AlexeyAB commented 5 years ago

@joaomiguelvieira Thanks!

EhsanVahab commented 4 years ago

Maybe I will try to optimize XNOR for ARM CPU.

Has this optimization been added to darknet?

joaomiguelvieira commented 4 years ago

@AlexeyAB, as promised, here’s the paper: https://web.tecnico.ulisboa.pt/~joaomiguelvieira/public/docs/papers/a_product_engine_for_energy-efficient_execution_of_binary_neural_networks_using_resistive_memories.pdf

EhsanVahab commented 4 years ago

Dear @joaomiguelvieira, thanks for your comment. I need some references on how to build a BCNN with darknet and run it on a Raspberry Pi. Would it be possible to guide me?

joaomiguelvieira commented 4 years ago

Hello, @EhsanVahab,

Darknet already has some BNNs ready to use out of the box; the corresponding configuration files have the suffix _xnor.cfg. To create a BNN from an existing CNN configuration file, add the line xnor=1 to all convolutional layers except the first one. For instance, compare the files tiny-yolo_xnor.cfg and tiny-yolo.cfg: the only difference is that in tiny-yolo_xnor.cfg each convolutional layer except the first starts with xnor=1.

After configuring your BNN, you will have to train it. There is no easy way to binarize the weights of a pre-trained CNN, so the BNN has to be trained from scratch.

If you want to run the flow and get started with darknet and BNNs before getting to training, I suggest you try the .weights file I supplied earlier in this thread: https://www.mediafire.com/file/dm45vmfedz6e73z/yolov3-tiny_xnor_last.weights/file. These weights refer to the Street View House Numbers dataset.

I hope this helps.

EhsanVahab commented 4 years ago

@joaomiguelvieira, thanks for your quick and complete response to my question. I will train my model and then share the results with you (accuracy and speed on the Raspberry Pi). Your comment is very helpful.

PiseyYou commented 4 years ago

@joaomiguelvieira @AlexeyAB Nice job. I was wondering where the XNOR weights come from; is there some way to get them? I am comparing tiny_v3_pan3.cfg against yolov3_tiny_3l.cfg. For yolov3_tiny_3l I can get an XNOR pretrained model from yolov3_tiny_xnor_last.weights, but how do I get a pretrained model for tiny_v3_pan3? Is there some way to get an XNOR pretrained model for it?

AlexeyAB commented 4 years ago

@PiseyYou

All v3_tiny... models use the same first several layers, so you can use the same partial pre-trained weights file for yolov3_tiny, yolov3_tiny_3l, tiny_v3_pan3, etc.

PiseyYou commented 4 years ago

@AlexeyAB Thanks, I will give it a try and report back later.

Yingxiu-Chang commented 4 years ago

@AlexeyAB Hi, I have trained yolov3-tiny_xnor and want to print the feature-map data of the middle layers during detection on one image. I don't know how to modify your code; could you help?