@joaomiguelvieira Hi,
What parameters in the Makefile do you use in both cases?
What command do you use?
Does any other yolo-model work fine on RPi?
Just to check, does yolov3-tiny_xnor_122000.weights
work on RPi if you comment this line? https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1246
Hi @AlexeyAB,
I don't set any parameters in the Makefile; all accelerations are disabled.
The command I am using is simply ./darknet detector coco.data yolov3-tiny_xnor.cfg yolov3-tiny_xnor_122000.weights data/person.jpg
Every other model I tried on the RPi works just fine. The XNOR version of yolov3 is the only one that does not work.
I will give you feedback about commenting the line in a moment.
Hi @AlexeyAB,
Sorry for the late answer. After commenting the line, the process does not seem to finish. It has been running for a long time now. Either way, I will let it run to see if it finishes or not.
@joaomiguelvieira Hi,
If the detection-process does not end, try to comment out this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1302
Also try to set OPENMP=1 in the Makefile; then it will work faster.
I didn't test it on the RPi, so I don't know whether it works fine there.
Hi @AlexeyAB,
It just finished running after all.
If I comment this line it works just fine and detects the objects. Should I run it with this line commented, then? https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1246
Should I run it with this line commented, then?
No. Commenting that line disables XNOR acceleration, so XNOR is computed in floats; it was just to make sure everything else in the code is OK.
Do you use 32-bit or 64-bit OS on RPi?
Hi again @AlexeyAB,
I tried both (an RPi 2, armv7, and an RPi 3, armv8). Neither works. On the RPi 3 the OS is 64-bit and on the RPi 2 the OS is 32-bit.
@joaomiguelvieira
Un-comment this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1246
Try to change this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2098 to these:
int tmp_count = 0; // popcnt_64(c_bit64);
int z;
for (z = 0; z < 64; ++z) {
    tmp_count += (c_bit64 & 1);
    c_bit64 = c_bit64 >> 1;
}
Does it work for XNOR?
Also, does lscpu show that your RPi 3 is little endian or big endian?
The ARM architecture can run in both little- and big-endian modes, but the Raspberry Pi usually uses little-endian mode, the same as x86_64, so it shouldn't be a problem.
Hi @AlexeyAB,
That solved the problem, so it seems that popcnt_64(c_bit64) is not doing what it should.
However, I am guessing that the popcount function should be much faster than the for loop, am I right?
@joaomiguelvieira
That solved the problem, so it seems that popcnt_64(c_bit64) is not doing what it should. However, I am guessing that the popcount function should be much faster than the for loop, am I right?
Yes. So we just localized the problem.
It looks like a bug.
Roll back all previous changes.
And try to change this line:
https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2072
to this:
tmp_count += __builtin_popcount(val64 >> 32);
Do make and try to detect.
If it helps, then make one more change:
#if defined(__x86_64__) || defined(__aarch64__)
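For reference, a minimal sketch (assuming GCC/Clang builtins; this is not darknet's exact popcnt_64) of the split-into-two-32-bit-halves popcount that the change above is aiming for:

#include <stdint.h>

/* Count set bits of a 64-bit word using two 32-bit __builtin_popcount calls. */
static inline int popcnt_64_portable(uint64_t val64)
{
    return __builtin_popcount((uint32_t)val64)           /* low 32 bits  */
         + __builtin_popcount((uint32_t)(val64 >> 32));  /* high 32 bits */
}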
@AlexeyAB, this works for 32-bit ARM. I will update you about 64-bit ARM in a moment.
@joaomiguelvieira On 64-bit ARM you can try the same changes as before, and temporarily comment this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2072
to make sure the code uses https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2069 instead of https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2071-L2072
Hi @AlexeyAB,
This also works for 64-bit ARM.
@joaomiguelvieira Hi,
Thanks for your tests! I added this fix: https://github.com/AlexeyAB/darknet/commit/449fcfed7547a9203a7f44afd37835d373268201
What are your image and video response times on the Pi 3 with this working configuration? I assume it was compiled with OpenCV 4.x?
Hi @spinoza1791,
My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems. Therefore, I did not measure response times. Nevertheless, should you be interested in measuring them, I can give you all the support files to do so (I cannot do it myself since I no longer have access to the hardware).
@joaomiguelvieira
My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems.
Do you plan to implement XNORnet using ARM SIMD?
Hi @AlexeyAB,
I plan to do something a bit different from that: I plan to modify the ARM architecture (using gem5) to include a binary dot-product unit. That will accelerate the binary convolution significantly.
Hi @spinoza1791,
My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems. Therefore, I did not measure response times. Nevertheless, should you be interested in measuring them, I can give you all the support files to do so (I cannot do it myself since I no longer have access to the hardware).
Yes, any support files you have would be great! andrew.craton@gmail.com
Hi @spinoza1791,
You can find the .weights file I used (trained myself) on the COCO dataset at https://www.mediafire.com/file/dm45vmfedz6e73z/yolov3-tiny_xnor_last.weights/file
The configuration file that you should use is cfg/yolov3-tiny_xnor.cfg. You should, however, edit it before you start detecting (set batch=64 and subdivisions=1).
The coco.data file is located under cfg/coco.data. Edit this file and put in your own directories; a rough sketch of its layout is shown below.
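For reference, a coco.data file follows roughly this layout (the paths below are placeholders, not my actual setup; substitute your own directories and image-list files):

classes = 80
train   = /path/to/coco/train.txt
valid   = /path/to/coco/val.txt
names   = cfg/coco.names
backup  = backup/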
The coco.names file is located under cfg/coco.names.
You can use the images at data/ to detect objects.
Should you need anything else, let me know.
@joaomiguelvieira
I plan to do something a bit different from that: I plan to modify the ARM architecture (using gem5) to include a binary dot-product unit. That will accelerate the binary convolution significantly.
Do you want to implement just a fused XOR+POPCNT as SIMD instructions for the ARM architecture?
Or do you want to implement a full SIMD GEMM, like WMMA-GEMM on Tensor Cores on NVIDIA GPUs (wmma::bmma_sync(), which does a dot product on 8x8x128-bit matrices)?
@AlexeyAB,
I implemented XOR+POPCNT as a single instruction for the ARMv8 architecture. However, it goes a little further than that: some filters are stored in the execution path of the CPU, in a special memory. When those filters are used, the program does not have to load them from main memory. This can accelerate execution on the order of 20%. Furthermore, memory accesses are reduced, so this solution also has some benefits in terms of energy efficiency.
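For illustration, here is a minimal sketch in plain C (using GCC/Clang builtins; it is not the actual darknet kernel nor the gem5 model) of the binary dot product that such a fused XOR/XNOR+POPCNT unit accelerates:

#include <stdint.h>

/* Binary dot product over n 64-bit words: each bit encodes a +1/-1 value.
 * XNOR marks matching bits, POPCNT counts them, and the count is mapped
 * back to a signed dot product (matches minus mismatches). */
static inline int binary_dot(const uint64_t *a, const uint64_t *w, int n)
{
    int matches = 0;
    int i;
    for (i = 0; i < n; ++i)
        matches += __builtin_popcountll(~(a[i] ^ w[i]));
    return 2 * matches - 64 * n;
}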
This is a research project, so I would not expect such architecture to become available soon. Nevertheless, it is an interesting topic.
Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile)
Loading weights from yolov3-tiny_xnor_last.weights... seen 64 Done! data/person.jpg: Predicted in 1778.256000 milli-seconds. person: 37%
vs.
Loading weights from yolov3-tiny.weights... seen 64 Done! data/person.jpg: Predicted in 5112.976000 milli-seconds. dog: 89% dog: 82% person: 98% sheep: 83%
If we combined the xnor work here with the NNPack work of https://github.com/shizukachan/darknet-nnpack (it is currently not updated with working xnor functions), we would have close to real-time FPS on the Pi 3. For example, I can get 1.1 FPS (on the Pi 3) with yolov3-tiny.weights using NNPack (better optimization). If NNPack were made compatible with xnor, we could potentially see yolov3-tiny_xnor running at ~3-4 FPS!
@spinoza1791
Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile)
Loading weights from yolov3-tiny_xnor_last.weights... data/person.jpg: Predicted in 1778.256000 milli-seconds. ... Loading weights from yolov3-tiny.weights... data/person.jpg: Predicted in 5112.976000 milli-seconds.
Did you optimize this repo https://github.com/AlexeyAB/darknet by using ARM_NEON?
@spinoza1791
Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile) Loading weights from yolov3-tiny_xnor_last.weights... data/person.jpg: Predicted in 1778.256000 milli-seconds. ... Loading weights from yolov3-tiny.weights... data/person.jpg: Predicted in 5112.976000 milli-seconds.
Did you optimize this repo https://github.com/AlexeyAB/darknet by using ARM_NEON?
Yes, I used the ARM_NEON opt in my test above. ARM_NEON + OpenMP in my test above is still much slower than ARM_NEON + NNPack (1.1 FPS on yolov3_tiny) on the Pi 3. The NNPack multithreading is far better optimized for the Pi than OpenMP. This is why we should add an NNPack opt to this repo, so that we can combine Xnor + ARM_NEON + NNPack for the best results on the Pi.
@spinoza1791
Did you try Yolo v3 on Pi3 that is implemented inside OpenCV-dnn module? https://github.com/opencv/opencv/blob/8bde6aea4ba19454554aa008922d967b552e79cc/samples/dnn/object_detection.cpp#L192-L222
What time and FPS can you get?
@spinoza1791
Did you try Yolo v3 on Pi3 that is implemented inside OpenCV-dnn module? https://github.com/opencv/opencv/blob/8bde6aea4ba19454554aa008922d967b552e79cc/samples/dnn/object_detection.cpp#L192-L222
What time and FPS can you get?
Here are my benchmarks for the Pi 3B+, all using OpenCV with NEON/FPV4/TBB support. Image detector test results (in seconds) are best-of-five trials, using "person.jpg":
yolov2-tiny - ocv3.4.0 - DNN module openmp 1.21s
yolov3-tiny - ocv4.0.1 - DNN module openmp 1.32s
yolov2-tiny - ocv4.0.1 - DNN module openmp 1.41s
yolov3-tiny - ocv3.4.0 - darknet(alexeyAB) openmp 8.46s openmp+neon 4.90s
yolov3-tiny_xnor - ocv3.4.0 - darknet(alexeyAB) openmp 1.95s openmp+neon 1.56s
yolov3-tiny - ocv3.4.0 - darknet-nnpack(shizukachan) openmp 9.66s openmp+neon 5.37s nnpack 0.81s nnpack+neon 0.75s (best result)
@spinoza1791 Maybe I will try to optimize XNOR for ARM CPU.
I used his weights and cfg file with the darknet-no-gpu version, but got an error: "Done! not used FMA & AVX2 used AVX error: is no non-optimized version"
My CPU supports AVX instructions.
@andeyeluguo
Maybe the reason is that your code supports AVX2 but not AVX, I think.
The code in gemm.c, in the function im2col_cpu_custom_bin(), shows that only the AVX2 version is implemented, maybe.
@andeyeluguo
Try to change this line: https://github.com/AlexeyAB/darknet/blob/cce34712f6928495f1fbc5d69332162fc23491b9/src/gemm.c#L515
to this
#if (defined(__AVX__) && defined(__x86_64__)) || defined(_WIN64_DISABLED)
It can process an image, but it says it was not compiled with OpenCV.
When I try to run it on a video with darknet_no_gpu.exe detector demo data/coco.data joao_xnor/yolov3-tiny_xnor.cfg joao_xnor/yolov3-tiny_xnor_122000.weights -i 1 ./video/room.mp4 it shows 'Demo needs OpenCV for webcam images'.
@andeyeluguo Yes, you should compile Darknet with OpenCV to process video from files/cameras: https://github.com/AlexeyAB/darknet#requirements
@joaomiguelvieira Hi,
I plan to do something a bit different from that: I plan to modify the ARM architecture (using gem5) to include a binary dot-product unit. That will accelerate the binary convolution significantly.
Do you have any interesting results?
Hi, @AlexeyAB,
Indeed I have some interesting results. I was able to improve performance by 10% and energy efficiency by 8% on the Cortex-A53. I will be able to share the artifact very soon, as this work is in the process of being published.
@joaomiguelvieira Thanks, it will be interesting. Can you share the source C code of the XNOR_GEMM that you tested on your modified ARM architecture?
Hi @AlexeyAB,
Certainly. Please find attached the files that I modified: darknet_mod.zip
As soon as I get permission, I will also send you the paper so you can see in detail what I changed.
@joaomiguelvieira Thanks!
Maybe I will try to optimize XNOR for ARM CPU.
Has this optimization been done in darknet?
Dear @joaomiguelvieira, thanks for your comment. I need some reference to learn how to build a BCNN with darknet and implement it on a Raspberry Pi. Would it be possible to guide me?
Hello, @EhsanVahab,
Darknet already has some BNNs ready to use out of the box. The corresponding configuration files have the suffix _xnor.cfg.
To create a BNN from an existing CNN configuration file, you should add the line xnor=1 to all convolutional layers except the first one. For instance, compare the files tiny-yolo_xnor.cfg and tiny-yolo.cfg: you will see that the only difference is that in tiny-yolo_xnor.cfg the first line of each convolutional layer (except the first one) is xnor=1, as in the sketch below.
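For example, a binarized convolutional layer in the .cfg looks roughly like this (the layer parameters below are illustrative, not copied from a specific layer):

[convolutional]
xnor=1
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky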
After configuring your BNN, you will have to train it. There is no easy way of binarizing the weights of a pre-trained CNN, so it will have to be trained from scratch. If you want to run the flow and get started with darknet and BNNs before getting to training, I suggest you use the .weights file that I supplied earlier in this thread: https://www.mediafire.com/file/dm45vmfedz6e73z/yolov3-tiny_xnor_last.weights/file. These weights refer to the Street View House Numbers dataset.
I hope this may help you.
@joaomiguelvieira, thanks for your quick and complete response to my question. I will train my model and then share the results with you (accuracy and speed on the Raspberry Pi). Your comment is very helpful.
@joaomiguelvieira @AlexeyAB Nice job. I was wondering where the xnor weights come from; is there some way to get them? I am comparing tiny_v3_pan3.cfg against yolov3_tiny_3l.cfg: for yolov3_tiny_3l I can get the xnor pretrained model from yolov3_tiny_xnor_last.weights, but how do I get a pretrained model for tiny_v3_pan3? Is there some way to obtain an xnor pretrained model for it?
@PiseyYou
All v3_tiny... models use the same first several layers. So you can use the same partial pre-trained weights file for yolov3_tiny, yolov3_tiny_3l, tiny_v3_pan3, ....
@AlexeyAB Thanks, I will give it a try and report back later.
@AlexeyAB Hi, I have trained yolov3-tiny_xnor and want to print the feature maps of the intermediate layers during detection on a single image. How should I modify your code?
Hi @AlexeyAB,
I am trying to run yolov3-tiny_xnor.cfg for detection on a Raspberry Pi. I have trained the network and tested it on an Intel-based system, and it works just fine. However, when I run it on the RPi, nothing is detected! I am using the very same command and the very same version of the framework on both systems. Can you help me figure out what is going on?
I am using the command
./darknet detector test data/coco.data cfg/yolov3-tiny_xnor.cfg yolov3-tiny_xnor_last.weights data/person.jpg
The content of coco.data is:
The content of yolov3-tiny_xnor.cfg is:
The .weights file can be found here: http://www.mediafire.com/file/vahpux9xefw1tci/yolov3-tiny_xnor_122000.weights
Finally, the person.jpg image is the one already present in the data folder.