ermig1979 / Simd

C++ image processing and machine learning library using SIMD: SSE, AVX, AVX-512 and AMX for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC, NEON for ARM.
http://ermig1979.github.io/Simd
MIT License

MaxDepth > 1 for CascadeClassifier #273

Closed TonyCongqianWang closed 1 month ago

TonyCongqianWang commented 1 month ago

Hello,

I noticed that your library does not support trained cascade classifiers with depth > 1. I was trying to add support for it myself, but I got stuck at SimdDetectionLbpDetect16ii. I could not find the definition of Neon::SimdDetectionLbpDetect16ii or Base::SimdDetectionLbpDetect16ii. Are those functions written in assembly and imported? I would still like to use your SIMD implementation for extracting the LBP features, but implement my own decision tree.

ermig1979 commented 1 month ago

Hi! Look for Neon::DetectionLbpDetect16ii and Base::DetectionLbpDetect16ii. These functions are defined in the files SimdNeonDetection.cpp and SimdBaseDetection.cpp.

TonyCongqianWang commented 1 month ago

Thank you so much for your fast reply! I think I have mostly understood the Base::Detect method, but I am still a bit unsure about what exactly the line

sum += leaves[subset[c >> 5] & (1 << (c & 31)) ? leafOffset : leafOffset + 1];

does. I believe there are two different leaf values, depending on whether the feature is active according to some condition stored in subset. Is that correct? But how exactly is this condition evaluated? And if so, why are the node thresholds not stored in the node itself, but in subsets instead? Does that allow for better SIMD optimization? It looks to me as though (regardless of Base or NEON) Calculate is always called to compute the LBP values. I would have thought that calculating the LBP features is the most expensive part of the detect function, so it would make sense to cache feature values if features are shared between stages.

As for the Neon::Detect method, I am a bit confused. I believe it does the same thing, as confirmed by your unit tests, but I am still unsure about a few things:

It seems to me that Base::Detect is fairly easy to modify to allow trees deeper than 1, but the NEON version is not. It would be enough for me to use SIMD instructions only for the LBP feature calculation, which it seems is done in LeafMask. If I give LeafMask my root node thresholds as the subset parameter, I should get the left/right traversal decisions, correct?

In general I am wondering: are you interested in adding support for depths > 1, or do you think it is not worth it?

ermig1979 commented 1 month ago

I wrote this code more than 10 years ago and can't remember some details.

sum += leaves[subset[c >> 5] & (1 << (c & 31)) ? leafOffset : leafOffset + 1];

subset is an array of eight 32-bit integers which stores 256 1-bit values. subset[c >> 5] & (1 << (c & 31)) selects one of these boolean values by the index c.
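A stand-alone sketch of that bit lookup may make it concrete (SubsetBit is an illustrative name, not the library's API): eight 32-bit words pack a 256-entry boolean table, indexed by the 8-bit LBP code c.

```cpp
#include <cstdint>

// Hypothetical illustration of the lookup in the line above:
// subset[c >> 5] picks one of the eight 32-bit words,
// (1u << (c & 31)) masks out the bit for code c within that word.
bool SubsetBit(const uint32_t subset[8], int c)
{
    return (subset[c >> 5] & (1u << (c & 31))) != 0;
}
```

So the "threshold" of a node is really an arbitrary 256-entry membership table over LBP codes, which is why it lives in subset rather than as a scalar threshold in the node.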

TonyCongqianWang commented 1 month ago

Again, thanks a lot for your help! Now I understand: c >> 5 and c & 31 are equivalent to c / 32 and c % 32. Good to know that the decision boundary for a given LBP feature can be arbitrary and not just some threshold.

So after some thinking I concluded that it shouldn't be too hard to convert the code to allow deeper trees. It might be as simple as adding one for loop and saving the decision in leaves (either directly as the new offset index, or as 0/1 for left and right, using the usual tree traversal logic to calculate the new offset index).
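The loop idea above could be sketched roughly like this, assuming a hypothetical breadth-first node layout; TreeNode, TraverseTree, and the codes array are illustrative names, not the library's actual data structures.

```cpp
#include <cstdint>
#include <vector>

// One node of a small binary tree: a 256-bit lookup table over LBP codes,
// matching the subset tables of the stump case.
struct TreeNode
{
    uint32_t subset[8];
};

// Walk a tree of the given depth, stored breadth-first (children of node i
// are 2*i+1 and 2*i+2). codes[i] is the LBP code computed for node i's
// feature. Bit set -> go left (0), bit clear -> go right (1). Returns the
// leaf index in 0 .. 2^depth - 1; for depth 1 this reproduces the stump's
// leafOffset / leafOffset + 1 choice.
int TraverseTree(const std::vector<TreeNode>& nodes,
                 const std::vector<int>& codes, int depth)
{
    int node = 0, leaf = 0;
    for (int d = 0; d < depth; ++d)
    {
        int c = codes[node];
        int bit = (nodes[node].subset[c >> 5] >> (c & 31)) & 1;
        leaf = leaf * 2 + (bit ? 0 : 1);
        node = node * 2 + 1 + (bit ? 0 : 1);
    }
    return leaf;
}
```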

Would you be interested in adding this?

ermig1979 commented 1 month ago

Unfortunately no, for the following reasons:

  1. The HAAR and LBP cascade classifiers have much lower accuracy and performance compared to solutions based on neural networks, so there is no point in optimizing legacy algorithms. My priority is the optimization of DL-based algorithms.
  2. The current SIMD optimizations rely on the fact that the cascade classifier data (in the stump-based case, depth = 1) is the same for every point of the image. If the algorithm branches, it becomes very difficult to use SIMD.
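Point 2 can be illustrated in plain C++ simulating four SIMD lanes (Stump, EvalStump, and the threshold-based comparison are illustrative, not the library's code): with depth 1 every lane reads the same node data, so the decision is a branch-free per-lane select that a compiler or hand-written SIMD can vectorize. With deeper trees each lane may sit at a different node, so the loads diverge into per-lane gathers.

```cpp
#include <array>

constexpr int Lanes = 4; // stand-in for a SIMD register width

// A depth-1 weak classifier shared by all lanes.
struct Stump { int threshold, leftLeaf, rightLeaf; };

// One shared stump, per-lane feature values: the comparison and select
// are the same operation for every lane, which is what makes the
// stump case SIMD-friendly.
std::array<int, Lanes> EvalStump(const Stump& s,
                                 const std::array<int, Lanes>& feature)
{
    std::array<int, Lanes> leaf{};
    for (int i = 0; i < Lanes; ++i)
        leaf[i] = feature[i] < s.threshold ? s.leftLeaf : s.rightLeaf;
    return leaf;
}
```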

Certainly you can try to optimize this case on your own. If you do, I will gladly add your solution to the main Simd branch.

TonyCongqianWang commented 1 month ago

Oh, thanks again for your input. What you said makes a lot of sense. Multiple windows are evaluated at the same time, and if two windows branch differently, SIMD operations no longer work. That is a pity. The only easy optimization would be to use the current SIMD implementation for the root and then fall back to the slower Base version when branching. That does indeed defeat the purpose of the SIMD library, and using the OpenCV version might be better at that point.

TonyCongqianWang commented 1 month ago

Regarding DL algorithms: are there any full pipelines implemented in Simd yet? Can you recommend an architecture that has fast CPU performance? When I use your cascade with LBP features, I need around 1 ms per image (250 x 200) on my laptop and around 15 ms on my Raspberry Pi.

ermig1979 commented 1 month ago

I develop Synet. This framework allows inference of trained neural models and uses Simd as a backend.

TonyCongqianWang commented 1 month ago

Thanks a lot, it looks great! I will definitely install it and test its performance for my use case.