ameingast / cocoaimagehashing

Perceptual Image Hashing for macOS, iOS, tvOS and watchOS
Other
262 stars 66 forks source link

Similarity search profiling #8

Closed plumbojumbo closed 7 years ago

plumbojumbo commented 7 years ago

I've noticed a couple of bottlenecks that make similarity search unusually slow. The following changes should yield a drastic reduction in execution time, in my test case to roughly 4 % compared to the current master branch (i. e. ~25x speed-up), while not drifting into micro-optimization too much. Since this involves quite some changes I've tried to preserve and clarify intermediate steps as much as possible.

Methodology

I used this simple program to run a similarity search on 20666 JPEG images of two different sizes stored on a RAM disk. Each image had at least one duplicate. The profiling program run on macOS 10.12 with 4 logical cores and 16 GB of RAM. Program and framework were build with -Ofast, -mavx2, LTO on and PIE off (main program only).

Baseline

The current baseline code took around 4 minutes and 32 seconds, which is unuasually slow given the number of images and type of hardware. By looking at the CPU profile we see that the program spent most of the time creating new arrays and objects:

baseline_cpu

This is even more obvious given the fact that the program needed almost 23 GB of memory to complete:

baseline_memory

The root problem here is that - [OSSimilaritySearch similarImagesWithProvider:withHashDistanceThreshold:forImageStreamHandler:forResultHandler:] calls [OSImageHashing hashImageData:withProviderId:] for every image and - [OSImageHashing hashDistance:to:withProviderId:] for every pair combination of images. Those methods both call OSImageHashingProviderFromImageHashingProviderId(), which calls down to NSArrayForProvidersFromOSImageHashingProviderId() and picks the first object from the returned array. So for every image and every pair combination of images one array and three image hashing provider objects are created and subsequently discarded. That is 4 n n! of extra time and space whereas we really only need one image hashing object during the whole process. The first step thus is to create the image hashing provider object out of the loop in similarImagesWithProvider:withHashDistanceThreshold:forImageStreamHandler:forResultHandler:. We can also introduce shared instances for hashing providers, just in case. Additionally, this fixes a bug where NSArrayForProvidersFromOSImageHashingProviderId() would return an aHash provider before a pHash provider, contradicting it's definition.

Step 2

The above changes alone drastically reduce execution time down to 39s:

3885476

We can see now that there is still a lot of time spent on Objective-C lifecycle calls (objc_retain, objc_release) as well as Objective-C messages in general (objc_msgSend). Our new target is - [NSArray(CocoaImageHashing) arrayWithPairCombinations:withResultHandler:]. This method has a couple of problems.

First, it doesn't cache neither the element count nor the the outer loop's element, resulting in unneccessary Objective-C messages.

Second, every time an object is selected from the array and passed to matcher() or resultHandler(), ARC inserts respective retain/release calls on both sides. But in this case, all those calls are completely unneccessary because the NSArray already holds strong references to it's objects and the only way for any of them to disappear would be if the NSArray was mutated while we iterate it, which would be a programming error in any case. Instructing ARC to not emit retain/release calls for objects that we know can't go away, saves a lot of Objective-C calls. The same applies for NSDictionary iterations.

Third, instead of using single Objective-C calls to get each array object, we can issue one call to get all objects in a C array and then iterate that.

Fourth, while the actual hashing process is multi-threaded, iteration of image pair combinations for comparison currently happens on one thread only. This can be parallized easily, using spin locks to synchronize calling methods' critical sections (see below for a note on the use of spin locks).

At this stage, with - [NSArray(CocoaImageHashing) arrayWithPairCombinations:withResultHandler:] not really returning an array and the matcher() block de facto consisting of a single if statement, we can simplify the interface to align with - [NSArray enumerateObjectsUsingBlock:], thus also saving one extra block call.

9243ce1

Step 3

We can now apply additional minor optimizations, making the inner loop almost completely free of Objective-C calls.

This brings run time down to about 10 seconds, with around one quarter of that being spent on filesystem traversal:

315729c

The remaining changes are of cosmetic nature:

Note on spin locks

Spin locks are disputed on iOS because they can lead to priority inversion. 1 2 In this case, however, the only critical section protected is - [NSMutableArray addObject:], which is so tiny that it's very unlikely to lead to any congestion at all (cmp.). Additionally, 25f1a01 makes use of the solution that Apple has used for a while in the Objective-C runtime and made public in 10.12/10.10.

ameingast commented 7 years ago

Great stuff again, thanks a lot!