I've noticed a couple of bottlenecks that make similarity search unusually slow. The following changes should yield a drastic reduction in execution time, in my test case to roughly 4 % of the current master branch (i.e. a ~25x speed-up), while not drifting too far into micro-optimization. Since this involves quite a few changes, I've tried to preserve and clarify intermediate steps as much as possible.
Methodology
I used this simple program to run a similarity search on 20666 JPEG images of two different sizes stored on a RAM disk. Each image had at least one duplicate. The profiling program ran on macOS 10.12 with 4 logical cores and 16 GB of RAM. Program and framework were built with -Ofast, -mavx2, LTO enabled and PIE disabled (main program only).
Baseline
The current baseline code took around 4 minutes and 32 seconds, which is unusually slow given the number of images and the type of hardware. Looking at the CPU profile, we see that the program spent most of its time creating new arrays and objects:
This is even more obvious given the fact that the program needed almost 23 GB of memory to complete:
The root problem here is that - [OSSimilaritySearch similarImagesWithProvider:withHashDistanceThreshold:forImageStreamHandler:forResultHandler:] calls - [OSImageHashing hashImageData:withProviderId:] for every image and - [OSImageHashing hashDistance:to:withProviderId:] for every pair combination of images. Both methods call OSImageHashingProviderFromImageHashingProviderId(), which calls down to NSArrayForProvidersFromOSImageHashingProviderId() and picks the first object from the returned array. So for every image and every pair combination of images, one array and three image hashing provider objects are created and subsequently discarded. That is 4·(n + n(n−1)/2), i.e. O(n²), extra allocations in time and space, whereas we really only need a single image hashing provider object during the whole process. The first step thus is to create the image hashing provider object outside the loop in similarImagesWithProvider:withHashDistanceThreshold:forImageStreamHandler:forResultHandler:. We can also introduce shared instances for hashing providers, just in case. Additionally, this fixes a bug where NSArrayForProvidersFromOSImageHashingProviderId() would return an aHash provider before a pHash provider, contradicting its definition.
f354cd5 Fix wrong hashing provider order returned from NSArrayForProvidersFromOSImageHashingProviderId()
69e1a2e Use shared instances for hashing providers
3885476 Cache hashing provider out of loop in similarImagesWithProvider:withHashDistanceThreshold:forImageStreamHandler:forResultHandler:
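The combined effect of the shared instance and the hoisting can be sketched as follows. This is an illustrative sketch only: the category name and the loop body are mine, and the provider protocol/method names follow the project's conventions from memory rather than the exact committed code.

```objc
// Sketch: a dispatch_once-backed shared instance, so that provider lookup
// no longer allocates a fresh object on every call.
@implementation OSPHash (SharedInstance) // category name illustrative
+ (instancetype)sharedInstance
{
    static OSPHash *instance = nil;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        instance = [[OSPHash alloc] init];
    });
    return instance;
}
@end

// Sketch: resolve the provider once, outside the per-image and per-pair loops.
id<OSImageHashingProvider> provider =
    OSImageHashingProviderFromImageHashingProviderId(providerId);
for (NSData *imageData in images) {
    OSHashType hash = [provider hashImageData:imageData]; // no per-iteration lookup
    // hashes are then compared pairwise, reusing the same provider object
}
```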
Step 2
The above changes alone drastically reduce execution time down to 39s:
We can see now that there is still a lot of time spent on Objective-C lifecycle calls (objc_retain, objc_release) as well as Objective-C messages in general (objc_msgSend). Our new target is - [NSArray(CocoaImageHashing) arrayWithPairCombinations:withResultHandler:]. This method has a couple of problems.
First, it caches neither the element count nor the outer loop's element, resulting in unnecessary Objective-C messages.
Second, every time an object is selected from the array and passed to matcher() or resultHandler(), ARC inserts matching retain/release calls on both sides. In this case, all of those calls are completely unnecessary: the NSArray already holds strong references to its objects, and the only way for any of them to disappear would be if the NSArray were mutated while we iterate it, which would be a programming error in any case. Instructing ARC not to emit retain/release calls for objects that we know can't go away saves a lot of Objective-C calls. The same applies to NSDictionary iterations.
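Both the count caching and the ARC hint can be expressed with a cached local and __unsafe_unretained qualifiers; a sketch (loop body abridged, handler name from the interface above):

```objc
// Sketch: __unsafe_unretained tells ARC not to emit retain/release for these
// locals. This is safe only because the array holds strong references and is
// not mutated during iteration.
NSUInteger count = [array count]; // cached once, not re-sent per iteration
for (NSUInteger i = 0; i < count; i++) {
    __unsafe_unretained id leftHand = array[i]; // no retain/release emitted
    for (NSUInteger j = i + 1; j < count; j++) {
        __unsafe_unretained id rightHand = array[j];
        resultHandler(leftHand, rightHand);
    }
}
```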
Third, instead of issuing one Objective-C call per array object, we can issue a single call to copy all objects into a C array and then iterate over that.
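With getObjects:range:, that looks roughly like this (sketch; buffer handling simplified):

```objc
#import <Foundation/Foundation.h>

// Sketch: one getObjects:range: call copies all element pointers into a
// C buffer; the loop then runs without a per-element objc_msgSend.
NSUInteger count = [array count];
__unsafe_unretained id *objects = (__unsafe_unretained id *)calloc(count, sizeof(id));
[array getObjects:objects range:NSMakeRange(0, count)];
for (NSUInteger i = 0; i < count; i++) {
    __unsafe_unretained id object = objects[i];
    // use `object` here; no retain/release, no message send
}
free(objects);
```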
Fourth, while the actual hashing process is multi-threaded, the iteration of image pair combinations for comparison currently happens on a single thread. This can be parallelized easily, using spin locks to synchronize the calling methods' critical sections (see below for a note on the use of spin locks).
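One way to parallelize the enumeration is dispatch_apply over the outer loop; a sketch, where `matches()` is a hypothetical stand-in for the actual hash distance comparison:

```objc
#import <Foundation/Foundation.h>
#import <libkern/OSAtomic.h>

// Sketch: split the outer loop across threads with dispatch_apply; appending
// to the shared result array is the only critical section.
NSUInteger count = [array count];
NSMutableArray *result = [NSMutableArray array];
__block OSSpinLock lock = OS_SPINLOCK_INIT;
dispatch_apply(count, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t i) {
    for (NSUInteger j = i + 1; j < count; j++) {
        if (matches(array[i], array[j])) { // hypothetical predicate
            OSSpinLockLock(&lock);
            [result addObject:@[ array[i], array[j] ]];
            OSSpinLockUnlock(&lock);
        }
    }
});
```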
At this stage, with - [NSArray(CocoaImageHashing) arrayWithPairCombinations:withResultHandler:] not really returning an array and the matcher() block de facto consisting of a single if statement, we can simplify the interface to align with - [NSArray enumerateObjectsUsingBlock:], thus also saving one extra block call.
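The simplified, enumerate-style shape might look like this (signature sketched; the one actually committed may differ in naming and types):

```objc
// Sketch: one block instead of matcher() + resultHandler(); callers inline
// the former matcher's `if` themselves, saving a block call per pair.
- (void)enumeratePairCombinationsUsingBlock:(void (^)(id leftHand, id rightHand))block;
```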
f62cd92 Prevent excessive ARC cycles for array members in loops
3b96947 Cache count and objects in arrayWithPairCombinations:withResultHandler: loops
e943bd2 Use C array to iterate over elements
5d0c818 Parallelize enumeration of pair combinations
9243ce1 Simplify interface reducing block calls
Step 3
We can now apply additional minor optimizations, making the inner loop almost completely free of Objective-C calls.
a7a40e2 Replace @synchronized block with spin lock
7b1c6fd Covariant rule is not safe for writable properties
b8eaedd Check image data of input tuple for nil before entering semaphore
9d392aa Inline hamming distance calculation
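For reference, the hamming distance on 64-bit hashes boils down to a single XOR plus a popcount; a minimal C sketch (the project's actual implementation may differ in detail):

```c
#include <stdint.h>

/* Hamming distance between two 64-bit perceptual hashes:
   XOR yields the differing bits, popcount counts them. */
static inline int64_t hamming_distance(int64_t leftHand, int64_t rightHand)
{
    return __builtin_popcountll((uint64_t)(leftHand ^ rightHand));
}
```

Inlining this removes a message send from the hottest loop, where it is called once per pair combination.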
This brings run time down to about 10 seconds, with around one quarter of that being spent on filesystem traversal:
The remaining changes are of cosmetic nature:
25f1a01 Use os_unfair_lock instead of OSSpinLock on macOS 10.12/iOS 10.0
3ddf2a3 Use OS_UNUSED macro introduced in macOS 10.12/iOS 10.0
9a55067 Move utility macros and helpers to private header
6150f85 Remove OSDHash.h from public headers
Note on spin locks
Spin locks are disputed on iOS because they can lead to priority inversion [1][2]. In this case, however, the only critical section protected is - [NSMutableArray addObject:], which is so tiny that it's very unlikely to cause any contention at all. Additionally, 25f1a01 makes use of the solution that Apple has used in the Objective-C runtime for a while and made public in macOS 10.12/iOS 10.0.
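The availability switch in 25f1a01 can be sketched roughly like this (macro names are illustrative, not the committed ones):

```objc
#import <os/lock.h>
#import <libkern/OSAtomic.h>

// Sketch: prefer os_unfair_lock where the deployment target allows it; it
// donates priority to the lock owner and thus avoids priority inversion.
#if __MAC_OS_X_VERSION_MIN_REQUIRED >= 101200
typedef os_unfair_lock OSLockType;
#define OSLockInit       OS_UNFAIR_LOCK_INIT
#define OSLockLock(l)    os_unfair_lock_lock(l)
#define OSLockUnlock(l)  os_unfair_lock_unlock(l)
#else
typedef OSSpinLock OSLockType;
#define OSLockInit       OS_SPINLOCK_INIT
#define OSLockLock(l)    OSSpinLockLock(l)
#define OSLockUnlock(l)  OSSpinLockUnlock(l)
#endif
```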