Insight Data Engineering Project, Boston, April - June 2018
April 2020
DEPRECATED -- please see the improved implemenation: https://github.com/alexklibisz/elastiknn
January 2020
I've made the improved implementation public: https://github.com/alexklibisz/elastiknn
November 2019
I am actively working on an improved implementation of this plugin. I hope to open-source and release it late 2019 or early 2020
Image similarity search demo running searches on a cluster of 4 Elasticsearch nodes |
I built Elasticsearch-Aknn (EsAknn), an Elasticsearch plugin which implements approximate K-nearest-neighbors search for dense, floating-point vectors in Elasticsearch. This allows data engineers to avoid rebuilding an infrastructure for large-scale KNN and instead leverage Elasticsearch's proven distributed infrastructure.
To demonstrate the plugin, I used it to implement an image similarity search engine for a corpus of 6.7 million Twitter images. I transformed each image into a 1000-dimensional floating-point feature vector using a convolutional neural network. I used EsAknn to store the vectors and search for nearest neighbors on an Elasticsearch cluster.
The repository is structured:
demo
directory: Twitter Image Similarity search pipeline and web-applicationelasticsearch-aknn
directory: EsAknn implementation and benchmarksscratch
directory: several smaller projects implemented while prototypingEsAknn is useful for problems roughly characterized:
Tldr: If you need to quickly run KNN on an extremely large corpus in an offline job, use one of the libraries from Ann-Benchmarks. If you need KNN in an online setting with support for horizontally-scalable searching and indexing new vectors in near-real-time, consider EsAknn (especially if you already use Elasticsearch).
There are about a dozen high-quality open-source approximate-nearest-neighbors libraries. The Ann-Benchmarks project is a great place to compare them. Most of them take a large corpus of vectors, build an index, and expose an interface to run very fast nearest-neighbors search on that fixed corpus.
Unfortunately they offer very little infrastructure for deploying your nearest-neighbors search in an online setting. Specifically, you still have to consider:
Elasticsearch already solves the non-trivial infrastrcture problems, and EsAknn implements approximate nearest-neighbors indexing and search atop this proven infrastructure.
EsAknn's LSH implementation is very simplistic in the grand scheme of approximate-nearest-neighbors approaches, but it maps well to Elasticsearch and still yields high recall. EsAknn's speed for serial queries is much slower than other approximate nearest-neighbor libraries, but it's also not designed for serial querying. Instead it's designed to serve many concurrent searches over a convenient HTTP endpoint, index new vectors in near-real-time, and scale horizontally with Elasticsearch. For specific performance numbers, see the performance section below and the slides linked in the demo section.
Given a sample of vectors, create a locality-sensitive-hashing (LSH) model and store it as an Elasticsearch document.
POST <elasticsearch host>:9200/_aknn_create
{
"_index": "aknn_models",
"_type": "aknn_model",
"_id": "twitter_image_search",
"_source": {
"_aknn_description": "LSH model for Twitter image similarity search",
"_aknn_nb_tables": 64,
"_aknn_nb_bits_per_table": 18,
"_aknn_nb_dimensions": 1000
},
"_aknn_vector_sample": [
# Provide a sample of 2 * _aknn_nb_tables * _aknn_nb_bits_per_table vectors
[0.11, 0.22, ...],
[0.22, 0.33, ...],
...
[0.88, 0.99, ...]
]
}
This returns:
{ "took": <number of milliseconds> }
Given a batch of new vectors, hash each vector using a pre-defined LSH model and store its raw and hashed values in an Elasticsearch document.
POST <elasticsearch host>:9200/_aknn_index
{
"_index": "twitter_images",
"_type": "twitter_image",
"_aknn_uri": "aknn_models/aknn_model/twitter_image_search"
"_aknn_docs": [
{
"_id": 1,
"_source": {
"_aknn_vector": [0.12, 0.23, ...],
# Any other fields you want...
}
}, ...
]
}
This returns:
{ "took": <number of milliseconds>, "size": <number of documents indexed> }
Given a vector in the index, search for and return its nearest neighbors.
GET <elasticsearch host>:9200/twitter_images/twitter_image/1/_aknn_search?k1=1000&k2=10
This returns:
{
"took": <number of milliseconds>,
"timed_out": false,
"hits": {
"max_score": 0,
"total": <number of hits returned, up to k2>,
"hits": [
{
"_id": "...",
'_index': "twitter_images",
"_score": <euclidean distance from query vector to this vector>,
'_source': {
# All of the document fields except for the potentially
# large fields containing the vector and hashes.
}
}, ...
]
}
}
The key things to know about the implementation are:
k1
approximate nearest neighbors based on discrete hashes. It then
computes the exact distance to each of these approximate neighbors and returns
the k2
closest. For example, you might set k1 = 1000
and k2 = 10
.EsAknn's speed is generally characterized:
The corpus vs. search time generally follows a sub-linear pattern like this:
<img src="elasticsearch-aknn/benchmark/metrics/fig_corpus_vs_time.png" height=300 width=auto/>
Beyond that, speed is a function of:
k1
. k2
.In the image similarity search engine, you can see that searches against an index of 6.7 million 1000-dimensional vectors rarely exceed 200 milliseconds.
Recall is defined as the proportion of true nearest neighbors returned for
a search and can be evaluated at various values of k2
. For example, if
you know your application needs to retrieve the top ten most similar items,
you should evaluate recall at k2 = 10
.
Similar to speed, recall depends on the LSH configuration. Increasing k1
is typically the easiest way to increase recall, but the number of tables and
bits also play an important role. Finding a configuration to maximize
recall and minimize search time can be considered a form of hyper-parameter
optimization.
The figure below demonstrates that it is possible to find a configuration with high-recall and low search-time at various corpus sizes. The points plotted represent the "frontier" of recall/search-time. That is, I ran benchmarks on many configurations and chose the configurations with the lowest median search time for each median recall across three corpus sizes.
<img src="elasticsearch-aknn/benchmark/metrics/fig_recall_vs_time.png" height=350 width=auto/>
The table below shows the best configuration for each combination of corpus size, median recall, median search time with a median recall >= 0.5.
Corpus size | Med. recall | Med. search time | k1 | _aknn_nb_tables | _aknn_nb_bits_per_table | |
---|---|---|---|---|---|---|
0 | 1000000 | 1 | 191 | 500 | 200 | 12 |
1 | 1000000 | 0.9 | 100 | 500 | 100 | 14 |
2 | 1000000 | 0.8 | 62 | 1000 | 50 | 16 |
3 | 1000000 | 0.7 | 49 | 500 | 50 | 16 |
4 | 1000000 | 0.6 | 43 | 250 | 50 | 16 |
5 | 1000000 | 0.5 | 50 | 250 | 50 | 19 |
6 | 100000 | 1 | 26 | 250 | 100 | 12 |
7 | 100000 | 0.9 | 21 | 500 | 50 | 14 |
8 | 100000 | 0.8 | 14 | 250 | 50 | 18 |
9 | 100000 | 0.7 | 11 | 100 | 50 | 14 |
10 | 100000 | 0.6 | 11 | 100 | 50 | 19 |
11 | 100000 | 0.5 | 14 | 500 | 10 | 8 |
12 | 10000 | 1 | 8 | 100 | 100 | 8 |
13 | 10000 | 0.9 | 5 | 100 | 50 | 12 |
14 | 10000 | 0.8 | 5 | 100 | 50 | 18 |
15 | 10000 | 0.7 | 6 | 250 | 10 | 8 |
16 | 10000 | 0.6 | 6 | 15 | 100 | 18 |
17 | 10000 | 0.5 | 3 | 15 | 50 | 14 |
The image processing pipeline consists of the following components, shown in pink and green above:
conv_pred
layer from
Keras pre-trained MobileNet
to compute the 1000-dimensional feature vectors. Image feature extraction is the main bottleneck in this pipeline. It's embarrassingly parallel but still requires thoughtful optimization. In the end I was able to compute:
My first-pass plateaued at about 2 images / node / second. I was able to improve throughput with the following optimizations:
Here are a handful of resources I found particularly helpful for this project: