elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Vector efficient math operations for KNN distance functions #92260

Closed Tradunsky closed 1 year ago

Tradunsky commented 1 year ago

Description

Thank you very much for the diversity of choices when it comes to dense vector search operations 🙏🏻

In my private measurements on float32 vectors with 768 dimensions, Elasticsearch's exact KNN cosine similarity search is somehow slower than in the OpenSearch project (apologies if that comparison is inappropriate here; I mainly want Elasticsearch to improve so it can capture the wave of NLP use cases in the AWS cloud).

KNN dense vector search uses Lucene's VectorUtil.dotProduct calculation for the cosine distance function, which already seems to take advantage of one of the possible optimizations 💪🏻

I wonder if cosine and other distance computation functions in Elasticsearch can benefit from java support of vector optimal operations? https://openjdk.org/jeps/417
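
For illustration only, here is a minimal sketch of the kind of SIMD kernel JEP 417 enables. This is not Elasticsearch or Lucene code; the class and method names are invented for the example, and it needs --add-modules jdk.incubator.vector to compile and run.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch of a SIMD dot product with the incubating Vector API (JEP 417).
public final class SimdDotProduct {

    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static float dotProduct(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upperBound = SPECIES.loopBound(a.length);
        // Vectorized main loop: multiply-accumulate SPECIES.length() lanes per iteration.
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);
        }
        float result = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for the remaining elements.
        for (; i < a.length; i++) {
            result += a[i] * b[i];
        }
        return result;
    }
}
```

Cosine similarity could then be built on top of such a kernel as dotProduct(a, b) / sqrt(dotProduct(a, a) * dotProduct(b, b)), or reduced to a plain dot product when the vectors are pre-normalized.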

Thank you!

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

jdconrad commented 1 year ago

Hi @Tradunsky! I was curious if you'd be able to share some of your setup here for how you're doing your measurements? In particular I'm curious if you're using indexed or non-indexed vectors for your exact vector search scripts.

We are indeed interested in the performance gains afforded by the jdk.incubator.vector API; however, given the lack of stability of a project in incubation there are significant difficulties in building this into Elasticsearch.

Tradunsky commented 1 year ago

Hi @jdconrad,

Thank you for the reply!

> In particular I'm curious if you're using indexed or non-indexed vectors for your exact vector search scripts.

Non-indexed.

Index definition

"settings": {
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 7,
        }
    },
    "mappings": {
        "properties": {
            "entity_id": {"type": "keyword"},
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": False,
             }
         }
  }

Search query


        "_source": False,
        "query": {
            "script_score": {
                "query": {
                    "bool": {
                        "filter": {
                            "term": {
                                "entity_id": entity_id
                            }
                        }
                    }
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {
                        "query_vector": vector
                    }
                }
            }
        }
    }

Data volumes: 1M records for 1 entity. 1M docs total

Cluster spec:

(screenshot of the cluster spec, not reproduced here)
jdconrad commented 1 year ago

@Tradunsky Thanks for the info! Non-indexed makes sense to me, as we use a different storage strategy in that case (as I'm guessing you've seen, since you've pointed out the lines of code! :) ). I will look into what we can do here with unrolling to improve performance for this use case.

Tradunsky commented 1 year ago

Thank you @jdconrad much appreciate your time 🙏🏻

jdconrad commented 1 year ago

I wanted to update this issue as it currently stands:

@grcevski helped me profile the non-indexed path and we came to the conclusion that loop unrolling is already occurring to some extent and that the real bottleneck appears to be unwrapping floats from the binary data used for non-indexed vectors. I will continue to look to see if we can improve performance in some other more creative way.
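
To make the described bottleneck concrete, here is a hypothetical sketch (not the actual Elasticsearch implementation) of the per-document work for non-indexed vectors: the stored bytes have to be decoded back into a float[] before the comparatively cheap similarity loop can run. The little-endian byte order and the helper names are assumptions for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration: for non-indexed vectors, every scored document's
// stored bytes must first be turned back into floats before any distance
// function can run. Not the actual Elasticsearch code.
public final class VectorDecodeSketch {

    // Decode a dims-dimensional float vector from its binary representation.
    static float[] decode(byte[] stored, int dims) {
        float[] vector = new float[dims];
        ByteBuffer.wrap(stored)
            .order(ByteOrder.LITTLE_ENDIAN) // byte order is an assumption here
            .asFloatBuffer()
            .get(vector);
        return vector;
    }

    // Per-document cost = decode(...) plus the dot-product style loop below.
    static float cosine(byte[] storedDoc, float[] query, int dims) {
        float[] doc = decode(storedDoc, dims);
        float dot = 0f, docNorm = 0f, queryNorm = 0f;
        for (int i = 0; i < dims; i++) {
            dot += doc[i] * query[i];
            docNorm += doc[i] * doc[i];
            queryNorm += query[i] * query[i];
        }
        return (float) (dot / Math.sqrt(docNorm * queryNorm));
    }
}
```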

Tradunsky commented 1 year ago

@jdconrad Thank you very much for looking at this 🙏🏻

I was about to provide an update as well, but did not mean to pull the ticket in a different direction.

While exploring pre-filtering with ANN, and after profiling the query more carefully, I realized that, at least for ANN, distance computation is not the bottleneck (378.8 micros). Here is the profile of the shard that takes the longest (for ANN):

```json { 'id': '[NAW19kkiRIiOgwobsJZTWQ][index-name-eval][6]', 'searches': [ { 'query': [ { 'type': 'BooleanQuery', 'description': 'ScoreAndDocQuery (ConstantScore(entity_id:1234567890abcdefg))^0.0', 'time': '94.9ms', 'time_in_nanos': 94904769, 'breakdown': { 'set_min_competitive_score_count': 0, 'match_count': 2524, 'shallow_advance_count': 0, 'set_min_competitive_score': 0, 'next_doc': 69727725, 'match': 4320484, 'next_doc_count': 2638, 'score_count': 2524, 'compute_max_score_count': 0, 'compute_max_score': 0, 'advance': 3914115, 'advance_count': 20, 'score': 13072182, 'build_scorer_count': 40, 'create_weight': 28807, 'shallow_advance': 0, 'create_weight_count': 1, 'build_scorer': 3841456 }, 'children': [ { 'type': 'KnnScoreDocQuery', 'description': 'ScoreAndDocQuery', 'time': '378.8micros', 'time_in_nanos': 378884, 'breakdown': { 'set_min_competitive_score_count': 0, 'match_count': 0, 'shallow_advance_count': 58, 'set_min_competitive_score': 0, 'next_doc': 0, 'match': 0, 'next_doc_count': 0, 'score_count': 1, 'compute_max_score_count': 58, 'compute_max_score': 92290, 'advance': 36632, 'advance_count': 21, 'score': 2473, 'build_scorer_count': 60, 'create_weight': 1685, 'shallow_advance': 107486, 'create_weight_count': 1, 'build_scorer': 138318 } }, { 'type': 'BoostQuery', 'description': '(ConstantScore(entity_id:1234567890abcdefg))^0.0', 'time': '18.1ms', 'time_in_nanos': 18109744, 'breakdown': { 'set_min_competitive_score_count': 0, 'match_count': 0, 'shallow_advance_count': 58, 'set_min_competitive_score': 0, 'next_doc': 0, 'match': 0, 'next_doc_count': 0, 'score_count': 2524, 'compute_max_score_count': 58, 'compute_max_score': 120726, 'advance': 13014199, 'advance_count': 2657, 'score': 4017757, 'build_scorer_count': 58, 'create_weight': 11721, 'shallow_advance': 93150, 'create_weight_count': 1, 'build_scorer': 852191 }, 'children': [ { 'type': 'TermQuery', 'description': 'entity_id:1234567890abcdefg', 'time': '4.7ms', 'time_in_nanos': 4722106, 'breakdown': { 'set_min_competitive_score_count': 0, 'match_count': 0, 'shallow_advance_count': 0, 'set_min_competitive_score': 0, 'next_doc': 0, 'match': 0, 'next_doc_count': 0, 'score_count': 0, 'compute_max_score_count': 0, 'compute_max_score': 0, 'advance': 4137499, 'advance_count': 2657, 'score': 0, 'build_scorer_count': 58, 'create_weight': 3970, 'shallow_advance': 0, 'create_weight_count': 1, 'build_scorer': 580637 } } ] } ] } ], 'rewrite_time': 16717, 'collector': [ { 'name': 'SimpleTopScoreDocCollector', 'reason': 'search_top_hits', 'time': '10ms', 'time_in_nanos': 10047191 } ] } ], 'aggregations': [], 'fetch': { 'type': 'fetch', 'description': '', 'time': '3ms', 'time_in_nanos': 3019350, 'breakdown': { 'load_stored_fields': 695081, 'load_source': 1924, 'load_stored_fields_count': 1, 'next_reader_count': 1, 'load_source_count': 1, 'next_reader': 11381 }, 'debug': { 'stored_fields': [ '_id', '_routing', '_source' ] }, 'children': [ { 'type': 'FetchSourcePhase', 'description': '', 'time': '2.2ms', 'time_in_nanos': 2206462, 'breakdown': { 'process_count': 1, 'process': 2203932, 'next_reader': 2530, 'next_reader_count': 1 }, 'debug': { 'fast_path': 0 } } ] } } ```

I could not get recall anywhere close to exact KNN, but I should probably first look at the entity filtration strategies I can take, as the overall goal is to achieve E2E p50 latency under 100ms for an entity doc count of around 500k - 1M. A local implementation showed it is feasible, but handling the scale of data changes is the problem I was hoping to solve with Elasticsearch.

I just want to verify with you whether my judgment is fair that the data filtration phase is what takes so long here; as you mentioned before, it may be affected by the different index structure used when the HNSW graph is applied (index: true)?

Very much appreciate your help and time on this folks 🙏🏻

jdconrad commented 1 year ago

@Tradunsky I apologize for the delayed response to your question here as I was on holiday for a bit.

I find it interesting that your recall wasn't very good with ANN. I wonder if a higher num_candidates would help here, or is it possible that your prefilter is filtering out too many documents? I would be especially curious what a profile for ANN looks like without a filter. Nothing in the profile specifically stands out to me.

For your use cases, I would guess that using an HNSW graph (index:true) would likely better meet your latency requirements with the caveat that you have enough RAM to dedicate to the HNSW data structure.
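
For reference, it is the dense_vector mapping with index: true (plus a similarity, assumed here to be cosine) that enables the HNSW-backed ANN path. The sketch below shows one hypothetical way to create such an index over plain HTTP with java.net.http; the cluster URL and index name are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: create an index whose dense_vector field is HNSW-indexed
// (index: true), so ANN search can be used instead of brute-force script scoring.
// Cluster URL and index name are assumptions; adapt to your environment.
public final class CreateHnswIndex {
    public static void main(String[] args) throws Exception {
        String body = """
            {
              "settings": { "index": { "number_of_replicas": 0, "number_of_shards": 7 } },
              "mappings": {
                "properties": {
                  "entity_id": { "type": "keyword" },
                  "embedding": {
                    "type": "dense_vector",
                    "dims": 768,
                    "index": true,
                    "similarity": "cosine"
                  }
                }
              }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/index-name-eval"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```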

Tradunsky commented 1 year ago

Hi @jdconrad, Sorry, somehow missed your reply.

Ok, so by switching from the hybrid prefilter to a during-search prefilter, accuracy jumped to almost the same level as exact KNN. I have not had a chance to dig into the root cause, but the document filter result should be right around 1M.
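
If "during prefilter" here means the filter placed inside the top-level knn search option, which Elasticsearch applies while traversing the HNSW graph rather than after it, a request could look roughly like the hypothetical sketch below; the index name, k, num_candidates, and the zero query vector are placeholders only.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch of an ANN search where the entity_id filter is applied
// during the HNSW traversal (pre-filtering) via the top-level knn option.
// Index name, entity_id value, k, and num_candidates are illustrative only.
public final class KnnPrefilterSearch {
    public static void main(String[] args) throws Exception {
        // Placeholder query vector with the mapped dimensionality (768).
        String queryVector = IntStream.range(0, 768)
            .mapToObj(i -> "0.0")
            .collect(Collectors.joining(",", "[", "]"));
        String body = """
            {
              "_source": false,
              "knn": {
                "field": "embedding",
                "query_vector": %s,
                "k": 10,
                "num_candidates": 200,
                "filter": { "term": { "entity_id": "1234567890abcdefg" } }
              }
            }
            """.formatted(queryVector);
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/index-name-eval/_search"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```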

Thank you very much for the replies and support! Please feel free to close the ticket, as I think ANN and dimensionality reduction are the ways to go in such cases.

jdconrad commented 1 year ago

@Tradunsky Thanks for the update, and I'm glad you were able to find a solution that works for you! Closing this now.

Tradunsky commented 1 year ago

Linking this here for better visibility: https://github.com/apache/lucene/issues/12091

Tradunsky commented 1 year ago

Great job on incorporating SIMD instructions into 8.9.0! It dropped latencies by at least 50% 💪🏻

Much appreciate the work on that, folks 🙏🏻