cellarium-ai / cellarium-cloud

Cellarium Cloud Core Library
BSD 3-Clause "New" or "Revised" License

VectorSearchResponseError: Number of query ids (200) and knn matches (0) does not match. #127

Closed by sentry-io[bot] 6 months ago

sentry-io[bot] commented 7 months ago

Sentry Issue: CELLARIUM-CLOUD-10

VectorSearchResponseError: Number of query ids (200) and knn matches (0) does not match. This could probably be caused by Vector Search overload.
(11 additional frame(s) were not displayed)
...
  File "casp/services/api/routers/cell_operations_router.py", line 34, in annotate
    return await cell_operations_service.annotate_adata_file(
  File "casp/services/api/services/cell_operations_service.py", line 166, in annotate_adata_file
    knn_response = self.get_knn_matches(embeddings=embeddings, model_name=model_name)
  File "casp/services/api/services/cell_operations_service.py", line 103, in get_knn_matches
    self.__validate_knn_response(embeddings=embeddings, knn_response=matches)
  File "casp/services/api/services/cell_operations_service.py", line 80, in __validate_knn_response
    raise exceptions.VectorSearchResponseError(
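
For context, the check that raises here boils down to a count comparison between query embeddings and the match lists returned by Vector Search; a paraphrased sketch (not the exact code in cell_operations_service.py):

# Paraphrased sketch (not the actual casp implementation): the validation
# compares the number of query embeddings to the number of match lists
# returned by Vector Search and raises when they differ, as in this event
# (200 queries, 0 matches).
class VectorSearchResponseError(Exception):
    """Stand-in for casp's exceptions.VectorSearchResponseError."""

def validate_knn_response(embeddings, matches):
    if len(matches) != len(embeddings):
        raise VectorSearchResponseError(
            f"Number of query ids ({len(embeddings)}) and knn matches "
            f"({len(matches)}) does not match. "
            "This could probably be caused by Vector Search overload."
        )
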
fedorgrab commented 7 months ago

After the conversation with the Vertex AI team, it seems that splitting the embedding array into smaller batches (5-20 cells) before submitting it to Vector Search can significantly improve throughput and resolve this issue.

Note: The Vertex AI team mentioned that the number of nearest neighbors (the value set during index creation) should represent the total number of neighbors per request that the index searches for. This means that if we aim to search for 100 neighbors and we have 5 cells per request, the index would search for a total of 500 neighbors. If a request exceeds the maximum number of neighbors allowed for search, it triggers a switch to the brute-force algorithm, significantly decreasing performance.
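
A minimal sketch of that batching on the client side (the batched_knn_search helper and the search_fn callable are hypothetical illustrations, not the cellarium-cloud code):

# Hypothetical sketch (not the cellarium-cloud implementation): split query
# embeddings into small batches before sending them to Vector Search.
# `search_fn` stands in for whatever call submits one batch of queries.
import numpy as np

def batched_knn_search(embeddings: np.ndarray, search_fn, batch_size: int = 10, num_neighbors: int = 100):
    all_matches = []
    for start in range(0, embeddings.shape[0], batch_size):
        batch = embeddings[start:start + batch_size]
        # Per the note above, the neighbor budget is counted per request:
        # e.g. 10 cells x 100 neighbors = 1,000 neighbors for this request.
        all_matches.extend(search_fn(batch, num_neighbors=num_neighbors))
    return all_matches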

fedorgrab commented 6 months ago

Index parameters that we need to test with smaller batches:

evolvedmicrobe commented 6 months ago

Looking at the configuration, it looks like leafNodesToSearchPercent could be a pretty important parameter. By default it's using 10%, which I think means every barcode is being compared to up to 3.3M cells, and it feels like that has to slow things way down if so. Do you know if there's a reason they didn't recommend changing this?
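
For reference, here is roughly where that knob lives if the index is created with the Vertex AI Python SDK (a sketch assuming google-cloud-aiplatform's create_tree_ah_index; the names and values below are placeholders, not our actual configuration):

# Sketch only: creating a tree-AH index with an explicit
# leaf_nodes_to_search_percent. All values are illustrative placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="example-index",
    contents_delta_uri="gs://example-bucket/embeddings/",
    dimensions=512,
    approximate_neighbors_count=100,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=1000,
    # Fraction of leaves scanned per query; lowering it from the 10% default
    # trades recall for speed.
    leaf_nodes_to_search_percent=3,
)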

evolvedmicrobe commented 6 months ago

Also, should we change SHARD_SIZE_XXX? Not sure what it's set to at the moment.

KevinCLydon commented 6 months ago

Quick progress update from yesterday:

fedorgrab commented 6 months ago

> Also, should we change SHARD_SIZE_XXX? Not sure what it's set to at the moment.

Can you please clarify what SHARD_SIZE_XXX is? Where can we find this parameter?

evolvedmicrobe commented 6 months ago

@fedorgrab See this section on SHARD_SIZE. It sounds like you are required to specify this during index creation, but I didn't see us specify it anywhere. It also seems to determine the machine type used by the Vector Search deployment (I had trouble figuring out what we are using there too; do you know? Is it picked by default somehow?)

fedorgrab commented 6 months ago

> @fedorgrab See this section on SHARD_SIZE. It sounds like you are required to specify this during index creation, but I didn't see us specify it anywhere. It also seems to determine the machine type used by the Vector Search deployment (I had trouble figuring out what we are using there too; do you know? Is it picked by default somehow?)

Shard size gets assigned automatically. Machine type is either default or also automatic.

evolvedmicrobe commented 6 months ago

> Shard size gets assigned automatically. Machine type is either default or also automatic.

Hmmm... the support docs say "When you create an index, you must specify the size of the shards to use" and "The machine types that you can use to deploy your index ... depends on the shard size of the index", which makes it sound like that's not the case.

I'm wondering if, by not specifying a SHARD_SIZE, it is defaulting to something sub-optimal like SHARD_SIZE_SMALL and so limiting throughput. Did you learn from somewhere else how this was all being established? And would it be worth trying SHARD_SIZE_LARGE to see if that could help unblock stuff?

fedorgrab commented 6 months ago

>> Shard size gets assigned automatically. Machine type is either default or also automatic.

> Hmmm... the support docs say "When you create an index, you must specify the size of the shards to use" and "The machine types that you can use to deploy your index ... depends on the shard size of the index", which makes it sound like that's not the case.

> I'm wondering if, by not specifying a SHARD_SIZE, it is defaulting to something sub-optimal like SHARD_SIZE_SMALL and so limiting throughput. Did you learn from somewhere else how this was all being established? And would it be worth trying SHARD_SIZE_LARGE to see if that could help unblock stuff?

I think increasing the shard size will reduce the number of shards (because the capacity of each shard becomes larger), so we would end up with lower throughput.

Our current shard_size is medium; you can check one of the indexes that we have: https://console.cloud.google.com/vertex-ai/locations/us-central1/indexes/766284837769183232/deployments?project=dsp-cell-annotation-service
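
The same information should be retrievable outside the console too; a sketch using the Python SDK (assuming aiplatform.MatchingEngineIndex can be loaded by the numeric ID from the URL above):

# Sketch: inspect an existing index's metadata with the Vertex AI SDK to
# confirm what config (algorithm parameters, shard size) it was created with.
from google.cloud import aiplatform

aiplatform.init(project="dsp-cell-annotation-service", location="us-central1")
index = aiplatform.MatchingEngineIndex(index_name="766284837769183232")
# The raw resource proto carries the index metadata/config as created.
print(index.gca_resource)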

fedorgrab commented 6 months ago

>> Shard size gets assigned automatically. Machine type is either default or also automatic.

> Hmmm... the support docs say "When you create an index, you must specify the size of the shards to use" and "The machine types that you can use to deploy your index ... depends on the shard size of the index", which makes it sound like that's not the case.

> I'm wondering if, by not specifying a SHARD_SIZE, it is defaulting to something sub-optimal like SHARD_SIZE_SMALL and so limiting throughput. Did you learn from somewhere else how this was all being established? And would it be worth trying SHARD_SIZE_LARGE to see if that could help unblock stuff?

Also, the documentation you sent previously states which default shard size is assigned to each machine type.

fedorgrab commented 6 months ago

@evolvedmicrobe But anyway, I added shard size to the list of things to try out.

10xjeff commented 6 months ago

@fedorgrab @KevinCLydon any news from stress testing yesterday?

KevinCLydon commented 6 months ago

@10xjeff No big updates from my end, unfortunately. I'm still tweaking some of the retry logic and the batch sizes to see if I can get the error rate down.

fedorgrab commented 6 months ago

@10xjeff, we experimented with the approximate_neighbors_count, autoscaling node count parameters, and batching. Here are the findings:

Insights:

Conclusion: Batching and adjusting approximate_neighbors_count may enhance throughput but don't address failure issues. To make the system resilient to our traffic, implementing retry logic for vector searches and/or queuing is necessary.

We also plan to further explore adjustments to leaf_node_size and shard size.
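
As a rough illustration of the retry direction (a sketch only; the function name, backoff values, and error handling are placeholders rather than the logic in the actual PR):

# Sketch of retry-with-backoff around a vector search call. Names and
# parameters here are illustrative, not the actual implementation.
import time

def retry_knn_search(search_fn, embeddings, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            matches = search_fn(embeddings)
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff before retrying an overloaded Vector Search endpoint.
            time.sleep(base_delay * 2 ** (attempt - 1))
            continue
        # Treat a short/empty match list (the error in this issue) as retryable too.
        if len(matches) != len(embeddings):
            if attempt == max_attempts:
                raise RuntimeError("Vector Search returned fewer matches than queries")
            time.sleep(base_delay * 2 ** (attempt - 1))
            continue
        return matches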

evolvedmicrobe commented 6 months ago

> Yet, introducing batches significantly increased the QPS (queries per second) from approximately 0.4 to over 50-60.

That feels like an unexpectedly low number. For something to compare against, I took the human-pca-10x-only-512-log1p-v1 dataset and loaded it with the ScaNN library following Google's example, just to test what kind of QPS numbers we might get.

import time

import scann

# numpy_matrix holds the embeddings loaded from the dataset; the first column
# is an id, so it is dropped when building and querying the searcher.
# Made a scann searcher
searcher = scann.scann_ops_pybind.builder(numpy_matrix[:, 1:], 100, "dot_product").tree(
    num_leaves=5359, num_leaves_to_search=10, training_sample_size=250000).score_ah(
    2, anisotropic_quantization_threshold=0.2).reorder(100).build()
# Benchmarked it with the first 100 rows as queries
start = time.time()
neighbors, distances = searcher.search_batched(numpy_matrix[:100, 1:], leaves_to_search=1000)
end = time.time()
print("1000 Leaves Time w/ 100 Neighbors QPS:", neighbors.shape[0] / float(end - start))
start = time.time()
neighbors, distances = searcher.search_batched(numpy_matrix[:100, 1:], leaves_to_search=150)
end = time.time()
print("150 Leaves Time w/ 100 Neighbors QPS:", neighbors.shape[0] / float(end - start))

Result:

1000 Leaves Time w/ 100 Neighbors QPS: 45.19609686608152
150 Leaves Time w/ 100 Neighbors QPS: 256.2399121247555
150 Leaves Time w/ 50 Neighbors QPS: 271.10153354525204

And on our machine I was getting >45 QPS even when searching through 1000 leaves, which makes me think we should be getting faster results. It might be worth telling Google just how bad the QPS is and seeing if they have any further ideas. I think fiddling with the partitioning parameters should hopefully help a lot, though.

KevinCLydon commented 6 months ago

Quick-ish update: Did some testing yesterday with different batch sizes and retry logic and had some success in reducing error count, but it's hard to tell if the reduction is actually a result of the index having scaled up before my successful tests. Today, I put together a script to run several tests with different configurations of batch sizes, file sizes, and retry params and print some info on run duration and exceptions to a CSV. I'm gonna run it probably overnight tonight and then check Monday to see what changes had the most effect. I'll probably have to cross reference with some of the error reporting and activity monitoring in the cloud console (that stuff doesn't all seem to be exposed to the REST API or python SDK, unfortunately). I'll report my findings on Monday.

We also have another meeting with Google Monday afternoon, so we'll hopefully get some useful info out of that.
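
Roughly, the sweep harness looks like this (a simplified sketch; run_annotation_test is a hypothetical stand-in for the real test call, and the parameter grids and CSV columns are illustrative):

# Simplified sketch of the benchmarking sweep: run each (batch_size, file_size,
# max_retries) combination and record duration and any exception to a CSV.
import csv
import itertools
import time

def sweep(run_annotation_test, out_path="sweep_results.csv"):
    batch_sizes = [5, 10, 20]
    file_sizes = [100, 1000, 10000]
    max_retries = [0, 3, 5]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["batch_size", "file_size", "max_retries", "duration_s", "exception"])
        for bs, fs, mr in itertools.product(batch_sizes, file_sizes, max_retries):
            start = time.time()
            error = ""
            try:
                run_annotation_test(batch_size=bs, file_size=fs, max_retries=mr)
            except Exception as exc:
                error = repr(exc)
            writer.writerow([bs, fs, mr, round(time.time() - start, 2), error])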

KevinCLydon commented 6 months ago

Another quick update: Seeing performance improvements and significant reduction in error counts with small batches + retry change. I have a PR for this right now that is being reviewed and iterated on, so we should be pretty close to getting those changes in.