cellarium-ai / cellarium-cloud

Cellarium Cloud Core Library
BSD 3-Clause "New" or "Revised" License

VectorSearchResponseError: Number of query ids (200) and knn matches (0) does not match. #127

Closed by sentry-io[bot] 6 months ago

sentry-io[bot] commented 7 months ago

Sentry Issue: CELLARIUM-CLOUD-10

VectorSearchResponseError: Number of query ids (200) and knn matches (0) does not match. This could probably be caused by Vector Search overload.
(11 additional frame(s) were not displayed)
...
  File "casp/services/api/routers/cell_operations_router.py", line 34, in annotate
    return await cell_operations_service.annotate_adata_file(
  File "casp/services/api/services/cell_operations_service.py", line 166, in annotate_adata_file
    knn_response = self.get_knn_matches(embeddings=embeddings, model_name=model_name)
  File "casp/services/api/services/cell_operations_service.py", line 103, in get_knn_matches
    self.__validate_knn_response(embeddings=embeddings, knn_response=matches)
  File "casp/services/api/services/cell_operations_service.py", line 80, in __validate_knn_response
    raise exceptions.VectorSearchResponseError(
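
For context, the check that raises here boils down to a count comparison between query embeddings and the match lists returned by Vector Search; a paraphrased sketch (not the exact code in cell_operations_service.py):

# Paraphrased sketch (not the actual casp implementation): the validation
# compares the number of query embeddings to the number of match lists
# returned by Vector Search and raises when they differ, as in this event
# (200 queries, 0 matches).
class VectorSearchResponseError(Exception):
    """Stand-in for casp's exceptions.VectorSearchResponseError."""

def validate_knn_response(embeddings, matches):
    if len(matches) != len(embeddings):
        raise VectorSearchResponseError(
            f"Number of query ids ({len(embeddings)}) and knn matches "
            f"({len(matches)}) does not match. "
            "This could probably be caused by Vector Search overload."
        )
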
fedorgrab commented 7 months ago

After the conversation with the Vertex AI team, it seems that splitting the embedding array into smaller batches (5-20 cells) before submitting it to Vector Search can significantly improve throughput and resolve this issue.

Note: The Vertex AI team mentioned that the number of nearest neighbors (the value set during index creation) should represent the total number of neighbors per request that the index searches for. This means that if we aim to search for 100 neighbors and we have 5 cells per request, the index would search for a total of 500 neighbors. If a request exceeds the maximum number of neighbors allowed for search, it triggers a switch to the brute-force algorithm, significantly decreasing performance.
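
A minimal sketch of that batching on the client side (the batched_knn_search helper and the search_fn callable are hypothetical illustrations, not the cellarium-cloud code):

# Hypothetical sketch (not the cellarium-cloud implementation): split query
# embeddings into small batches before sending them to Vector Search.
# `search_fn` stands in for whatever call submits one batch of queries.
import numpy as np

def batched_knn_search(embeddings: np.ndarray, search_fn, batch_size: int = 10, num_neighbors: int = 100):
    all_matches = []
    for start in range(0, embeddings.shape[0], batch_size):
        batch = embeddings[start:start + batch_size]
        # Per the note above, the neighbor budget is counted per request:
        # e.g. 10 cells x 100 neighbors = 1,000 neighbors for this request.
        all_matches.extend(search_fn(batch, num_neighbors=num_neighbors))
    return all_matches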

fedorgrab commented 6 months ago

Index parameters that we need to test with smaller batches:

evolvedmicrobe commented 6 months ago

Looking at the configuration, it looks like leafNodesToSearchPercent could be a pretty important parameter. By default it's using 10%, which I think means every barcode is being compared to up to 3.3M cells, and it feels like that has to slow things way down if so. Do you know if there's a reason they didn't recommend changing this?
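
For reference, here is roughly where that knob lives if the index is created with the Vertex AI Python SDK (a sketch assuming google-cloud-aiplatform's create_tree_ah_index; the names and values below are placeholders, not our actual configuration):

# Sketch only: creating a tree-AH index with an explicit
# leaf_nodes_to_search_percent. All values are illustrative placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="example-index",
    contents_delta_uri="gs://example-bucket/embeddings/",
    dimensions=512,
    approximate_neighbors_count=100,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=1000,
    # Fraction of leaves scanned per query; lowering it from the 10% default
    # trades recall for speed.
    leaf_nodes_to_search_percent=3,
)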

evolvedmicrobe commented 6 months ago

Also, should we change SHARD_SIZE_XXX? Not sure what it's set to at the moment.

KevinCLydon commented 6 months ago

Quick progress update from yesterday:

fedorgrab commented 6 months ago

> Also, should we change SHARD_SIZE_XXX? Not sure what it's set to at the moment.

Can you please clarify what SHARD_SIZE_XXX is? Where can we find this parameter?

evolvedmicrobe commented 6 months ago

@fedorgrab See this section on SHARD_SIZE. It sounds like you are required to specify this during index creation, but I didn't see us specify it anywhere. It also seems to determine the machine type used by the Vector Search deployment (I had trouble figuring out what we are using there too; do you know? Is it picked by default somehow?)

fedorgrab commented 6 months ago

> @fedorgrab See this section on SHARD_SIZE. It sounds like you are required to specify this during index creation, but I didn't see us specify it anywhere. It also seems to determine the machine type used by the Vector Search deployment (I had trouble figuring out what we are using there too; do you know? Is it picked by default somehow?)

Shard size gets assigned automatically. Machine type is either default or also automatic.

evolvedmicrobe commented 6 months ago

> Shard size gets assigned automatically. Machine type is either default or also automatic.

Hmmm... the support docs say "When you create an index, you must specify the size of the shards to use" and "The machine types that you can use to deploy your index ... depends on the shard size of the index", which makes it sound like that's not the case.

I'm wondering if, by not specifying a SHARD_SIZE, it is defaulting to something sub-optimal like SHARD_SIZE_SMALL and so limiting throughput. Did you learn from somewhere else how this was all being established? And would it be worth trying SHARD_SIZE_LARGE to see if that could help unblock stuff?

fedorgrab commented 6 months ago

>> Shard size gets assigned automatically. Machine type is either default or also automatic.

> Hmmm... the support docs say "When you create an index, you must specify the size of the shards to use" and "The machine types that you can use to deploy your index ... depends on the shard size of the index", which makes it sound like that's not the case.

> I'm wondering if, by not specifying a SHARD_SIZE, it is defaulting to something sub-optimal like SHARD_SIZE_SMALL and so limiting throughput. Did you learn from somewhere else how this was all being established? And would it be worth trying SHARD_SIZE_LARGE to see if that could help unblock stuff?

I think increasing the shard size will reduce the number of shards (because the capacity of each shard becomes larger), so we would end up with lower throughput.

Our current shard_size is medium; you can check one of the indexes that we have: https://console.cloud.google.com/vertex-ai/locations/us-central1/indexes/766284837769183232/deployments?project=dsp-cell-annotation-service
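
The same information should be retrievable outside the console too; a sketch using the Python SDK (assuming aiplatform.MatchingEngineIndex can be loaded by the numeric ID from the URL above):

# Sketch: inspect an existing index's metadata with the Vertex AI SDK to
# confirm what config (algorithm parameters, shard size) it was created with.
from google.cloud import aiplatform

aiplatform.init(project="dsp-cell-annotation-service", location="us-central1")
index = aiplatform.MatchingEngineIndex(index_name="766284837769183232")
# The raw resource proto carries the index metadata/config as created.
print(index.gca_resource)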

fedorgrab commented 6 months ago

>> Shard size gets assigned automatically. Machine type is either default or also automatic.

> Hmmm... the support docs say "When you create an index, you must specify the size of the shards to use" and "The machine types that you can use to deploy your index ... depends on the shard size of the index", which makes it sound like that's not the case.

> I'm wondering if, by not specifying a SHARD_SIZE, it is defaulting to something sub-optimal like SHARD_SIZE_SMALL and so limiting throughput. Did you learn from somewhere else how this was all being established? And would it be worth trying SHARD_SIZE_LARGE to see if that could help unblock stuff?

Also, the documentation you sent previously states which default shard size is assigned to each machine type.

fedorgrab commented 6 months ago

@evolvedmicrobe But anyway, I added shard size to the list of things to try out.

10xjeff commented 6 months ago

@fedorgrab @KevinCLydon any news from stress testing yesterday?

KevinCLydon commented 6 months ago

@10xjeff No big updates from my end, unfortunately. I'm still tweaking some of the retry logic and the batch sizes to see if I can get the error rate down.

fedorgrab commented 6 months ago

@10xjeff, we experimented with the approximate_neighbors_count, autoscaling node count parameters, and batching. Here are the findings:

Insights:

Conclusion: Batching and adjusting approximate_neighbors_count may enhance throughput but don't address failure issues. To make the system resilient to our traffic, implementing retry logic for vector searches and/or queuing is necessary.

We also plan to further explore adjustments to leaf_node_size and shard size.
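
As a rough illustration of the retry direction (a sketch only; the function name, backoff values, and error handling are placeholders rather than the logic in the actual PR):

# Sketch of retry-with-backoff around a vector search call. Names and
# parameters here are illustrative, not the actual implementation.
import time

def retry_knn_search(search_fn, embeddings, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            matches = search_fn(embeddings)
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff before retrying an overloaded Vector Search endpoint.
            time.sleep(base_delay * 2 ** (attempt - 1))
            continue
        # Treat a short/empty match list (the error in this issue) as retryable too.
        if len(matches) != len(embeddings):
            if attempt == max_attempts:
                raise RuntimeError("Vector Search returned fewer matches than queries")
            time.sleep(base_delay * 2 ** (attempt - 1))
            continue
        return matches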

evolvedmicrobe commented 6 months ago

> Yet, introducing batches significantly increased the QPS (queries per second) from approximately 0.4 to over 50-60.

That feels like an unexpectedly low number. For something to compare against, I took the human-pca-10x-only-512-log1p-v1 dataset and loaded it with the ScaNN library following Google's example, just to test what kind of QPS numbers we might get.

import time

import scann

# numpy_matrix holds the embeddings loaded from the dataset; the first column
# is an id, so it is dropped when building and querying the searcher.
# Made a scann searcher
searcher = scann.scann_ops_pybind.builder(numpy_matrix[:, 1:], 100, "dot_product").tree(
    num_leaves=5359, num_leaves_to_search=10, training_sample_size=250000).score_ah(
    2, anisotropic_quantization_threshold=0.2).reorder(100).build()
# Benchmarked it with the first 100 rows as queries
start = time.time()
neighbors, distances = searcher.search_batched(numpy_matrix[:100, 1:], leaves_to_search=1000)
end = time.time()
print("1000 Leaves Time w/ 100 Neighbors QPS:", neighbors.shape[0] / float(end - start))
start = time.time()
neighbors, distances = searcher.search_batched(numpy_matrix[:100, 1:], leaves_to_search=150)
end = time.time()
print("150 Leaves Time w/ 100 Neighbors QPS:", neighbors.shape[0] / float(end - start))

Result:

1000 Leaves Time w/ 100 Neighbors QPS: 45.19609686608152
150 Leaves Time w/ 100 Neighbors QPS: 256.2399121247555
150 Leaves Time w/ 50 Neighbors QPS: 271.10153354525204

And on our machine I was getting >45 QPS even when searching through 1000 leaves, which makes me think we should be getting faster results. It might be worth telling Google just how bad the QPS is and seeing if they have any further ideas. I think fiddling with the partitioning parameters should hopefully help a lot, though.

KevinCLydon commented 6 months ago

Quick-ish update: Did some testing yesterday with different batch sizes and retry logic and had some success in reducing error count, but it's hard to tell if the reduction is actually a result of the index having scaled up before my successful tests. Today, I put together a script to run several tests with different configurations of batch sizes, file sizes, and retry params and print some info on run duration and exceptions to a CSV. I'm gonna run it probably overnight tonight and then check Monday to see what changes had the most effect. I'll probably have to cross reference with some of the error reporting and activity monitoring in the cloud console (that stuff doesn't all seem to be exposed to the REST API or python SDK, unfortunately). I'll report my findings on Monday.

We also have another meeting with Google Monday afternoon, so we'll hopefully get some useful info out of that.
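
Roughly, the sweep harness looks like this (a simplified sketch; run_annotation_test is a hypothetical stand-in for the real test call, and the parameter grids and CSV columns are illustrative):

# Simplified sketch of the benchmarking sweep: run each (batch_size, file_size,
# max_retries) combination and record duration and any exception to a CSV.
import csv
import itertools
import time

def sweep(run_annotation_test, out_path="sweep_results.csv"):
    batch_sizes = [5, 10, 20]
    file_sizes = [100, 1000, 10000]
    max_retries = [0, 3, 5]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["batch_size", "file_size", "max_retries", "duration_s", "exception"])
        for bs, fs, mr in itertools.product(batch_sizes, file_sizes, max_retries):
            start = time.time()
            error = ""
            try:
                run_annotation_test(batch_size=bs, file_size=fs, max_retries=mr)
            except Exception as exc:
                error = repr(exc)
            writer.writerow([bs, fs, mr, round(time.time() - start, 2), error])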

KevinCLydon commented 6 months ago

Another quick update: Seeing performance improvements and significant reduction in error counts with small batches + retry change. I have a PR for this right now that is being reviewed and iterated on, so we should be pretty close to getting those changes in.