Closed by sentry-io[bot] 6 months ago
After the conversation with the Vertex AI team, it seems that splitting the embedding array into smaller batches (5-20 cells) before submitting it to Vector Search can significantly improve throughput and resolve this issue.
Note: The Vertex AI team mentioned that the number of nearest neighbors (the value set during index creation) should represent the total number of neighbors per request that the index searches for. This means that if we aim to search for 100 neighbors and we have 5 cells per request, the index would search for a total of 500 neighbors. If the current request exceeds the maximum number of neighbors allowed for search, it triggers a switch to the brute force algorithm, significantly decreasing performance.
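A minimal sketch of what that batching could look like against a deployed index endpoint, assuming the standard Python SDK; the project, endpoint name, deployed index ID, and batch/neighbor values below are placeholders, not our actual deployment:

```python
# Sketch only: query a deployed Vertex AI index in small batches of cells.
# Project, endpoint name, and deployed index ID are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project

endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/.../locations/us-central1/indexEndpoints/..."  # placeholder
)

BATCH_SIZE = 10      # 5-20 cells per request, per the Vertex AI team's suggestion
NUM_NEIGHBORS = 100  # neighbors requested per cell

def batched_find_neighbors(embeddings, deployed_index_id):
    """Split the embedding array into small batches and query each batch separately."""
    results = []
    for start in range(0, len(embeddings), BATCH_SIZE):
        batch = embeddings[start:start + BATCH_SIZE]
        # Each request carries BATCH_SIZE queries, so the index looks for
        # BATCH_SIZE * NUM_NEIGHBORS neighbors in total for this call (see the note above).
        response = endpoint.find_neighbors(  # assumes a public endpoint
            deployed_index_id=deployed_index_id,
            queries=[list(map(float, row)) for row in batch],
            num_neighbors=NUM_NEIGHBORS,
        )
        results.extend(response)
    return results
```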
Index parameters that we need to test with smaller batches:
- approximate_neighbors_count (set at index creation, not at search time). The total number of neighbors in each query should not exceed this value, so a batch of 20 cells with 100 neighbors each needs at least approximate_neighbors_count=2000.
- leaf_node_size
Looking at the configuration, leafNodesToSearchPercent could be a pretty important parameter. It looks like it defaults to 10%, which I think means every barcode is being compared to up to 3.3M cells, and that has to slow things way down if so. Do you know if there's a reason they didn't recommend changing this?
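For reference, a minimal sketch of how these tree-AH parameters might be passed at index creation with the Python SDK; the display name, GCS path, dimensions, and values are placeholders rather than our actual config:

```python
# Illustrative only: how the tree-AH parameters discussed above might be set
# at index creation. The display name, GCS URI, dimensions, and values are placeholders.
from google.cloud import aiplatform

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="cas-index-test",                    # placeholder name
    contents_delta_uri="gs://my-bucket/embeddings/",  # placeholder GCS path
    dimensions=512,
    approximate_neighbors_count=2000,   # >= batch size * neighbors per query (20 * 100)
    leaf_node_embedding_count=1000,     # presumably what "leaf_node_size" above refers to
    leaf_nodes_to_search_percent=1,     # default is 10; lowering it should cut per-query work
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)
```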
Also, should we change SHARD_SIZE_XXX? Not sure what it's set to at the moment.
Quick progress update from yesterday:
> Also, should we change SHARD_SIZE_XXX? Not sure what it's set to at the moment.
Can you please clarify what SHARD_SIZE_XXX is? Where can we find this parameter?
@fedorgrab See this section on SHARD_SIZE. It sounds like you are required to specify this during index creation, but I didn't see us specify it anywhere. It also seems to determine the size of the machine type used by the vector search (I had trouble figuring out which machine type we're using, too; do you know? Is it picked by default somehow?)
Shard size gets assigned automatically. Machine type is either default or also automatic.
Hmmm... the support docs say "When you create an index, you must specify the size of the shards to use" and "The machine types that you can use to deploy your index ... depends on the shard size of the index," which makes it sound like that's not the case. I'm wondering if, by not specifying a SHARD_SIZE, we are defaulting to something sub-optimal like SHARD_SIZE_SMALL and limiting throughput. Did you learn from somewhere else how this was all being established? And would it be worth trying SHARD_SIZE_LARGE to see if that could help unblock stuff?
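If we did want to pin it explicitly, newer versions of the Python SDK appear to accept a shard_size argument at index creation. A minimal sketch, assuming that kwarg is available in the SDK version we use (all other values are placeholders):

```python
# Assumption: shard_size can be passed to create_tree_ah_index in recent SDK versions.
# All other values are placeholders copied from the sketch earlier in the thread.
from google.cloud import aiplatform

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="cas-index-shard-test",              # placeholder
    contents_delta_uri="gs://my-bucket/embeddings/",  # placeholder
    dimensions=512,
    approximate_neighbors_count=2000,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    shard_size="SHARD_SIZE_LARGE",  # instead of whatever is being assigned by default
)
```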
I think increasing shard size will reduce the number of shards (because the capacity of each shard becomes larger), so we would actually get lower throughput.
Our current shard_size is medium; you can check one of the indexes that we have: https://console.cloud.google.com/vertex-ai/locations/us-central1/indexes/766284837769183232/deployments?project=dsp-cell-annotation-service
Also, the documents you sent previously state which default shard size is assigned to each machine type.
@evolvedmicrobe But anyway, I added shard size to the list of things to try out.
@fedorgrab @KevinCLydon any news from stress testing yesterday?
@10xjeff No big updates from my end, unfortunately. I'm still tweaking some of the retry logic and the batch sizes to see if I can get the error rate down.
@10xjeff, we experimented with the approximate_neighbors_count and autoscaling node count parameters, and with batching. Here are the findings:
Insights:
Conclusion: Batching and adjusting approximate_neighbors_count may enhance throughput but don't address failure issues. To make the system resilient to our traffic, implementing retry logic for vector searches and/or queuing is necessary.
We also plan to further explore adjustments to leaf_node_size and shard size.
Still, introducing batches significantly increased the QPS (queries per second) from approximately 0.4 to 50-60.
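A minimal sketch of what the retry side could look like, assuming the search is wrapped in a callable; the exception types, attempt counts, and backoff values are illustrative rather than a final design:

```python
# Rough sketch of retry-with-backoff around a vector search call.
# `run_batch_query` stands in for whatever function actually issues find_neighbors.
import random
import time

from google.api_core import exceptions as gapi_exceptions

def query_with_retries(run_batch_query, max_attempts=5, base_delay=1.0):
    """Retry a vector search call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_batch_query()
        except (gapi_exceptions.ServiceUnavailable,
                gapi_exceptions.DeadlineExceeded,
                gapi_exceptions.ResourceExhausted) as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            time.sleep(delay)
```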
That feels like an unexpectedly low number. For something to compare against, I took the human-pca-10x-only-512-log1p-v1 dataset and loaded it with the ScaNN library following Google's example, just to test what kind of QPS numbers we might get.
```python
import time

import scann

# numpy_matrix: the dataset loaded as a 2-D NumPy array
# (column 0 is skipped below, presumably an ID column).

# Build a ScaNN searcher over the embeddings (tree partitioning + asymmetric hashing + reordering).
searcher = scann.scann_ops_pybind.builder(numpy_matrix[:, 1:], 100, "dot_product").tree(
    num_leaves=5359, num_leaves_to_search=10, training_sample_size=250000).score_ah(
    2, anisotropic_quantization_threshold=0.2).reorder(100).build()

# Benchmark: 100 query cells, searching 1000 leaves vs 150 leaves.
start = time.time()
neighbors, distances = searcher.search_batched(numpy_matrix[:100, 1:], leaves_to_search=1000)
end = time.time()
print("1000 Leaves Time w/ 100 Neighbors QPS:", neighbors.shape[0] / float(end - start))

start = time.time()
neighbors, distances = searcher.search_batched(numpy_matrix[:100, 1:], leaves_to_search=150)
end = time.time()
print("150 Leaves Time w/ 100 Neighbors QPS:", neighbors.shape[0] / float(end - start))
```
Result:
1000 Leaves Time w/ 100 Neighbors QPS: 45.19609686608152
150 Leaves Time w/ 100 Neighbors QPS: 256.2399121247555
150 Leaves Time w/ 50 Neighbors: QPS 271.10153354525204
And on our machine I was getting >45 QPS even when searching through 1000 leaves, which makes me think we should be getting faster results. It might be worth telling Google just how bad the QPS is and seeing if they have any further ideas. I think fiddling with the partitioning parameters should hopefully help a lot, though.
Quick-ish update: Did some testing yesterday with different batch sizes and retry logic and had some success in reducing error count, but it's hard to tell if the reduction is actually a result of the index having scaled up before my successful tests. Today, I put together a script to run several tests with different configurations of batch sizes, file sizes, and retry params and print some info on run duration and exceptions to a CSV. I'm gonna run it probably overnight tonight and then check Monday to see what changes had the most effect. I'll probably have to cross reference with some of the error reporting and activity monitoring in the cloud console (that stuff doesn't all seem to be exposed to the REST API or python SDK, unfortunately). I'll report my findings on Monday.
We also have another meeting with Google Monday afternoon, so we'll hopefully get some useful info out of that.
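For reference, a minimal sketch of the kind of sweep harness described above; run_workload, the parameter grid, and the CSV columns are placeholders rather than the actual script (file sizes and retry params would be added the same way):

```python
# Rough sketch of a parameter sweep: run a query workload for each batch-size/retry
# combination and log duration and exceptions to a CSV for later comparison.
# `run_workload` is a placeholder for whatever issues the actual vector searches.
import csv
import itertools
import time

def sweep(run_workload, out_path="stress_test_results.csv"):
    batch_sizes = [5, 10, 20]
    max_retries = [0, 3, 5]
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["batch_size", "max_retries", "duration_s", "error"])
        for batch_size, retries in itertools.product(batch_sizes, max_retries):
            start = time.time()
            error = ""
            try:
                run_workload(batch_size=batch_size, max_retries=retries)
            except Exception as exc:  # record the failure and keep sweeping
                error = repr(exc)
            writer.writerow([batch_size, retries, round(time.time() - start, 2), error])
```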
Another quick update: Seeing performance improvements and significant reduction in error counts with small batches + retry change. I have a PR for this right now that is being reviewed and iterated on, so we should be pretty close to getting those changes in.
Sentry Issue: CELLARIUM-CLOUD-10