janekmichalik opened this issue 2 weeks ago
A few things could be going on here. Can you share more details of how you are generating load? That might give us more of a clue.
One thing to look at is the caches themselves. There are multiple caches in SpiceDB; the sum of their defaults plus the overhead of normal operation may be greater than the 1GB you have available during your load test. I see you have set --dispatch-cluster-cache-max-cost=20%, but there is also --dispatch-cache-max-cost=20%.
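Both cache flags can also be given absolute budgets instead of percentages so that their sum stays predictable against the 1GB limit. A sketch with hypothetical values (check `spicedb serve --help` for the exact accepted units):

```yaml
- --dispatch-cache-max-cost=200MiB          # in-process dispatch cache
- --dispatch-cluster-cache-max-cost=200MiB  # cache for dispatched subproblems
```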
By default, SpiceDB (gRPC, really) spawns a goroutine per request. If you are sending a huge number of parallel requests, it may be spinning up too many goroutines and eating into your overhead. There are flags to control this as well (--grpc-max-workers and --dispatch-cluster-max-workers). Note that right now SpiceDB itself doesn't have any sophisticated load shedding beyond what gRPC provides, so it's still possible to blow up memory with sufficient load. If that's an operational concern, we recommend putting a rate limiter in front of SpiceDB (at least until we have admission controls in place).
It's also possible that something is interfering with connection management on the database side. Are you running e.g. pgbouncer?
If none of these are the issue, it may be that you are hitting some edge case in a specific request that is causing a true memory leak, so knowing what requests you are making would be very helpful. At first glance though I don't think this is a leak - it seems like we just need to do some right-sizing for your test.
Thank you @ecordell for your quick response.

> but there is also --dispatch-cache-max-cost=20%

I left it at the default value of 30%.

> There are flags to control this as well (--grpc-max-workers and --dispatch-cluster-max-workers).

I am testing that right now. Once finished, I will be back with the details.

> Are you running e.g. pgbouncer?

As far as I know, we are not.
@ecordell
The problem still appears even with a limited number of workers.
This is my current configuration:
```yaml
- spicedb
- serve
- --skip-release-check
- --grpc-tls-cert-path=/etc/tls-secrets/tls.crt
- --grpc-tls-key-path=/etc/tls-secrets/tls.key
- --datastore-engine=postgres
- --datastore-conn-uri=postgres://$(POSTGRES_USERNAME):$(POSTGRES_PASSWORD)@$(POSTGRES_HOST)/$(POSTGRES_DATABASE)?sslmode=$(POSTGRES_SSLMODE)
- --telemetry-endpoint=
- --datastore-conn-pool-read-max-open=20
- --datastore-conn-pool-write-max-open=10
- --datastore-gc-interval=5m
- --datastore-gc-window=1h
- --dispatch-cache-metrics=true
- --dispatch-cluster-cache-metrics=true
- --dispatch-concurrency-limit=50
- --dispatch-cache-num-counters=500
- --dispatch-cluster-cache-max-cost=20%
- --dispatch-cluster-cache-num-counters=5000
- --log-level=debug
- --grpc-max-workers=1000
- --dispatch-cluster-max-workers=1000
```
The memory usage keeps increasing during the load:
```
....
impt-spice-db-5cxtg spicedb  95m   655Mi
impt-spice-db-5cxtg spicedb  78m   912Mi
impt-spice-db-5cxtg spicedb  18m   820Mi
impt-spice-db-5cxtg spicedb  349m  954Mi
```
Also, with the current config the load test execution time has increased a lot.
I am generating the load by running e2e tests, which execute various scenarios against my application. Those tests run for about 20 hours. We check permissions on each request made to and within the application (an opa-istio sidecar intercepts the request and validates it against policies in SpiceDB).
```rego
check_authorization_spicedb(spicedb_address, auth_token, resource_type, resource_id, permission, subject_type, subject_id) := response if {
    print("Checking", permission, "permission for", resource_type, "object", resource_id, "and", subject_type, "subject", subject_id)
    request := {
        "url": sprintf("http://%s:8443/v1/permissions/check", [spicedb_address]),
        "method": "POST",
        "headers": {
            "content-type": "application/json",
            "Authorization": concat(" ", ["Bearer", auth_token]),
        },
        "body": {
            "resource": {
                "objectType": resource_type,
                "objectId": resource_id,
            },
            "permission": permission,
            "subject": {"object": {
                "objectType": subject_type,
                "objectId": subject_id,
            }},
            "consistency": {"fullyConsistent": true},
        },
        "cache": true,
    }
    # closing lines reconstructed: the rule sends the request it builds
    response := http.send(request)
}
```
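For reference, here is the request body that rule builds, as a plain Python helper (a hypothetical sketch mirroring the Rego above, not an official client). Note that fullyConsistent forces a fresh evaluation on every request, which bypasses SpiceDB's caches entirely:

```python
def build_check_body(resource_type, resource_id, permission,
                     subject_type, subject_id):
    """Mirror of the Rego rule's body for a v1 /permissions/check call."""
    return {
        "resource": {"objectType": resource_type, "objectId": resource_id},
        "permission": permission,
        "subject": {"object": {"objectType": subject_type,
                               "objectId": subject_id}},
        # fullyConsistent means every check goes to the datastore,
        # which defeats the dispatch caches under load.
        "consistency": {"fullyConsistent": True},
    }
```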
I have also attached the logs from SpiceDB.
What is the QPS of checks during your load test? You will need to size the node to handle the maximum number of parallel requests.
It also looks like you're using HTTP to call Check instead of gRPC; that will have much higher overhead per call. Can you switch to gRPC?
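For example, the same check over gRPC with grpcurl (a sketch with hypothetical address, token, and object names; authzed.api.v1.PermissionsService/CheckPermission is the standard v1 service path):

```sh
grpcurl \
  -H 'authorization: Bearer <auth_token>' \
  -d '{
        "resource": {"objectType": "document", "objectId": "doc1"},
        "permission": "view",
        "subject": {"object": {"objectType": "user", "objectId": "alice"}},
        "consistency": {"fullyConsistent": true}
      }' \
  spicedb.example.com:50051 \
  authzed.api.v1.PermissionsService/CheckPermission
```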
Do you have the ability to scrape and graph Prometheus metrics? We have a lot of metrics that can help diagnose this, including cache-size metrics.
It's perfectly normal for memory to increase with Checks up to a certain point; that's expected as the caches fill. The caches will hit their limit, and beyond that memory use is dictated by the number of in-flight queries, which you control in your load test.
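A rough budget illustrates that plateau. Using the percentages mentioned in this thread against the 1GiB pod limit (a back-of-envelope sketch; it assumes the percent flags are computed against total available memory):

```python
LIMIT_MIB = 1024  # pod memory limit from the issue

# Cache budgets as configured / defaulted in this thread.
dispatch_cache_mib = 0.30 * LIMIT_MIB  # --dispatch-cache-max-cost default of 30%
cluster_cache_mib = 0.20 * LIMIT_MIB   # --dispatch-cluster-cache-max-cost=20%

caches_mib = dispatch_cache_mib + cluster_cache_mib
# What's left for the Go runtime, connection pools, and in-flight requests.
headroom_mib = LIMIT_MIB - caches_mib

print(f"caches may grow to ~{caches_mib:.0f} MiB, "
      f"leaving ~{headroom_mib:.0f} MiB of headroom")
```

With half the limit reserved for caches, a burst of parallel fully-consistent checks can plausibly push the pod past 1GiB without any leak.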
What platforms are affected?
linux
What architectures are affected?
amd64
What SpiceDB version are you using?
v1.30.0, v1.33.0
Steps to Reproduce
Use SpiceDB for permission checks.
SpiceDB runtime configuration: as posted above, but I have also tried with the default one.
It is running within a k3s pod.
Expected Result
Memory should not keep increasing; it should be released.
Actual Result
Memory use increases with each permission check during performance/load tests.
My pod has a 1GB memory limit, and when it reaches the limit, it starts to throw 403 errors.
It looks like the cache is never released.