authzed / spicedb

Open Source, Google Zanzibar-inspired permissions database to enable fine-grained authorization for customer applications
https://authzed.com/docs
Apache License 2.0

Memory usage increasing with each permissions check call #1934

Open janekmichalik opened 2 weeks ago

janekmichalik commented 2 weeks ago

What platforms are affected?

linux

What architectures are affected?

amd64

What SpiceDB version are you using?

v1.30.0 and v1.33.0

Steps to Reproduce

Use SpiceDB for permission checks.

SpiceDB runtime configuration:

command:
  - spicedb
  - serve
  - --skip-release-check
  - --grpc-tls-cert-path=/etc/tls-secrets/tls.crt
  - --grpc-tls-key-path=/etc/tls-secrets/tls.key
  - --datastore-engine=postgres
  - --datastore-conn-uri=postgres://$(POSTGRES_USERNAME):$(POSTGRES_PASSWORD)@$(POSTGRES_HOST)/$(POSTGRES_DATABASE)?sslmode=$(POSTGRES_SSLMODE)
  - --telemetry-endpoint=
  - --datastore-conn-pool-read-max-open=20 
  - --datastore-conn-pool-write-max-open=10
  - --datastore-gc-interval=5m
  - --datastore-gc-window=1h
  - --dispatch-cache-metrics=true
  - --dispatch-cluster-cache-metrics=true
  - --dispatch-concurrency-limit=50
  - --dispatch-cache-num-counters=500
  - --dispatch-cluster-cache-max-cost=20%
  - --dispatch-cluster-cache-num-counters=5000
  - --log-level=debug

I have also tried the default configuration:

command:
  - spicedb
  - serve
  - --skip-release-check
  - --grpc-tls-cert-path=/etc/tls-secrets/tls.crt
  - --grpc-tls-key-path=/etc/tls-secrets/tls.key
  - --datastore-engine=postgres
  - --datastore-conn-uri=postgres://$(POSTGRES_USERNAME):$(POSTGRES_PASSWORD)@$(POSTGRES_HOST)/$(POSTGRES_DATABASE)?sslmode=$(POSTGRES_SSLMODE)
  - --telemetry-endpoint=

It is running within a k3s pod.

Expected Result

Memory should not keep increasing; it should be released over time.

Actual Result

Memory usage increases with each permission check during performance/load tests.

My pod has a 1GB memory limit, and when it reaches the limit, it starts throwing 403 errors.

It looks like the cache is not released.

ecordell commented 2 weeks ago

A few things could be going on here. Can you share more details of how you are generating load? That might give us more of a clue.

One thing to look at is the caches themselves. There are multiple caches in SpiceDB; the sum of their defaults plus the overhead of normal operation may be greater than the 1GB you have available during your load test. I see you have set --dispatch-cluster-cache-max-cost=20% but there is also --dispatch-cache-max-cost=20%.
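
For illustration only, a sketch of capping both caches explicitly in the pod command (the percentages are placeholders to tune, not recommendations):

  - --dispatch-cache-max-cost=10%
  - --dispatch-cluster-cache-max-cost=10%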

By default SpiceDB (grpc, really) spawns a goroutine per request. If you are sending a huge number of parallel requests, it may be spinning up too many goroutines and eating up your overhead. There are flags to control this as well (--grpc-max-workers and --dispatch-cluster-max-workers). Note that right now, SpiceDB itself doesn't have any sophisticated load shedding beyond what grpc provides, so it's still possible to blow memory up with sufficient load. If that's an operational concern, we recommend sticking a rate limiter in front of SpiceDB (at least until we have admission controls in place).
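
As a sketch only, those worker limits could be added alongside the cache flags above; the counts here are arbitrary starting points to tune against your observed parallelism, not recommendations:

  - --grpc-max-workers=100
  - --dispatch-cluster-max-workers=100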

It's also possible that something is interfering with connection management on the database side. Are you running e.g. pgbouncer?

If none of these are the issue, it may be that you are hitting some edge case in a specific request that is causing a true memory leak, so knowing what requests you are making would be very helpful. At first glance though I don't think this is a leak - it seems like we just need to do some right-sizing for your test.

janekmichalik commented 2 weeks ago

Thank you @ecordell for your quick response.

but there is also --dispatch-cache-max-cost=20%

I left it at the default value of 30%.

There are flags to control this as well (--grpc-max-workers and --dispatch-cluster-max-workers).

I am testing it right now. Once finished, I will be back with some details.

Are you running e.g. pgbouncer?

As far as I know, we do not.

janekmichalik commented 2 weeks ago

@ecordell

The problem still appears even with the limited amount of workers.

This is my current configuration:

    - spicedb
    - serve
    - --skip-release-check
    - --grpc-tls-cert-path=/etc/tls-secrets/tls.crt
    - --grpc-tls-key-path=/etc/tls-secrets/tls.key
    - --datastore-engine=postgres
    - --datastore-conn-uri=postgres://$(POSTGRES_USERNAME):$(POSTGRES_PASSWORD)@$(POSTGRES_HOST)/$(POSTGRES_DATABASE)?sslmode=$(POSTGRES_SSLMODE)
    - --telemetry-endpoint=
    - --datastore-conn-pool-read-max-open=20
    - --datastore-conn-pool-write-max-open=10
    - --datastore-gc-interval=5m
    - --datastore-gc-window=1h
    - --dispatch-cache-metrics=true
    - --dispatch-cluster-cache-metrics=true
    - --dispatch-concurrency-limit=50
    - --dispatch-cache-num-counters=500
    - --dispatch-cluster-cache-max-cost=20%
    - --dispatch-cluster-cache-num-counters=5000
    - --log-level=debug
    - --grpc-max-workers=1000
    - --dispatch-cluster-max-workers=1000

The memory usage is increasing during the load:

....
impt-spice-db-5cxtg                                  spicedb                   95m          655Mi
impt-spice-db-5cxtg                                  spicedb                   78m          912Mi
impt-spice-db-5cxtg                                  spicedb                   18m          820Mi
impt-spice-db-5cxtg                                  spicedb                   349m         954Mi

Also, with the current config the load test execution time has increased a lot.

I am generating the load by running e2e tests, which execute various scenarios against my application. Those tests run for about 20 hours. We check permissions for each request made to and within the application (an opa-istio sidecar intercepts the request and validates it against the policies in SpiceDB). The relevant Rego helper looks like this:

check_authorization_spicedb(spicedb_address, auth_token, resource_type, resource_id, permission, subject_type, subject_id) := response if {
    print("Checking", permission, "permission for", resource_type, "object", resource_id, "and", subject_type, "subject", subject_id)
    request := {
        "url": sprintf("http://%s:8443/v1/permissions/check", [spicedb_address]),
        "method": "POST",
        "headers": {
            "content-type": "application/json",
            "Authorization": concat(" ", ["Bearer", auth_token]),
        },
        "body": {
            "resource": {
                "objectType": resource_type,
                "objectId": resource_id,
            },
            "permission": permission,
            "subject": {"object": {
                "objectType": subject_type,
                "objectId": subject_id,
            }},
            "consistency": {"fullyConsistent": true},
        },
        "cache": true,
    }

    # The rest of the rule was omitted from the snippet; presumably it sends the
    # request with OPA's http.send built-in and returns the response, e.g.:
    response := http.send(request)
}

I have also attached the logs from SpiceDB.

spicedb.log

ecordell commented 2 weeks ago

What is the QPS of checks during your load test? You will need to size the node to handle the maximum number of parallel requests.

It also looks like you're using HTTP to call Check instead of gRPC; that has much higher overhead per call. Can you switch to gRPC?
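
For reference, a rough sketch of the same check over gRPC using grpcurl, assuming SpiceDB's default gRPC port of 50051 and placeholder object types/IDs; since your deployment terminates TLS on the gRPC port, you would supply the certificate rather than -plaintext:

grpcurl -plaintext \
  -H 'authorization: Bearer <token>' \
  -d '{
        "resource": {"objectType": "<resource_type>", "objectId": "<resource_id>"},
        "permission": "<permission>",
        "subject": {"object": {"objectType": "<subject_type>", "objectId": "<subject_id>"}},
        "consistency": {"fullyConsistent": true}
      }' \
  <spicedb-host>:50051 authzed.api.v1.PermissionsService/CheckPermission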

Do you have the ability to scrape and graph Prometheus metrics? We have a lot of metrics that can help diagnose this, including cache size metrics.
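
If you can only grab them by hand, a minimal sketch (this assumes the metrics endpoint is on its default address, :9090, via --metrics-addr; adjust if you have overridden it):

curl -s http://<spicedb-host>:9090/metrics | grep -i cache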

It's perfectly normal for memory to increase with Checks up to a certain point; that is expected as the caches fill. The caches will hit their limit, and then memory use on top of that will be dictated by the number of in-flight queries, which you control in your load test.