PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License

Unexpectedly High CPU Usage during Load test #2490

Open bhupixb opened 1 year ago

bhupixb commented 1 year ago

Environment

Description of issue

We are doing some load tests on PostgREST. PostgREST runs on Kubernetes with 2 replicas and connects to an AWS-managed PostgreSQL RDS instance. During the load testing we see that, of the 2 pods running, the CPU of 1 pod spikes very high. In the screenshot below, the CPU usage of one pod went to almost 6 cores. (screenshot: per-pod CPU usage)

Initially we suspected that one pod might be serving more requests than the other, but the metrics show both pods serving almost the same number of requests per second (QPS). (screenshot: per-pod QPS)

This happens every time we run the load test, even if we scale to more than 2 pods, say 5: 1 or 2 pods show the same behaviour while the rest keep using a normal amount of CPU.

Expected Behaviour

Since both pods serve the same number of requests per second, the difference in CPU usage should not be this large.

Actual Behaviour

CPU usage for a few pods is quite high compared to the others, despite them serving almost the same amount of traffic.

Pod config for postgrest

requests:
  cpu: 1.5
  mem: 1Gi
limits:
  cpu: 3
  mem: 2Gi
replicas: 2

Postgrest Configuration:

  PGRST_DB_POOL: "10"
  GHCRTS: "-qg -I0" # default in v9.0.1

Sample Deployment file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgrest
spec:
  replicas: 2
  selector:
    matchLabels:
      test: test
  template:
    metadata:
      labels:
        test: test
    spec:
      containers:
        - env:
            - name: PGRST_DB_URI
              value: "<REDACETED>"
            - name: PGRST_DB_SCHEMA
              value: public
            - name: PGRST_JWT_SECRET
              value: <REDACTED>
            - name: GHCRTS
              value: '-qg -I0'
            - name: PGRST_DB_ANON_ROLE
              value: <REDACTED>
            - name: PGRST_DB_POOL
              value: '10'
            - name: PGRST_DB_POOL_TIMEOUT
              value: '300'
          image: postgrest/postgrest:v9.0.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /postgrest_health
              port: http-postgrest
          name: postgrest
          ports:
            - containerPort: 3000
              name: http-postgrest
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /postgrest_health
              port: http-postgrest
          resources:
            limits:
              cpu: 3
              memory: 2Gi
            requests:
              cpu: 1.5
              memory: 1Gi

Any suggestion from the community on what things can be checked to find the root cause will be appreciated. Please let me know if I need to share any other info.

wolfgangwalther commented 1 year ago

It seems like this only happens a few minutes after the loadtest starts?

What happens if you run the loadtest longer, let's say twice as long? Is CPU usage going down again or only when the loadtest stops?

Did you run the same test against v10? We did some work on connection handling, especially when running out of file descriptors and other resources... maybe the backlog of unserved requests is just growing huge after a certain amount of time, because the load generator creates more requests than are actually handled?

Also: Where is the throughput in requests measured? At the load generating side? Are you sure the load is actually balanced between the two pods properly?

steve-chavez commented 1 year ago

I'd also suggest running a newer version. Even in v9.0.1 we added some performance fixes (like https://github.com/PostgREST/postgrest/issues/2294).

bhupixb commented 1 year ago

First of all thank you for the quick response.

It seems like this only happens a few minutes after the loadtest starts?

Not always at the start, but it almost always happens at some point during the load test, more often near the start and sometimes a few minutes in.

Is CPU usage going down again or only when the load test stops?

The CPU spike only lasts for some time, 3-4 minutes; then HPA scales up replicas based on CPU, new pods come up, and CPU usage comes back to normal.

Did you run the same test against v10?

No, we have not. Will try that as well.

Where is the throughput in requests measured?

The screenshot I attached for QPS is measured on PostgREST pod(s), so we are 100% sure that it's not a load-balancing issue.

@steve-chavez Regarding the perf fixes in v9.0.1: we have already disabled the parallel & idle GC on our current version, v9.0.0, by setting the env var GHCRTS: "-qg -I0" in PostgREST. It helped us reduce overall CPU by almost 50% but didn't help with the CPU spikes.

bhupixb commented 1 year ago

Any suggestions here? Thank you

steve-chavez commented 1 year ago

@bhupixb Try the latest pre-release and set the new db-pool-acquisition-timeout = 10 (or a shorter value); I can confirm that lowers CPU usage under heavy traffic.
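
For illustration, with the Deployment above this would be one more entry in the container's env list. The variable name assumes the usual PGRST_ prefix mapping of config option names; check the configuration docs for the release you deploy:

            - name: PGRST_DB_POOL_ACQUISITION_TIMEOUT
              value: '10' # seconds a request may wait for a pool connection before PostgREST gives up with a 504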

steve-chavez commented 1 year ago

@bhupixb Did the new timeout help? Do you have new results you can share?

bhupixb commented 1 year ago

Hi @steve-chavez, no, unfortunately. I ran 2 tests with an acquisition timeout of 2 and PostgREST 10.1.0. Both tests run the same query on an indexed column of a table, SELECT * FROM table WHERE column=1;, with 2 PostgREST pods running. Test 1:

DB Pool size: 10

I tested this at 500 RPS and then increased the RPS to 600 to cause 504s. As soon as the first 504 was received (at 20:00, see the bottom-right panel "5XX QPS by service"), I decreased the RPS from 600 to ~300 (see the QPS panel). The CPU spiked suddenly and remained high for some time even though the RPS was reduced. Out of 234092 total requests, 222071 returned 200 OK and 12021 returned 504 Gateway Timeout. It's good that we now explicitly get 504s, so we know we need to increase the DB connection pool size to handle our load, but the tremendous increase in CPU usage is a bit concerning. The CPU usage, RPS & memory usage graphs are attached below. (screenshot: CPU, RPS and memory usage)

Test 2:

DB Pool size: 50(per pod).

Didn't see a single 504 Gateway Timeout. The resource usage graphs are attached below. (screenshot: CPU, RPS and memory usage)

bhupixb commented 1 year ago

Update: we found a workaround by giving each pod a sufficiently large DB connection pool (and setting the idle connection timeout to 5 minutes), so that PostgREST doesn't run out of DB connections under spiky load. Since then the issue has not occurred.
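
Concretely, the workaround corresponds to pool settings along these lines (values are illustrative; on the v10 line the idle timeout is db-pool-timeout, which later releases rename to db-pool-max-idletime):

            - name: PGRST_DB_POOL
              value: '50' # per pod; keep replicas x pool size below Postgres max_connections
            - name: PGRST_DB_POOL_TIMEOUT
              value: '300' # drop idle pool connections after 5 minutes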

My theory: one thing we noticed that consistently happens is that CPU usage goes unexpectedly high when there are not enough DB connections available to serve the current requests. I am not sure if this is a bug on the HTTP server side, where queued requests are not handled correctly, or if it is expected behaviour.

steve-chavez commented 1 year ago

@bhupixb Sorry for the late reply, thanks for the detailed report.

One thing we noticed that consistently happens is that CPU usage goes unexpectedly high when there are not enough DB connections available to serve the current requests. I am not sure if this is a bug on the HTTP server side, where queued requests are not handled correctly, or if it is expected behaviour.

I believe this happens because we're still doing work (parsing the querystring, searching the PostgREST schema cache, etc.) when it doesn't make sense to do so, because there are no available connections in the pool.

I'm not sure this is the correct way to handle it, but one option to avoid the unnecessary work could be that once a number of 504s happen, we enter a temporary state where we quickly reject new requests.


pgbouncer has server_login_retry, which kinda looks related in intention:

If login to the server failed, because of failure to connect or from authentication, the pooler waits this much before retrying to connect. During the waiting interval, new clients trying to connect to the failing server will get an error immediately without another connection attempt. [seconds]

The purpose of this behavior is that clients don’t unnecessarily queue up waiting for a server connection to become available if the server is not working. However, it also means that if a server is momentarily failing, for example during a restart or if the configuration was erroneous, then it will take at least this long until the pooler will consider connecting to it again

Default: 15.0
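
For illustration only, and not PostgREST's actual code: the "quickly reject new requests" idea could be sketched as a WAI middleware (rejectWhenPoolExhausted is a made-up name for this sketch). After the first pool-timeout response it sheds load for a cool-down window, similar in spirit to server_login_retry:

{-# LANGUAGE OverloadedStrings #-}
-- Hypothetical sketch, not PostgREST internals: once the pool has timed out,
-- stop parsing/planning requests that would fail anyway and reject them fast.
module LoadShed (rejectWhenPoolExhausted) where

import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Time.Clock (UTCTime, diffUTCTime, getCurrentTime)
import Network.HTTP.Types (status503, status504)
import Network.Wai (Middleware, responseLBS, responseStatus)

-- Once a 504 (pool acquisition timeout) is observed, answer new requests
-- with 503 for 'coolDown' seconds instead of doing any real work.
rejectWhenPoolExhausted :: Double -> IO Middleware
rejectWhenPoolExhausted coolDown = do
  lastTimeout <- newIORef (Nothing :: Maybe UTCTime)
  pure $ \app req respond -> do
    now     <- getCurrentTime
    tripped <- readIORef lastTimeout
    case tripped of
      Just t | realToFrac (diffUTCTime now t) < coolDown ->
        -- Inside the cool-down window: fail fast without touching the pool.
        respond $ responseLBS status503 [] "connection pool exhausted, retry later"
      _ ->
        app req $ \res -> do
          -- Remember when the pool last ran dry so following requests are shed.
          if responseStatus res == status504
            then writeIORef lastTimeout (Just now)
            else pure ()
          respond res

Wiring it up with Warp would be something like mw <- rejectWhenPoolExhausted 15 followed by run 3000 (mw app); a real implementation would probably trip on a threshold of 504s rather than a single one, as suggested above.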

josethz00 commented 4 months ago

Having similar issues with the latest postgrest Docker image. I am running a Gatling load test on Kubuntu 22.0 with an Intel i7 and 16 GB RAM.

After ~1 minute of load testing, I get a "too many connections" error.

steve-chavez commented 4 months ago

Maybe try setting https://postgrest.org/en/stable/references/configuration.html#jwt-cache-max-lifetime to 3600 and see if it reduces CPU usage.

After ~1 minute of load testing, I get a "too many connections" error.

That should only happen if you set db-pool to surpass postgres max_connections.
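
For reference, both suggestions map to environment variables roughly as below, assuming the standard PGRST_ mapping and a recent PostgREST release that supports the JWT cache (values are only examples):

  PGRST_JWT_CACHE_MAX_LIFETIME: "3600" # cache validated JWTs for up to an hour instead of re-verifying on every request
  PGRST_DB_POOL: "10" # keep pool size x number of instances below Postgres max_connections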