CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Rate of S3 transactions for pgBackRest backups increase over time #3960

Open JJGadgets opened 1 month ago

JJGadgets commented 1 month ago

Please ensure you do the following when reporting a bug:

Note that some logs needed to troubleshoot may be found in the /pgdata/<CLUSTER-NAME>/pg_log directory on your Postgres instance.

An incomplete bug report can lead to delays in resolving the issue or the closing of a ticket, so please be as detailed as possible.

If you are looking for general support, please view the support page for where you can ask questions.

Thanks for reporting the issue, we're looking forward to helping you!

Overview

I have observed that after a PostgresCluster has been applied and running on my Kubernetes homelab cluster for a while, my pgBackRest bucket on Cloudflare R2 consumes more and more transactions each month. More context in the Additional Information section below.

Environment

Please provide the following details:

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

  1. Apply PostgresCluster with pgBackRest repos pointed to an R2 bucket.
  2. Set up R2 transaction count alerts.
  3. Leave the PostgresCluster running for a few months at roughly constant load across the whole duration.
  4. Check R2 transaction count alerts and check the date within the month of each alert.

EXPECTED

  1. Transaction count and rate remain constant month over month, given roughly the same volume of database operations each month.
  2. Stay within R2 free tier for transaction count.

ACTUAL

  1. R2 transaction count goes past the free tier.
  2. The rate of transactions increases as the PostgresCluster ages.

Logs

I am unsure which logs would be relevant to this issue. Advice on which logs to drill down into would be helpful.

The R2 dashboard only shows transaction counts for up to a week without upgrading the account plan, and this issue's timeframe is measured in months, not days or weeks.

Additional Information

R2: Class A operations are mainly writes (uploads, lists, deletes); Class B operations are mainly reads (downloads). Class A has a free tier of 1 million transactions per month; Class B's free tier is 10 million.

My R2 transaction count alerts trigger earlier in each month as the PostgresCluster ages, suggesting that the rate of transactions increases as the months go by. I have the alerts in a Discord channel and can screenshot them if that would be helpful, but I doubt it.
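For illustration, the alert pattern can be sketched numerically: with a fixed monthly free tier, a growing daily request rate means the threshold is crossed earlier each month. This is a minimal sketch, not the operator's behavior; the rates below are made up, not measured.

```python
# Cloudflare R2 Class A free tier: 1,000,000 requests per month.
FREE_TIER_CLASS_A = 1_000_000

def day_free_tier_exhausted(requests_per_day: int,
                            free_tier: int = FREE_TIER_CLASS_A) -> float:
    """Day of the month on which a flat request rate uses up the free tier."""
    return free_tier / requests_per_day

# Illustrative (made-up) rates for three consecutive months: as the rate
# grows, the exhaustion day moves earlier, matching the alert pattern.
for rate in (30_000, 45_000, 70_000):
    print(f"{rate:>6}/day -> free tier exhausted on day "
          f"{day_free_tier_exhausted(rate):.1f}")
```

A value past the end of the month means the free tier was never exceeded that month.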

Because of this issue, I have wiped and restored the PostgresCluster using the pgBackRest dataSource multiple times, which brings the transaction rate back down; after 2 months or so I start exceeding the R2 free tier again and have to wipe and restore once more. This cycle has repeated at least 3 times.

PostgresCluster resource manifest (managed by FluxCD GitOps): https://github.com/JJGadgets/Biohazard/blob/4035c729132335ed4bab1ca4010c029a6db1c338/kube/deploy/core/db/pg/clusters/template/crunchy.yaml#L48-L143

JJGadgets commented 1 month ago

@joryirving and @drewburr are also experiencing similar issues; we've discussed this and couldn't come up with a reason or solution.

tjmoore4 commented 1 month ago

@JJGadgets Thanks for the detailed explanation. A couple of suggestions.

The first would be to look closely at the pgBackRest logs. Based on your linked cluster manifest, you don't have a repo host Pod enabled, so the relevant logs should be located in /pgdata/pgbackrest/log on your primary Pod. Those logs may give you a clue as to what might be happening. You could also increase your logging detail to give more information.
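For reference, raising pgBackRest's file log level is done through the `global` options in the PostgresCluster spec. This is a minimal sketch with the surrounding manifest abbreviated; `log-level-file` is a standard pgBackRest option.

```yaml
spec:
  backups:
    pgbackrest:
      global:
        # Write more verbose pgBackRest logs to /pgdata/pgbackrest/log.
        # Valid levels include: off, error, warn, info, detail, debug, trace.
        log-level-file: detail
```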

I also noticed your R2 repo (repo2) configuration is set to take full and incremental backups on a schedule, but your retention settings only cover full and differential backups. Perhaps adding a retention policy that accounts for the incremental backups might help in this case. Hope these suggestions help!
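As a reference point for tuning retention, the relevant `global` options in a PGO v5 manifest look like the sketch below. The values are placeholders; `repo2-retention-full` and `repo2-retention-diff` are pgBackRest's documented retention option names (pgBackRest expires incremental backups along with the full or differential backup they depend on).

```yaml
spec:
  backups:
    pgbackrest:
      global:
        # Keep the two most recent full backups (count-based retention);
        # expired fulls take their dependent diff/incr backups with them.
        repo2-retention-full: "2"
        repo2-retention-full-type: count
        # Also bound the number of differential backups kept per full.
        repo2-retention-diff: "3"
      repos:
        - name: repo2
          schedules:
            # Placeholder cron schedules: weekly full, daily incremental.
            full: "0 1 * * 0"
            incremental: "0 1 * * 1-6"
```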