CrunchyData / crunchy-containers

Containers for Managing PostgreSQL on Kubernetes by Crunchy Data
https://www.crunchydata.com/
Apache License 2.0

kube-apiserver crashes during pgbackrest backups #1539

Open iohenkies opened 12 months ago

iohenkies commented 12 months ago

Hi all,

I originally posted this at https://github.com/pgbackrest/pgbackrest/issues/2118 but was advised to try here, so hopefully someone has an idea. :)

We've got an 85-node cluster running all sorts of workloads. Control plane and etcd nodes are separate from our workers and from each other. We also have dedicated node pools: one for Elasticsearch, for instance, a large pool of workers for all kinds of apps, and one for our Postgres databases. The Postgres pool has 11 nodes with 8 vCPUs and 32 GB of memory each.

At 2am and 6am about 60 pgbackrest backups are started. This often, but not always, crashes the kube-apiserver containers on our control plane nodes. This is very strange to us: why would pgbackrest put such a load on the apiserver? We tried to reproduce the issue by spawning 300 pods of another app at the same time, all calling the apiserver, and the kube-apiserver kept running. The crashes only seem to happen during these backups.

We have audit logging enabled on the kube-apiserver, and up until right before the crashes we don't see anything unusual. Then the apiserver gets too busy and crashes, so we probably can't capture the very end of the logs. The only thing that stands out in the pgbackrest logs is quite a lot of `apiserver was unable to write a JSON response: http: Handler timeout` errors, not only during the crashes but also during the day.

Now, we are no database experts, and our DBA colleague who led the Postgres setup is on long-term sick leave, so we're hoping to make use of the expertise here! Maybe there are settings that can be tweaked? Or can someone explain whether, and why, pgbackrest makes a lot of calls to the apiserver?

  1. pgBackRest version: pgBackRest 2.40

  2. PostgreSQL version: postgres (PostgreSQL) 14.5

  3. Operating system/version - if you have more than one server (for example, a database server, a repository host server, one or more standbys), please specify each: Kubernetes 1.24.10 on Ubuntu 20.04.5 LTS nodes

  4. Did you install pgBackRest from source or from a package? Installed on Kubernetes 1.24.10, running image registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.2.0-0

  5. Please attach the following as applicable: pgbackrest conf

    
    bash-4.4$ cat pgbackrest_instance.conf
    # Generated by postgres-operator. DO NOT EDIT.
    # Your changes will not be saved.

    [global]
    buffer-size = 2MiB
    compress-type = lz4
    log-path = /pgdata/pgbackrest/log
    process-max = 2
    repo1-path = /pgbackrest/grafana/grafana
    repo1-retention-full = 2
    repo1-retention-full-type = time
    repo1-s3-bucket = npo
    repo1-s3-endpoint = storagegrid.s3.ourdomain.com
    repo1-s3-port = 443
    repo1-s3-region = NL-AER-1
    repo1-s3-uri-style = path
    repo1-storage-ca-file = /etc/pgbackrest/conf.d/root.pem
    repo1-storage-verify-tls = y
    repo1-type = s3

    [db]
    pg1-path = /pgdata/pg14
    pg1-port = 5432
    pg1-socket-path = /tmp/postgres

**Backup command**

    bash -ceu --
    shopt -s globstar
    files=(/etc/pgbackrest/conf.d/**)
    for i in "${!files[@]}"; do
      [[ -f "${files[$i]}" ]] || unset -v "files[$i]"
    done
    declare -r hash="$1" local_hash="$(sha1sum "${files[@]}" | sha1sum)"
    if [[ "${local_hash}" != "${hash}" ]]; then
      printf >&2 "hash %s does not match local hash %s" "${hash}" "${local_hash}"; exit 1;
    else
      pgbackrest backup --stanza=db --repo=1 --type=incr
    fi
    - 725c12672026deac030f95c75a5abee7186e180a -
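As far as I can tell, the wrapper only starts the backup when a checksum over the files in `/etc/pgbackrest/conf.d` matches the value passed in as `$1`. To recompute that checksum by hand inside a pod, here is a minimal sketch that reuses the same `sha1sum` pipeline. The function name `conf_hash` and its directory argument are my own additions for illustration, not part of the operator, and it needs bash 4 or newer for `globstar`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of postgres-operator): recompute the
# checksum that the backup wrapper compares against its first argument.
conf_hash() {
  local dir="$1" files i
  shopt -s globstar               # let ** recurse into subdirectories
  files=("$dir"/**)
  # Keep regular files only (symlinks to regular files count as well).
  for i in "${!files[@]}"; do
    [[ -f "${files[$i]}" ]] || unset -v "files[$i]"
  done
  # Without this guard sha1sum would sit waiting on stdin.
  (( ${#files[@]} > 0 )) || { echo "no files under ${dir}" >&2; return 1; }
  # Hash every file, then hash the resulting list of hashes and paths.
  sha1sum "${files[@]}" | sha1sum
}
```

Running `conf_hash /etc/pgbackrest/conf.d` should print the value to compare against the hash at the end of the command above. The file paths are part of the hashed input, so the directory has to be given exactly as the wrapper sees it.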

**Errors in log**

apiserver was unable to write a JSON response: http: Handler timeout
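
To check whether these errors cluster around the 2am and 6am backup windows, the matching log lines can be bucketed per hour. This is a minimal sketch, assuming every log line starts with a timestamp such as `2023-07-01 02:00:05` (I have not verified that format for every log source). `count_timeouts_per_hour` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count "Handler timeout" errors per hour from log lines on stdin.
# ASSUMPTION: each line starts with a timestamp whose first 13 characters
# are the date and the hour, e.g. "2023-07-01 02".
count_timeouts_per_hour() {
  awk '
    index($0, "http: Handler timeout") > 0 {
      counts[substr($0, 1, 13)]++    # bucket by date plus hour
    }
    END {
      for (hour in counts) print hour ":00", counts[hour]
    }
  ' | sort
}
```

Feed it a log on stdin, for example `count_timeouts_per_hour < some.log`, where `some.log` is a placeholder for whichever log file contains the errors. Each output line is the hour bucket followed by the number of timeouts seen in that hour.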