CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

pgbackrest_info OOM + 100% CPU #3293

Open lingwooc opened 2 years ago

lingwooc commented 2 years ago

Overview

There is an issue around the script /opt/crunchy/bin/postgres/pgbackrest_info.sh. I think this is related to monitoring. It is using over 3 GB of RAM and getting OOM-killed. Surprisingly, it is bash that is killed, not pgbackrest, and pgbackrest does not even seem to have started (although I may be mistaken).

The issue seems to be this line:

echo $(echo -n "$conf|" | tr '/' '_'; pgbackrest --output=json ${cmd_args[*]} info | tr -d '\n')

It causes bash to use a biblical amount of RAM and CPU while not really doing anything.

Running the commands manually in the container's bash console gives some insight. These commands work:

pgbackrest --output=json ${cmd_args[*]} info
pgbackrest --output=json ${cmd_args[*]} info | tr -d '\n'
echo -n "$conf|" | tr '/' '_'; pgbackrest --output=json ${cmd_args[*]} info | tr -d '\n'

This command is OOM-killed (and uses 100% CPU):

echo $(echo -n "$conf|" | tr '/' '_'; pgbackrest --output=json ${cmd_args[*]} info | tr -d '\n')

Maybe I'm missing something, but I don't see the value in the wrapping echo. I also can't see why it would have such an impact.

The troublesome environments do have quite a few incremental and differential backups (up to 690 differentials, with three times as many incrementals), but it's not pgbackrest itself that is the problem.

I should note that I'm not seeing this issue in all our environments, but that could just be because the others have enough RAM headroom by dumb luck.

I'm going to try to patch this script and see where I end up.
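For what it's worth, one possible patch (just an untested sketch; it assumes $conf and cmd_args are set earlier in the script and that the consumer only needs a single "<conf>|<json>" line) is to drop the wrapping command substitution so the JSON is streamed to stdout instead of being buffered and re-expanded by bash:

{
    # same inner pipeline as the original, just without the surrounding echo $(...)
    echo -n "$conf|" | tr '/' '_'
    pgbackrest --output=json ${cmd_args[*]} info | tr -d '\n'
    # end the combined line with a single newline, as the outer echo did
    echo
}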


lingwooc commented 2 years ago

Just linking to the troublesome file: https://github.com/CrunchyData/crunchy-containers/blob/master/bin/postgres_common/postgres/pgbackrest_info.sh
It also references /tmp/pgbackrest_env.sh, which does not exist.

andrewlecuyer commented 2 years ago

@lingwooc thanks for submitting this issue.

Looking at the PGO code where we enable the exporter using pgMonitor, you can actually see that this is something we'd like to revisit:

https://github.com/CrunchyData/postgres-operator/blob/2e18aef93dd2d6dee065ad00c959dc9fabc6da79/internal/pgmonitor/api.go#L45-L52

More specifically, ideally we'd like to align with the pgbackrest-info.sh script provided by pgMonitor:

https://github.com/CrunchyData/pgmonitor/blob/main/postgres_exporter/linux/pgbackrest-info.sh

jmckulk commented 2 years ago

Hello @lingwooc,

I have tried to replicate this in my local environment but have so far been unable to. I created a memory-limited PostgresCluster and ran a handful of backups. The echo command you mentioned ran without any issues.

Can you share the spec for one of the troublesome PostgresClusters? I'm interested to see what your backup schedules are set to. The fact that you have so many diff and incremental backups is abnormal. If you configure pgBackRest backup retention to expire some of those backups, do you still experience this issue?
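For reference, retention can be set through the pgBackRest options in the PostgresCluster spec. Something like the following (only a sketch; it assumes a cluster named "hippo" using repo1, and the values are arbitrary) would expire older full and differential backups via the spec.backups.pgbackrest.global options:

$ kubectl patch postgrescluster hippo --type merge \
    -p '{"spec":{"backups":{"pgbackrest":{"global":{"repo1-retention-full":"4","repo1-retention-diff":"14"}}}}}'

repo1-retention-full and repo1-retention-diff are standard pgBackRest settings that get passed straight through to the pgBackRest configuration.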

benjaminjb commented 2 years ago

Following up: since we haven't been able to replicate this, I'm closing out this issue, but if you have any new information, feel free to re-open it or create a new issue.

Pluggi commented 1 year ago

Hello,

We have just been bitten by this bug. We have backups running every couple of hours, totalling more than a thousand backups.

Here is some info:

$ pgbackrest info | grep -E '(incr|full)' | wc -l
1254
$ pgbackrest --output=json info | wc
      0       1  820446

$ dmesg -T | grep 'Killed process'
[Wed Apr  5 13:24:02 2023] Memory cgroup out of memory: Killed process 2364753 (pgbackrest_info) total-vm:899240kB, anon-rss:891316kB, file-rss:3068kB, shmem-rss:0kB, UID:26 pgtables:1792kB oom_score_adj:969
[Wed Apr  5 13:24:03 2023] Memory cgroup out of memory: Killed process 2364799 (pgbackrest_info) total-vm:905492kB, anon-rss:897388kB, file-rss:2992kB, shmem-rss:0kB, UID:26 pgtables:1808kB oom_score_adj:969
[Wed Apr  5 13:24:07 2023] Memory cgroup out of memory: Killed process 2364930 (pgbackrest_info) total-vm:903164kB, anon-rss:895272kB, file-rss:3012kB, shmem-rss:0kB, UID:26 pgtables:1796kB oom_score_adj:969
[Wed Apr  5 13:24:09 2023] Memory cgroup out of memory: Killed process 2365103 (pgbackrest_info) total-vm:899244kB, anon-rss:891312kB, file-rss:3056kB, shmem-rss:0kB, UID:26 pgtables:1792kB oom_score_adj:969
[Wed Apr  5 13:24:10 2023] Memory cgroup out of memory: Killed process 2365143 (pgbackrest_info) total-vm:898708kB, anon-rss:890784kB, file-rss:3008kB, shmem-rss:0kB, UID:26 pgtables:1792kB oom_score_adj:969
[Wed Apr  5 13:24:12 2023] Memory cgroup out of memory: Killed process 2365206 (pgbackrest_info) total-vm:897652kB, anon-rss:889732kB, file-rss:3008kB, shmem-rss:0kB, UID:26 pgtables:1788kB oom_score_adj:969
[Wed Apr  5 13:24:13 2023] Memory cgroup out of memory: Killed process 2365239 (pgbackrest_info) total-vm:897120kB, anon-rss:889204kB, file-rss:3032kB, shmem-rss:0kB, UID:26 pgtables:1792kB oom_score_adj:969
[Wed Apr  5 13:24:17 2023] Memory cgroup out of memory: Killed process 2365379 (pgbackrest_info) total-vm:894704kB, anon-rss:886828kB, file-rss:3156kB, shmem-rss:0kB, UID:26 pgtables:1792kB oom_score_adj:969
[Wed Apr  5 13:24:19 2023] Memory cgroup out of memory: Killed process 2365489 (pgbackrest_info) total-vm:893616kB, anon-rss:885768kB, file-rss:3076kB, shmem-rss:0kB, UID:26 pgtables:1788kB oom_score_adj:969
[Wed Apr  5 13:24:20 2023] Memory cgroup out of memory: Killed process 2365591 (pgbackrest_info) total-vm:952220kB, anon-rss:944380kB, file-rss:2988kB, shmem-rss:0kB, UID:26 pgtables:1900kB oom_score_adj:969
[Wed Apr  5 13:24:22 2023] Memory cgroup out of memory: Killed process 2365653 (pgbackrest_info) total-vm:952408kB, anon-rss:944640kB, file-rss:3032kB, shmem-rss:0kB, UID:26 pgtables:1896kB oom_score_adj:969
[Wed Apr  5 13:24:23 2023] Memory cgroup out of memory: Killed process 2365688 (pgbackrest_info) total-vm:952980kB, anon-rss:945168kB, file-rss:3080kB, shmem-rss:0kB, UID:26 pgtables:1900kB oom_score_adj:969
[Wed Apr  5 13:24:27 2023] Memory cgroup out of memory: Killed process 2365829 (pgbackrest_info) total-vm:951452kB, anon-rss:943584kB, file-rss:2932kB, shmem-rss:0kB, UID:26 pgtables:1896kB oom_score_adj:969
[Wed Apr  5 13:24:29 2023] Memory cgroup out of memory: Killed process 2365976 (pgbackrest_info) total-vm:952408kB, anon-rss:944376kB, file-rss:3112kB, shmem-rss:0kB, UID:26 pgtables:1904kB oom_score_adj:969
[Wed Apr  5 13:24:30 2023] Memory cgroup out of memory: Killed process 2366168 (pgbackrest_info) total-vm:951644kB, anon-rss:943588kB, file-rss:3028kB, shmem-rss:0kB, UID:26 pgtables:1900kB oom_score_adj:969
[Wed Apr  5 13:24:32 2023] Memory cgroup out of memory: Killed process 2366204 (pgbackrest_info) total-vm:952788kB, anon-rss:944904kB, file-rss:3028kB, shmem-rss:0kB, UID:26 pgtables:1892kB oom_score_adj:969
[Wed Apr  5 13:24:34 2023] Memory cgroup out of memory: Killed process 2366311 (pgbackrest_info) total-vm:952028kB, anon-rss:944112kB, file-rss:2996kB, shmem-rss:0kB, UID:26 pgtables:1892kB oom_score_adj:969
[Wed Apr  5 13:48:34 2023] Memory cgroup out of memory: Killed process 2438970 (pgbackrest_info) total-vm:953360kB, anon-rss:945436kB, file-rss:3256kB, shmem-rss:0kB, UID:26 pgtables:1900kB oom_score_adj:969
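If it helps to quantify it, the overhead of the wrapping command substitution could be compared directly with GNU time, assuming /usr/bin/time is installed in the container (the "Maximum resident set size" lines should show the difference between streaming the JSON and buffering it through echo $(...)):

$ /usr/bin/time -v bash -c 'pgbackrest --output=json info | tr -d "\n" > /dev/null'
$ /usr/bin/time -v bash -c 'echo $(pgbackrest --output=json info | tr -d "\n") > /dev/null'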
Pluggi commented 1 year ago

Can we re-open this issue?

Pluggi commented 1 year ago

Boop

Pluggi commented 1 year ago

I'd be happy to provide more details to investigate this :)

wmuldergov commented 3 months ago

I have also been running into this exact issue in 2 of our 4 environments. The two affected environments are prd/uat, which run backups more frequently than the dev/tst environments; those work as expected. I can provide more details if required!
