CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Core dumps filling up PGDATA directory #2532

Open · opened by james-callahan 3 years ago

james-callahan commented 3 years ago

Overview

My data volume appears to be filling up with "core.$N" files.

Environment

Please provide the following details:

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

  1. Make postgres crash (in my case I think it was #2531)
  2. Look in your /pgdata/pg13/ folder: notice a core.$N file.

EXPECTED

Any files not required for operation of postgres or holding my data should not be kept on a persistent volume.

ACTUAL

core.N files are created in PGDATA and never cleaned up.

Logs

$ ls -la /pgdata/pg13/
total 33400
drwx--S--- 19 postgres postgres     4096 Jul 12 12:22 .
drwxrwsr-x  4 root     postgres     4096 Jul 12 12:22 ..
drwxrws---  7 postgres postgres     4096 Jul 12 05:20 base
-rw-rw----  1 postgres postgres  2215936 Jul 12 06:08 core.167
-rw-rw----  1 postgres postgres 11419648 Jul 12 10:58 core.22452
-rw-rw----  1 postgres postgres 21495808 Jul 12 12:08 core.22861
-rw-rw----  1 postgres postgres        0 Jul 12 10:58 core.3691
-rw-rw----  1 postgres postgres        0 Jul 12 06:08 core.3786
-rw-rw----  1 postgres postgres        0 Jul 12 06:08 core.3787
-rw-rw----  1 postgres postgres  5660672 Jul 12 04:14 core.5021
-rw-------  1 postgres postgres       30 Jul 12 12:22 current_logfiles
drwxrws---  2 postgres postgres    12288 Jul 12 12:23 global
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 log
-rw-rw----  1 postgres postgres      874 Jul 12 04:27 patroni.dynamic.json
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_commit_ts
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_dynshmem
-rw-rw----  1 postgres postgres      327 Jul 12 12:22 pg_hba.conf
-rw-rw----  1 postgres postgres      327 Jul 12 12:22 pg_hba.conf.backup
-rw-rw----  1 postgres postgres     1636 Jul 12 02:54 pg_ident.conf
-rw-rw----  1 postgres postgres     1636 Jul 12 12:22 pg_ident.conf.backup
drwxrws---  4 postgres postgres     4096 Jul 12 12:22 pg_logical
drwxrws---  4 postgres postgres     4096 Jul 12 02:54 pg_multixact
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_notify
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_replslot
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_serial
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_snapshots
drwxrws---  2 postgres postgres     4096 Jul 12 12:22 pg_stat
drwxrws---  2 postgres postgres     4096 Jul 12 12:28 pg_stat_tmp
drwxrws---  2 postgres postgres     4096 Jul 12 12:27 pg_subtrans
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_tblspc
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_twophase
-rw-rw----  1 postgres postgres        3 Jul 12 02:54 PG_VERSION
lrwxrwxrwx  1 postgres postgres       15 Jul 12 12:22 pg_wal -> /pgwal/pg13_wal
drwxrws---  2 postgres postgres     4096 Jul 12 02:54 pg_xact
-rw-rw----  1 postgres postgres       88 Jul 12 02:54 postgresql.auto.conf
-rw-rw----  1 postgres postgres    28000 Jul 12 02:54 postgresql.base.conf
-rw-rw----  1 postgres postgres    28000 Jul 12 12:22 postgresql.base.conf.backup
-rw-rw-r--  1 postgres postgres     1123 Jul 12 12:22 postgresql.conf
-rw-rw-r--  1 postgres postgres     1123 Jul 12 12:22 postgresql.conf.backup
-rw-rw----  1 postgres postgres      418 Jul 12 12:22 postmaster.opts
-rw-------  1 postgres postgres       77 Jul 12 12:22 postmaster.pid
$ file /pgdata/pg13/core.167
/pgdata/pg13/core.167: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from 'postgres: grafana-ha: grafana grafana 10.20.39.127(44862) COMMIT', real uid: 26, effective uid: 26, real gid: 26, effective gid: 26
jkatz commented 3 years ago

I believe this is a Postgres feature. If Postgres hits a SIGABRT (and I believe on a PANIC as well) it will dump a core file.

Perhaps PGO could have some sort of opt-in that does a sweep for these, so if one does not want core dumps we can clean them up. It feels a bit risky to me, though.
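
Not an existing PGO feature, just an illustration of what such an opt-in sweep might boil down to (the directory and the 7-day retention are assumptions for the example):

# Hypothetical cleanup sweep: remove week-old core files from the data directory.
# The path and retention period are illustrative, not a PGO setting.
find /pgdata/pg13 -maxdepth 1 -type f -name 'core.*' -mtime +7 -delete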

james-callahan commented 3 years ago

I believe this is a Postgres feature. If Postgres hits a SIGABRT (and I believe on a PANIC as well) it will dump a core file.

Inside my database container:

$ cat /proc/sys/kernel/core_pattern
core

That must be the default on Amazon Linux.

I believe this results in core files being written to the CWD of the process that crashes. In the case of postgres, that is /pgdata/pg13, which is why they end up in there.
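
For what it's worth, a quick way to confirm that from inside the container is to inspect the postmaster via /proc; a minimal sketch, assuming the postmaster.pid path shown in the listing above:

# Hypothetical check: where core dumps will land and whether they are enabled.
PG_PID=$(head -1 /pgdata/pg13/postmaster.pid)   # first line of postmaster.pid is the postmaster PID
readlink /proc/$PG_PID/cwd                      # expected: /pgdata/pg13, where a plain "core" pattern dumps
grep "core file size" /proc/$PG_PID/limits      # the postmaster's effective RLIMIT_CORE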


Perhaps PGO could have some sort of opt-in that does a sweep for these, so if one does not want core dumps we can clean them up. It feels a bit risky to me, though.

A better solution might be a ulimit -c 0 at the right point?

gricey432 commented 1 year ago

This hit me a few times over the last few weeks.

I got a PANIC: could not locate a valid checkpoint record. I fixed this, but in the interim all my nodes filled up their disks with core dumps, and now the "fixed" cluster can't start up at all to accept the fix.

IMO the real danger of this issue is that it inflates the RTO of other issues, and the human intervention needed is error-prone during a potentially high-stress outage.

andrewlecuyer commented 1 year ago

@gricey432 I am reviewing some older issues, and see that you recently provided an update on this one.

Do you have any thoughts on desired behavior here? E.g. PGO cleaning up core dumps, preventing them in the first place, etc.

Thanks!

gricey432 commented 1 year ago

@andrewlecuyer this happened to us again the other day; it was quite the process to clean them up by hand across every PVC for every pod. Personally I have no use for the core dumps and wouldn't miss them if they were never written in the first place. CD support has never asked me for them so far either.

As they can get quite large, my only worry with automatically cleaning them up is that if a core dump takes up the whole remaining disk it could block the clean up process from being able to start at all. If possible, preventing them in the first place would be more desirable for me.

I think that should be easier too; ulimit -c 0 should stop them, but I don't think I currently have enough direct control over PGO's containers to turn it on.

Cheers

momiji commented 1 year ago

I have exactly the same issue; our disks filled up with nearly 1 TB of core files before we figured out what was happening, which is how I finally ended up here.

I just found an idea in the CRD documentation here: https://access.crunchydata.com/documentation/postgres-operator/v5/references/crd/#postgresclusterspecinstancesindexcontainersindex

This would look like:

spec:
  instances:
  - command:
    - /bin/sh
    - -c
    - ulimit -c 0 && patroni /etc/patroni
  ...

I don't have time to test this immediately, but I'll probably do it this week.

Regards

momiji commented 1 year ago

OK, forget about it; this only applies to the sub-item containers, which is not what is wanted :( I guess my next move is to alter the Docker image used...

So, after several attempts, it works; however, it is far from easy to do...

Ideas that didn't work

The good idea

The idea is to replace .spec.image in the PostgresCluster definition with an alternative image built specifically to change the ulimit.

To build it, I changed 2 things: I added a patroni.sh wrapper installed as /sg/patroni, and I prepended /sg to the image's PATH.

This way, the patroni that is started as the entrypoint is our /sg/patroni instead of /usr/local/bin/patroni.

Customized patroni.sh:

#!/bin/bash -x
id                                # log which user we are running as
ulimit -c 0                       # disable core dumps for this process and its children
ulimit -a                         # log the effective limits for the record
exec /usr/local/bin/patroni "$@"  # hand off to the real patroni

Dockerfile:

FROM registry-local.kube-system.svc/sas-src/viya-4-x64_oci_linux_2-docker/sas-crunchy5-postgres:1.1.2-20230407.1680894190141
# Put /sg first on the PATH so our wrapper shadows /usr/local/bin/patroni
ENV PATH=/sg:/usr/pgsql-12/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
COPY patroni.sh /sg/patroni
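
Then the cluster just needs to be pointed at the new image via .spec.image. A minimal sketch; the registry, image name, and tag below are placeholders, not the ones used above:

# Build and push the customized image somewhere the cluster can pull from.
docker build -t registry.example.com/custom/crunchy-postgres-nocore:latest .
docker push registry.example.com/custom/crunchy-postgres-nocore:latest

# Reference it from the PostgresCluster manifest.
spec:
  image: registry.example.com/custom/crunchy-postgres-nocore:latest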

How to test

To verify it is working:
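
For example, one can confirm that the running postmaster's core limit is now zero (the pod name below is a placeholder; the database container in a PGO v5 instance pod is named database):

# The "Max core file size" line should show 0 once the wrapper is in effect.
kubectl exec <cluster-instance-pod> -c database -- \
  bash -c 'grep "core file size" /proc/$(head -1 /pgdata/pg13/postmaster.pid)/limits'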