james-callahan opened this issue 3 years ago
I believe this is a Postgres feature. If Postgres hits a SIGABRT (and I believe on a PANIC as well) it will dump a core file.
Inside my database container:
$ cat /proc/sys/kernel/core_pattern
core
That must be the default on Amazon Linux.
I believe this results in core files being written to the CWD of the process that crashes. In the case of postgres, that is /pgdata/pg13, which is why they end up there.
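A quick way to confirm this from inside the database container (a rough sketch; the /pgdata/pg13 path matches the setup above, adjust for other versions):

# A bare "core" pattern means the dump is written to the crashing process's CWD.
cat /proc/sys/kernel/core_pattern
# The postmaster's CWD is its data directory; the first line of postmaster.pid is its PID.
PGPID=$(head -1 /pgdata/pg13/postmaster.pid)
readlink "/proc/${PGPID}/cwd"                  # expect /pgdata/pg13
grep 'core file size' "/proc/${PGPID}/limits"  # the effective core limit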
Perhaps PGO could have some sort of opt-in that does a sweep for this, so that if one does not want core dumps we can clean them up. It feels a bit risky to me.
A better solution might be a ulimit -c 0 at the right point?
This hit me a few times over the last few weeks.
I got a PANIC: could not locate a valid checkpoint record. I fixed this, but in the interim all my nodes filled up their disks with core dumps, and now the "fixed" cluster can't start up at all to accept the fix.
IMO the real danger of this issue is that it inflates the RTO of other issues, and the human intervention needed is error-prone during a potentially high-stress outage.
@gricey432 I am reviewing some older issues, and see that you recently provided an update on this one.
Do you have any thoughts on desired behavior here? E.g. PGO cleaning up core dumps, preventing them in the first place, etc.
Thanks!
@andrewlecuyer this happened to us again the other day, was quite the process to clean them up by hand across every PVC for every pod. Personally I have no use for the core dumps and wouldn't miss them if they were never written in the first place. CD support has never asked me for them so far either.
As they can get quite large, my only worry with automatically cleaning them up is that if a core dump takes up the whole remaining disk it could block the clean up process from being able to start at all. If possible, preventing them in the first place would be more desirable for me.
I think that should be easier too; a ulimit -c 0 should stop them, but I don't think I currently have enough direct control over PGO's containers to turn it on.
Cheers
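For reference, a sweep like the by-hand cleanup described above could look roughly like the sketch below; the label selector, the container name "database", the /pgdata path, and the cluster name "hippo" are assumptions about a typical PGO v5 deployment:

# Remove core.* files from every pod of the cluster, one PVC at a time.
for pod in $(kubectl get pods -o name \
    -l postgres-operator.crunchydata.com/cluster=hippo); do
  kubectl exec "$pod" -c database -- \
    find /pgdata -maxdepth 2 -type f -name 'core.*' -delete
done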
I have exactly the same issue: disks filled up with nearly 1 TB of core files before we figured out what was happening, which is how I finally got here.
I just found an idea in the CRD documentation here: https://access.crunchydata.com/documentation/postgres-operator/v5/references/crd/#postgresclusterspecinstancesindexcontainersindex
command in the PostgresCluster object is not set on my first instance (I only have one instance in my pods), so the container currently starts with [ "patroni", "/etc/patroni" ]; replacing that with [ "/bin/sh", "-c", "ulimit -c 0 && patroni /etc/patroni" ] could work. This would look like:
spec:
  instances:
  - command:
    - /bin/sh
    - -c
    - ulimit -c 0 && patroni /etc/patroni
    ...
I don't have the time immediately to test this, but I'll probably do it this week
Regards
OK, forget about it: this only applies to the sub-item containers, which is not what is wanted :( I guess my next move is to alter the docker image used...
So after several attempts, it works; however, it is far from easy to do...
Ideas that didn't work
The good idea
The idea is to replace .spec.image from the PostgresCluster definition with an alternative image built specifically to change the ulimit.
To build it, I changed 2 things:
- a /sg folder
- a patroni executable file that needs to be placed in /sg/patroni
This way, the patroni that is started as the entrypoint is our /sg/patroni instead of /usr/local/bin/patroni.
Customized patroni.sh:
#!/bin/bash -x
id                                # log the user we run as
ulimit -c 0                       # disable core dumps for this process and its children
ulimit -a                         # log the effective limits
exec /usr/local/bin/patroni "$@"  # hand over to the real patroni
Dockerfile:
FROM registry-local.kube-system.svc/sas-src/viya-4-x64_oci_linux_2-docker/sas-crunchy5-postgres:1.1.2-20230407.1680894190141
# Put /sg first on the PATH so our wrapper shadows /usr/local/bin/patroni
ENV PATH=/sg:/usr/pgsql-12/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
COPY patroni.sh /sg/patroni
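For completeness, building the image and pointing the cluster at it might look roughly like this; the registry, tag, and cluster name are placeholders, not the real values from that environment:

# Build and push the wrapper image.
docker build -t registry.example.com/crunchy-postgres-ulimit:latest .
docker push registry.example.com/crunchy-postgres-ulimit:latest
# Point .spec.image of the PostgresCluster at it.
kubectl patch postgrescluster my-cluster --type merge \
  -p '{"spec":{"image":"registry.example.com/crunchy-postgres-ulimit:latest"}}'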
How to test
To verify it is working:
- grep core.*0 /proc/*/limits: this should list many processes with the core ulimit set to 0
- ps -ef | grep pg12 to locate the postgres process id
- kill -11 $PID to force the process to core dump
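Put together, the check could be run inside the database container roughly as follows (a sketch; the pg12 paths match that environment, and the process is found via its data-directory argument):

# 1. Every listed process should now show a core limit of 0.
grep 'core file size' /proc/*/limits
# 2. Locate a postgres process and force it to crash with SIGSEGV.
PID=$(ps -ef | grep '[p]g12' | awk 'NR==1 {print $2}')
kill -11 "$PID"
# 3. No new core.* file should appear in the data directory.
ls /pgdata/pg12*/core.* 2>/dev/null || echo "no core files written"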
Overview
My data volume appears to be filling up with "core.$N" files.
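A quick way to gauge how much space these dumps take, as a sketch assuming the pg13 data directory referenced below:

du -ch /pgdata/pg13/core.* 2>/dev/null | tail -n 1   # total size of accumulated core files
ls -lht /pgdata/pg13/core.* 2>/dev/null | head       # the most recent ones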
Environment
- PGO image tag: ubi8-5.0.0-0
- PostgreSQL version: 13
Steps to Reproduce
REPRO
Provide steps to get to the error condition:
1. Look in the /pgdata/pg13/ folder: notice a core.N file.
EXPECTED
Any files not required for operation of postgres or holding my data should not be kept on a persistent volume.
ACTUAL
core.N files are created in PGDATA and never cleaned up.
Logs