BCDevOps / backup-container

A simple container for a simple backup strategy.
Apache License 2.0

Backup container has lots of defunct postgres servers running #33

Closed by gil0109 4 years ago

gil0109 commented 4 years ago

Starting and shutting down the postgres server from within the backup.sh script leaves defunct processes behind. This does not occur if you start and shut down the server from inside the pod's rsh terminal. If the script starts the server (without automatically shutting it down) and you then shut the server down manually, you still end up with a defunct process.

sh-4.2$ ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1002380+      1  0.0  0.0 385116  2344 ?        Ssl  Nov23   0:00 go-crond -v --allow-unprivileged backup.conf
1002380+    501  0.0  0.0      0     0 ?        Z    Nov24   0:00 [postgres] <defunct>
1002380+    769  0.0  0.0      0     0 ?        Z    Nov24   0:00 [postgres] <defunct>
1002380+    793  0.0  0.0      0     0 ?        Zs   Nov24   0:00 [postgres] <defunct>
1002380+    840  0.0  0.0      0     0 ?        Zs   Nov24   0:00 [postgres] <defunct>
1002380+   1321  0.0  0.0      0     0 ?        Z    Nov25   0:00 [postgres] <defunct>
1002380+   1566  0.0  0.0      0     0 ?        Z    Nov25   0:00 [postgres] <defunct>
1002380+   1567  0.0  0.0      0     0 ?        Zs   Nov25   0:00 [postgres] <defunct>
1002380+   1659  0.0  0.0      0     0 ?        Zs   Nov25   0:00 [postgres] <defunct>
1002380+   2145  0.0  0.0      0     0 ?        Z    Nov26   0:00 [postgres] <defunct>
1002380+   2390  0.0  0.0      0     0 ?        Z    Nov26   0:00 [postgres] <defunct>
1002380+   2391  0.0  0.0      0     0 ?        Zs   Nov26   0:00 [postgres] <defunct>
1002380+   2461  0.0  0.0      0     0 ?        Zs   Nov26   0:00 [postgres] <defunct>
1002380+   2947  0.0  0.0      0     0 ?        Z    12:00   0:00 [postgres] <defunct>
1002380+   3192  0.0  0.0      0     0 ?        Z    12:00   0:00 [postgres] <defunct>
1002380+   3193  0.0  0.0      0     0 ?        Zs   12:00   0:00 [postgres] <defunct>
1002380+   3263  0.0  0.0      0     0 ?        Zs   12:00   0:00 [postgres] <defunct>
1002380+   3366  0.5  0.0  11832  1712 ?        Ss   18:54   0:00 /bin/sh
1002380+   3376  0.0  0.0  51756  1712 ?        R+   18:54   0:00 ps aux
sh-4.2$ 
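
To track how many defunct entries accumulate between backup runs, a listing like the one above can be filtered on its STAT column, where a leading Z marks a zombie. The following is a minimal sketch, not part of backup-container; the function name `defunct_processes` and the `SAMPLE` text (a few rows copied from the listing above) are illustrative assumptions.

```python
# Rows copied from the `ps aux` listing above; used only as sample input.
SAMPLE = """\
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1002380+      1  0.0  0.0 385116  2344 ?        Ssl  Nov23   0:00 go-crond -v --allow-unprivileged backup.conf
1002380+    501  0.0  0.0      0     0 ?        Z    Nov24   0:00 [postgres] <defunct>
1002380+    793  0.0  0.0      0     0 ?        Zs   Nov24   0:00 [postgres] <defunct>
1002380+   3366  0.5  0.0  11832  1712 ?        Ss   18:54   0:00 /bin/sh
"""


def defunct_processes(ps_output):
    """Return (pid, stat) for every zombie row in `ps aux`-style text."""
    zombies = []
    for line in ps_output.splitlines():
        # `ps aux` prints 11 columns; COMMAND may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header, shell prompts, and blank lines
        pid, stat = int(fields[1]), fields[7]
        if stat.startswith("Z"):
            zombies.append((pid, stat))
    return zombies


print(defunct_processes(SAMPLE))  # [(501, 'Z'), (793, 'Zs')]
```

Feeding it the full listing instead of the sample would report all sixteen defunct postgres entries shown above.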
WadeBarnes commented 4 years ago

Great find

WadeBarnes commented 4 years ago

Adding the link to the article you found, which may point us toward a solution;

WadeBarnes commented 4 years ago

It appears to be the postgres logger processes that are going defunct, as the successive ps aux snapshots below show:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1002170+      1  0.0  0.0 304604  3184 ?        Ssl  13:22   0:00 go-crond -v --default-user=1002170000 --allow-unprivileged backup.conf
1002170+     18  0.0  0.0  11964  1896 ?        Ss   13:22   0:00 /bin/sh
1002170+    747  0.0  0.0  11840  1836 ?        Ss+  13:24   0:00 /bin/sh
1002170+    920  0.0  0.0  11840  1532 ?        Ss+  13:27   0:00 /bin/sh
1002170+    935  0.0  0.0  11704  1220 ?        S    13:29   0:00 sh -c (./backup.sh -s -v wallet-indy-cat:5432/agent_indy_cat_wallet)
1002170+    937  0.1  0.0  12108  1596 ?        S    13:29   0:00 sh -c (./backup.sh -s -v wallet-indy-cat:5432/agent_indy_cat_wallet)
1002170+    973  0.0  0.0  11840  1716 ?        S    13:29   0:00 /bin/bash /usr/bin/run-postgresql
1002170+   1478  1.0  0.0 159240 10692 ?        S    13:30   0:00 /opt/rh/rh-postgresql10/root/usr/bin/postgres -h
1002170+   1489  0.0  0.0 116216  1784 ?        Ss   13:30   0:00 postgres: logger process   
1002170+   1491  0.0  0.0 159380  3004 ?        Ss   13:30   0:00 postgres: checkpointer process   
1002170+   1492  0.0  0.0 159240  2012 ?        Ss   13:30   0:00 postgres: writer process   
1002170+   1493  0.0  0.0 159240  2000 ?        Ss   13:30   0:00 postgres: wal writer process   
1002170+   1494  0.0  0.0 159664  2888 ?        Ss   13:30   0:00 postgres: autovacuum launcher process   
1002170+   1495  0.0  0.0 118332  1884 ?        Ss   13:30   0:00 postgres: stats collector process   
1002170+   1496  0.0  0.0 159564  2380 ?        Ss   13:30   0:00 postgres: bgworker: logical replication launcher  
1002170+   1504  0.0  0.0  72512  2800 ?        S    13:30   0:00 createdb --owner=User_AahE2aJR agent_indy_cat_wallet
1002170+   1505  4.0  0.0 160156  5404 ?        Ss   13:30   0:00 postgres: postgres postgres [local] CREATE DATABASE
1002170+   1553  0.0  0.0   4376   372 ?        S    13:30   0:00 sleep 1
1002170+   1555  0.0  0.0  51764  1724 ?        R+   13:30   0:00 ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1002170+      1  0.0  0.0 304604  3184 ?        Ssl  13:22   0:00 go-crond -v --default-user=1002170000 --allow-unprivileged backup.conf
1002170+     18  0.0  0.0  11964  1896 ?        Ss   13:22   0:00 /bin/sh
1002170+    747  0.0  0.0  11840  1836 ?        Ss+  13:24   0:00 /bin/sh
1002170+    920  0.0  0.0  11840  1532 ?        Ss+  13:27   0:00 /bin/sh
1002170+    935  0.0  0.0  11704  1220 ?        S    13:29   0:00 sh -c (./backup.sh -s -v wallet-indy-cat:5432/agent_indy_cat_wallet)
1002170+    937  0.1  0.0  12108  1596 ?        S    13:29   0:00 sh -c (./backup.sh -s -v wallet-indy-cat:5432/agent_indy_cat_wallet)
1002170+    973  0.1  0.0 159188 10604 ?        S    13:29   0:00 postgres
1002170+   1478  0.7  0.0      0     0 ?        Z    13:30   0:00 [postgres] <defunct>
1002170+   1489  0.0  0.0      0     0 ?        Zs   13:30   0:00 [postgres] <defunct>
1002170+   1578  0.0  0.0   4376   368 ?        S    13:30   0:00 sleep 1
1002170+   1586  0.0  0.0 116164  1772 ?        Ss   13:30   0:00 postgres: logger process  
1002170+   1588  0.0  0.0 159188  1992 ?        Ss   13:30   0:00 postgres: checkpointer process  
1002170+   1589  0.0  0.0 159188  1996 ?        Ss   13:30   0:00 postgres: writer process  
1002170+   1590  0.0  0.0 159188  1988 ?        Ss   13:30   0:00 postgres: wal writer process  
1002170+   1591  0.0  0.0 159612  2896 ?        Ss   13:30   0:00 postgres: autovacuum launcher process  
1002170+   1592  0.0  0.0 118416  1984 ?        Ss   13:30   0:00 postgres: stats collector process  
1002170+   1593  0.0  0.0 159600  2376 ?        Ss   13:30   0:00 postgres: bgworker: logical replication launcher  
1002170+   1594  0.0  0.0  51764  1720 ?        R+   13:30   0:00 ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1002170+      1  0.0  0.0 304604  3184 ?        Ssl  13:22   0:00 go-crond -v --default-user=1002170000 --allow-unprivileged backup.conf
1002170+     18  0.0  0.0  11964  1896 ?        Ss   13:22   0:00 /bin/sh
1002170+    747  0.0  0.0  11840  1836 ?        Ss+  13:24   0:00 /bin/sh
1002170+    920  0.0  0.0  11840  1532 ?        Ss+  13:27   0:00 /bin/sh
1002170+   1478  0.0  0.0      0     0 ?        Z    13:30   0:00 [postgres] <defunct>
1002170+   1489  0.0  0.0      0     0 ?        Zs   13:30   0:00 [postgres] <defunct>
1002170+   1586  0.0  0.0      0     0 ?        Zs   13:30   0:00 [postgres] <defunct>
1002170+   1750  0.0  0.0  51764  1720 ?        R+   13:34   0:00 ps aux
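
Comparing the snapshots above by PID shows what each defunct entry used to be: 1478 was the postgres server process and 1489 its logger, and 1586 (the logger of the second server start) is defunct in the last snapshot. That lookup can be sketched as follows. The helper names and the `BEFORE`/`AFTER` excerpts are illustrative assumptions, and the approach assumes PIDs are not reused between snapshots.

```python
# Excerpts from the first and second snapshots above; sample input only.
BEFORE = """\
1002170+   1478  1.0  0.0 159240 10692 ?        S    13:30   0:00 /opt/rh/rh-postgresql10/root/usr/bin/postgres -h
1002170+   1489  0.0  0.0 116216  1784 ?        Ss   13:30   0:00 postgres: logger process
1002170+   1491  0.0  0.0 159380  3004 ?        Ss   13:30   0:00 postgres: checkpointer process
"""
AFTER = """\
1002170+   1478  0.7  0.0      0     0 ?        Z    13:30   0:00 [postgres] <defunct>
1002170+   1489  0.0  0.0      0     0 ?        Zs   13:30   0:00 [postgres] <defunct>
1002170+   1588  0.0  0.0 159188  1992 ?        Ss   13:30   0:00 postgres: checkpointer process
"""


def commands_by_pid(ps_output):
    """Map PID -> (STAT, COMMAND) for each process row of `ps aux`-style text."""
    table = {}
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) == 11 and fields[1].isdigit():
            table[int(fields[1])] = (fields[7], fields[10].strip())
    return table


def former_identity(before, after):
    """For each zombie in `after`, return the command its PID ran in `before`."""
    earlier = commands_by_pid(before)
    return {
        pid: earlier[pid][1]
        for pid, (stat, _command) in commands_by_pid(after).items()
        if stat.startswith("Z") and pid in earlier
    }


for pid, command in former_identity(BEFORE, AFTER).items():
    print(pid, "was:", command)
```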
WadeBarnes commented 4 years ago

Three defunct processes are created per verification run. Two of those can be eliminated by turning off the logging collector (logging_collector = off) in the postgres config.

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1002170+      1  0.0  0.0 303196  8020 ?        Ssl  14:55   0:00 go-crond -v --default-user=1002170000 --allow-unprivileged backup.conf
1002170+     17  0.0  0.0  11964  1920 ?        Ss+  14:56   0:00 /bin/sh
1002170+    597  0.0  0.0  11840  1836 ?        Ss   14:57   0:00 /bin/sh
1002170+   1393  0.0  0.0      0     0 ?        Z    15:00   0:00 [postgres] <defunct>
1002170+   1645  0.0  0.0  51764  1720 ?        R+   15:02   0:00 ps aux
WadeBarnes commented 4 years ago

~This leads me to believe the issue is somehow being caused by how stdout and stderr are being redirected by the cron process.~

WadeBarnes commented 4 years ago

The real issue is described here: "Docker and the PID 1 zombie reaping problem".

Working on a fix.
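
For readers unfamiliar with the mechanism: an exited process stays in the process table as a zombie until its parent collects its exit status, and orphaned processes are re-parented to PID 1, which is go-crond in the listings above. The sketch below illustrates reaping in general and is not the fix that was applied to backup-container. It is Linux-only because it reads /proc, and the helper names (`process_state`, `reap_all`, `wait_for_state`) are assumptions made for illustration.

```python
import os
import subprocess
import sys
import time


def process_state(pid):
    """Return the state letter from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as stat_file:
            data = stat_file.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def reap_all():
    """Collect every child that has already exited and return the reaped PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until the process reaches the wanted state; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) == wanted:
            return True
        time.sleep(0.05)
    return False


if __name__ == "__main__":
    # Start a child that exits right away and deliberately do not wait on it.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    wait_for_state(child.pid, "Z")
    print("before reaping:", process_state(child.pid))  # expected to be Z
    print("reaped:", reap_all())
    print("after reaping:", process_state(child.pid))   # expected to be None
```

A process running as PID 1 that never performs the equivalent of `reap_all()` leaves every orphan it inherits in the Z state, which matches the accumulation seen in the listings above.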