Open AlexFavre opened 2 years ago
Could you please provide more details. 1) What size was the raft file? Does it growing infinitely and you run out of space? 2) You have 4 nodes in raft cluster. 1 node was full space, 3 nodes were good (no space problems) but raft did not switched the leader to another node, do I understand correctly?
What size was the raft file? Does it growing infinitely and you run out of space? it's seems like that's
You have 4 nodes in raft cluster. 1 node was full space, 3 nodes were good (no space problems) but raft did not switched the leader to another node, do I understand correctly? Yes
Leader have this: ls -allh /apps/pgsql/raft/ total 907M drwxr-xr-x 2 postgres postgres 4.0K Aug 24 18:32 . drwxr-x--- 5 postgres postgres 41 Sep 20 2021 .. -rw-r--r-- 1 postgres postgres 2.3K Aug 24 18:30 1.1.1.1:5010.dump -rw-r--r-- 1 postgres postgres 1.0G Aug 24 18:32 1.1.1.1:5010.journal -rw-r--r-- 1 postgres postgres 34 Aug 24 18:32 1.1.1.1:5010.journal.meta
second server have this: ls -allh /apps/pgsql/raft/ total 76K drwxr-xr-x 2 postgres postgres 4.0K Aug 24 18:34 . drwxr-x--- 6 postgres postgres 55 Feb 11 2022 .. -rw-r--r-- 1 postgres postgres 2.3K Aug 24 18:32 1.1.1.2:5010.dump -rw-r--r-- 1 postgres postgres 64K Aug 24 18:33 1.1.1.2:5010.journal -rw-r--r-- 1 postgres postgres 34 Aug 24 18:34 1.1.1.2:5010.journal.meta
Thanks for report, this is not expected behavior, we have a log compaction, probably it's disabled in patroni by default, I will check.
Thanks for your answer
For now I haven't found the reason why file is growing, but I will continue investigating and fix this case.
Hi, Describe the bug Raft leader doesn't switch also the Postgres leader switch in RO mode We have four node for raft (two dedicated and two with postgres)
To Reproduce have a raft folder without space
Expected behavior The folder of our raft leader has been full (no space left) also the raft cluster have wait the previous leader and Patroni cluster fall in error (leader in RO mode and Replica can't start).
To fix this issue we have increase space of raft folder but we think it's not the best way. How we can purge raft file to save space ? Why raft cluster doesn't switch?
Environment
extract from Patroni configuration file
cat /etc/patroni/patroni.yml scope: PGSQLGrafanaProd namespace: /PGSQLGrafana/ name: PGSQLGrafanaProd1
raft: data_dir: /apps/pgsql/raft self_addr: 1.1.1.1:5010 partner_addrs: ['1.1.1.2:5010','1.1.1.3:5010','1.1.1.4:5010']
Have you checked Patroni logs? Before crash: Aug 22 20:39:51 s01vl9950891 systemd[1]: patroni.service: Failed with result 'signal'. Aug 22 20:39:51 s01vl9950891 systemd[1]: patroni.service: Main process exited, code=killed, status=11/SEGV Aug 22 20:39:46 s01vl9950891 patroni[1906]: 2022-08-22 20:39:46,414 INFO: no action. I am (PGSQLGrafanaProd1), the leader with the lock Aug 22 20:39:45 s01vl9950891 patroni[1906]: 2022-08-22 20:39:45,672 WARNING: Failed to store full dump Aug 22 20:39:40 s01vl9950891 patroni[1906]: 2022-08-22 20:39:40,751 WARNING: Failed to store full dump Aug 22 20:39:36 s01vl9950891 patroni[1906]: 2022-08-22 20:39:36,413 INFO: no action. I am (PGSQLGrafanaProd1), the leader with the lock Aug 22 20:39:35 s01vl9950891 patroni[1906]: 2022-08-22 20:39:35,927 WARNING: Failed to store full dump Aug 22 20:39:30 s01vl9950891 patroni[1906]: 2022-08-22 20:39:30,973 WARNING: Failed to store full dump
When we have try to run daemon manually: Aug 23 09:47:46 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:46,564 WARNING: Loop time exceeded, rescheduling immediately. Aug 23 09:47:46 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:46,562 INFO: starting after demotion in progress Aug 23 09:47:46 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:34,561 INFO: Lock owner: PGSQLGrafanaProd1; I am PGSQLGrafanaProd1 Aug 23 09:47:44 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - accepting connections Aug 23 09:47:42 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections Aug 23 09:47:41 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections Aug 23 09:47:40 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections Aug 23 09:47:39 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections Aug 23 09:47:38 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections Aug 23 09:47:37 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections Aug 23 09:47:36 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections Aug 23 09:47:36 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - rejecting connections Aug 23 09:47:35 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:35.222 CEST [1212938] HINT: Future log output will appear in directory "/apps/pgsql/log". Aug 23 09:47:35 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:35.222 CEST [1212938] LOG: redirecting log output to logging collector process Aug 23 09:47:35 s01vl9950891 patroni[1212815]: 10.244.120.156:5432 - no response Aug 23 09:47:35 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:35,064 INFO: postmaster pid=1212938 Aug 23 09:47:34 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:34,561 WARNING: Loop time exceeded, rescheduling immediately. Aug 23 09:47:34 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:34,560 INFO: closed patroni connection to the postgresql cluster Aug 23 09:47:34 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:34,559 INFO: demoted self because failed to update leader lock in DCS Aug 23 09:47:32 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:32,324 INFO: Demoting self (immediate-nolock) Aug 23 09:47:32 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:32,324 ERROR: failed to update leader lock Aug 23 09:47:32 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:22,322 INFO: Lock owner: PGSQLGrafanaProd1; I am PGSQLGrafanaProd1 Aug 23 09:47:22 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:22,205 INFO: establishing a new patroni connection to the postgres cluster Aug 23 09:47:22 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:22,100 INFO: No PostgreSQL configuration items changed, nothing to reload. Aug 23 09:47:22 s01vl9950891 patroni[1212815]: OSError: [Errno 28] No space left on device Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.incomingTransmissionFile.close() Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/serializer.py", line 193, in setTransmissionData Aug 23 09:47:22 s01vl9950891 patroni[1212815]: if self.serializer.setTransmissionData(serialized): Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/syncobj.py", line 896, in onMessageReceived Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self._onMessageReceivedCallback(node, message) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/transport.py", line 107, in _onMessageReceived Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.onMessageReceived(message) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/tcp_connection.py", line 173, in processConnection Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.descrToCallbacks[descr](descr, eventMask) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/poller.py", line 97, in poll Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self._poller.poll(timeToWait) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/syncobj.py", line 616, in _onTick Aug 23 09:47:22 s01vl9950891 patroni[1212815]: super(DynMemberSyncObj, self)._onTick(timeToWait) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib/python3.6/site-packages/patroni/dcs/raft.py", line 84, in _onTick Aug 23 09:47:22 s01vl9950891 patroni[1212815]: super(KVStoreTTL, self)._onTick(timeToWait) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib/python3.6/site-packages/patroni/dcs/raft.py", line 238, in _onTick Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self._onTick(timeToWait) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/var/lib/pgsql/.local/lib/python3.6/site-packages/pysyncobj/syncobj.py", line 532, in doTick Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.doTick(self.conf.autoTickPeriod) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib/python3.6/site-packages/patroni/dcs/raft.py", line 248, in _autoTickThread Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self._target(*self._args, **self._kwargs) Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib64/python3.6/threading.py", line 867, in run Aug 23 09:47:22 s01vl9950891 patroni[1212815]: self.run() Aug 23 09:47:22 s01vl9950891 patroni[1212815]: File "/usr/lib64/python3.6/threading.py", line 919, in _bootstrap_inner Aug 23 09:47:22 s01vl9950891 patroni[1212815]: Traceback (most recent call last): Aug 23 09:47:22 s01vl9950891 patroni[1212815]: Exception in thread Thread-2: Aug 23 09:47:17 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:17,084 INFO: waiting on raft Aug 23 09:47:12 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:12,071 INFO: waiting on raft Aug 23 09:47:07 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:07,060 INFO: waiting on raft Aug 23 09:47:02 s01vl9950891 patroni[1212815]: 2022-08-23 09:47:02,049 INFO: waiting on raft