CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

A three-node PostgreSQL cluster cannot self-heal after a failure. #3935

Closed: HuangQAQ closed this issue 2 months ago

HuangQAQ commented 3 months ago

CR.yaml:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo-ha
  namespace: helm-charts
spec:
  postgresVersion: 14
  instances:
    - name: pgha1
      replicas: 3
      dataVolumeClaimSpec:
        storageClassName: cfs-sc-nvme
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 8Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: hippo-ha
                  postgres-operator.crunchydata.com/instance-set: pgha1
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "14"
        repo1-retention-full-type: time
      repos:
      - name: repo2
        schedules:
          full: "0 1 * * 0"
          differential: "0 1 * * 1-6"
        volume:
          volumeClaimSpec:
            storageClassName: longhorn
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 8Gi
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_connections: 10000
          shared_buffers: 256MB
        pg_hba:
          - "host all all 0.0.0.0/0 trust" # this line enabled logical replication with programmatic access
          - "host all postgres 127.0.0.1/32 md5"

log from replica:

2024-06-17 11:35:59,628 INFO: no action. I am (hippo-ha-pgha1-4mmp-0), a secondary, and following a leader (hippo-ha-pgha1-sxct-0)
2024-06-17 11:36:09,618 INFO: no action. I am (hippo-ha-pgha1-4mmp-0), a secondary, and following a leader (hippo-ha-pgha1-sxct-0)
2024-06-17 11:36:17,854 WARNING: Request failed to hippo-ha-pgha1-sxct-0: GET https://hippo-ha-pgha1-sxct-0.hippo-ha-pods:8008/patroni (HTTPSConnectionPool(host='hippo-ha-pgha1-sxct-0.hippo-ha-pods', port=8008): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f8cd511f6d8>: Failed to establish a new connection: [Errno 111] Connection refused',)))
2024-06-17 11:36:17,916 INFO: Got response from hippo-ha-pgha1-9dg7-0 https://hippo-ha-pgha1-9dg7-0.hippo-ha-pods:8008/patroni: {"state": "running", "postmaster_start_time": "2024-06-04 10:26:59.310923+00:00", "role": "replica", "server_version": 140010, "xlog": {"received_location": 75765907456, "replayed_location": 75765907456, "replayed_timestamp": "2024-06-17 11:32:44.134317+00:00", "paused": false}, "timeline": 3, "replication_state": "in archive recovery", "cluster_unlocked": true, "dcs_last_seen": 1718624177, "database_system_identifier": "7374317937362333789", "patroni": {"version": "3.1.1", "scope": "hippo-ha-ha"}}
2024-06-17 11:36:17,962 INFO: Could not take out TTL lock
2024-06-17 11:36:17,963 ERROR: ObjectCache.run ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2024-06-17 11:36:17,964 ERROR: ObjectCache.run ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2024-06-17 11:36:18,628 ERROR: Unexpected exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1871, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1783, in _run_cycle
    ret = self.process_unhealthy_cluster()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1286, in process_unhealthy_cluster
    'following new leader after trying and failing to obtain lock')
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 618, in follow
    self.state_handler.follow(node_to_follow, role, do_reload=True)
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/__init__.py", line 1087, in follow
    self.config.write_postgresql_conf()
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/config.py", line 465, in write_postgresql_conf
    self._sanitize_auto_conf()
  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/config.py", line 393, in config_writer
    yield writer
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/config.py", line 240, in __exit__
    self._fd.close()
OSError: [Errno 5] Input/output error
2024-06-17 11:36:18,635 INFO: Unexpected exception raised, please report it as a BUG
2024-06-17 11:36:18,637 INFO: Lock owner: hippo-ha-pgha1-9dg7-0; I am hippo-ha-pgha1-4mmp-0
2024-06-17 11:36:19,108 ERROR: Unexpected exception

The readiness probe reports an error, but the pod cannot recover because the "database" container does not exit.

In a three-node PostgreSQL cluster, when the primary node fails, the two remaining healthy nodes should elect a new primary. However, the two replica nodes never elect a new primary; instead, their database containers keep hitting the error above, and because the pods never exit they cannot self-heal.
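
(For reference, one way to see which member currently holds the leader lock at this point is to run patronictl inside the "database" container. This is only a diagnostic sketch; the namespace and pod name below are taken from the spec and logs above.)

# List the Patroni members, their roles, and the current leader
kubectl -n helm-charts exec -it hippo-ha-pgha1-4mmp-0 -c database -- patronictl list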

To clarify, the behavior described above occurs when I cut power to one of the three machines hosting the Kubernetes cluster. Can the operator handle this kind of disaster-recovery scenario?

andrewlecuyer commented 3 months ago

@HuangQAQ sorry to hear you are having trouble!

I did dig into this a bit, and was unable to reproduce the issue you described. More specifically, failovers appear to be working normally using the spec you provided.
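
For example, a failover can be exercised by deleting the pod that currently holds the leader role and then re-checking the roles (an illustrative sketch; the pod names come from your logs, and the role label is managed by the operator):

# Show each pod's current role via the operator-managed role label
kubectl -n helm-charts get pods -L postgres-operator.crunchydata.com/role

# Delete the current leader pod to trigger a failover, then re-check the roles
kubectl -n helm-charts delete pod hippo-ha-pgha1-sxct-0
kubectl -n helm-charts get pods -L postgres-operator.crunchydata.com/role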

Additionally, the following Patroni issue indicates that this is related to your underlying Kubernetes/OpenShift infrastructure:

https://github.com/patroni/patroni/issues/1729

ERROR: ObjectCache.run ProtocolError('Connection broken: IncompleteRead(0 bytes read) -- this line means the WATCH connection to the API was broken, probably because the K8s master node is going off. Therefore the PATCH endpoint was either sent to the old master node or to the new one. It looks like a concurrency issue with K8s API. Old master node should be removed from the service before the shutdown and the new node should not be added before it becomes 100% ready.

I therefore recommend checking the overall health of your underlying OpenShift cluster (e.g. was maintenance that affects the Kubernetes API in progress when you saw this?) to ensure requests to the Kubernetes API server are working properly. Ultimately this appears to be an issue with your Kubernetes API (which Patroni is simply trying to interact with), rather than anything specific to CPK.
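
A quick way to sanity-check API server availability from a machine with cluster access (a generic sketch, nothing CPK-specific):

# Confirm all nodes are Ready and the API server health endpoints respond
kubectl get nodes
kubectl get --raw='/readyz?verbose'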

HuangQAQ commented 3 months ago

@andrewlecuyer You're absolutely right! I directly shut down the physical machine hosting a k8s/openshift node, which caused the situation described above. Thank you for your response! I will look for more information from Patroni.

HuangQAQ commented 3 months ago

To clarify, the behavior described above occurs when I cut power to one of the three machines hosting the Kubernetes cluster. Can the operator handle this kind of disaster-recovery scenario?

benjaminjb commented 3 months ago

Hi @HuangQAQ, for this sort of problem, I wonder if you might be helped by the patroni failsafe option: https://patroni.readthedocs.io/en/master/dcs_failsafe_mode.html

(The point of the failsafe option is to keep the primary from demoting itself when it fails to update the leader lock, which can happen when the Kubernetes API acting as the DCS becomes unreachable.)

To add custom Patroni configuration, you can follow this doc: https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/day-two/customize-cluster#custom-postgres-configuration
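
Following that doc, enabling failsafe mode would look roughly like this (a minimal sketch based on the spec above; failsafe_mode requires Patroni 3.0+, which your logs show, and is passed straight through to Patroni's dynamic configuration):

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo-ha
  namespace: helm-charts
spec:
  patroni:
    dynamicConfiguration:
      # Keep the leader running when the Kubernetes API (the DCS) is unreachable,
      # as long as it can still reach the other cluster members
      failsafe_mode: true
      postgresql:
        parameters:
          max_connections: 10000
          shared_buffers: 256MB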