Closed: HuangQAQ closed this issue 2 months ago
@HuangQAQ sorry to hear you are having trouble!
I did dig into this a bit, and was unable to reproduce the issue you described. More specifically, failovers appear to be working normally using the spec you provided.
Additionally, the following Patroni issue indicates that this is related to your underlying Kubernetes/OpenShift infrastructure:
https://github.com/patroni/patroni/issues/1729
ERROR: ObjectCache.run ProtocolError('Connection broken: IncompleteRead(0 bytes read)
-- this line means the WATCH connection to the API was broken, probably because the K8s master node was going down. Therefore the PATCH request was sent either to the old master node or to the new one. It looks like a concurrency issue with the K8s API: the old master node should be removed from the service before shutdown, and the new node should not be added before it is 100% ready.
I therefore recommend checking the overall health of your underlying OpenShift cluster (e.g., was maintenance impacting the Kubernetes API underway when you saw this?) to ensure requests to the Kubernetes API server are completing properly. Ultimately this appears to be an issue with your Kubernetes API (which Patroni is simply trying to interact with), rather than anything specific to CPK.
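For reference, a few read-only commands can help confirm whether the API server and nodes were healthy around the time of the failover. This is a sketch; `kubectl` and `oc` are interchangeable on OpenShift, and the output will vary by cluster:

```shell
# Check that all nodes are Ready (a powered-off node will show NotReady)
kubectl get nodes

# Query the API server's own readiness endpoint
kubectl get --raw='/readyz?verbose'

# Look for recent warning events that may point at API or node trouble
kubectl get events --all-namespaces --field-selector type=Warning
```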
@andrewlecuyer You're absolutely right! I directly shut down the physical machine hosting a k8s/openshift node, which caused the situation described above. Thank you for your response! I will look for more information from Patroni.
To clarify, the behavior I described occurs when I cut power directly to one of the three machines in the k8s cluster. Can the operator handle this kind of disaster-recovery scenario?
Hi @HuangQAQ, for this sort of problem, I wonder if you might be helped by the patroni failsafe option: https://patroni.readthedocs.io/en/master/dcs_failsafe_mode.html
(The point of the failsafe option is to avoid a failure to update the leader lock, which could happen with network failure.)
To include Patroni customization, you can follow this doc: https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/day-two/customize-cluster#custom-postgres-configuration
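As a concrete illustration, failsafe mode can be enabled through Patroni's dynamic configuration in the PostgresCluster spec. This is a minimal sketch, not a complete manifest: the cluster name `hippo` is a placeholder, and you should verify that the Patroni bundled with your image supports `failsafe_mode` (it was introduced in Patroni 3.0.0):

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo  # placeholder cluster name
spec:
  patroni:
    # Passed through to Patroni's DCS dynamic configuration
    dynamicConfiguration:
      failsafe_mode: true  # requires Patroni >= 3.0.0
```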
Environment:
- Platform: Three Node OpenShift
- Platform Version: 4.14.0
- PGO Image Tag: ubi8-5.5.0-0
- Postgres Version: 14
- CR.yaml:
Log from replica:
The readiness probe fails, but the pod cannot recover, because the "database" container does not exit.
In a three-node PostgreSQL cluster, when the primary node fails, the remaining two healthy nodes should elect a new primary. However, these two replica nodes do not elect a new primary. Instead, their database containers keep encountering errors repeatedly, and the pods do not exit to self-heal.