Hey @David-Angel, I did a quick test using two nodes and the primary was recreated on the working node. However, I just noticed that you mentioned sync replication, so I did not test that. If you want to share your spec I'll try to replicate the issue again.
Here is a cut-down version with the relevant info.
cat <<EOF | k apply -f -
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  finalizers:
  - postgres-operator.crunchydata.com/finalizer
  name: postgres-ha
  namespace: default
spec:
  customReplicationTLSSecret:
    name: postgres-ha-repl-tls
  customTLSSecret:
    name: postgres-ha-tls
  instances:
  - dataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: local-path-with-retention
    name: test1
    replicas: 3
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          log_destination: stderr
          log_hostname: false
          log_line_prefix: 'postgres, %t, app=%a, db=%d, user=%u, host=%h, %p: '
          log_statement: none
          logging_collector: "off"
          max_connections: 300
          max_wal_size: 2GB
          max_worker_processes: 8
          pgnodemx.kdapi_enabled: true
          pgnodemx.kdapi_path: /etc/database-containerinfo
          shared_buffers: 64MB
          shared_preload_libraries: timescaledb,pgnodemx
          synchronous_commit: "on"
          timescaledb.license: apache
          timescaledb.max_background_workers: 8
          timescaledb.telemetry_level: "off"
          wal_keep_size: 10GB
      synchronous_mode: true
      synchronous_node_count: 2
    leaderLeaseDurationSeconds: 30
    port: 8008
    syncPeriodSeconds: 10
  port: 5432
  postgresVersion: 14
EOF
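In case it helps with reproducing, this is roughly how we confirm that all three instance pods come up and which one is the current leader (these selectors use the standard PGO labels as far as I know; adjust if your version labels things differently):

kubectl get pods -n default -l postgres-operator.crunchydata.com/cluster=postgres-ha -o wide
kubectl get pods -n default -l postgres-operator.crunchydata.com/cluster=postgres-ha,postgres-operator.crunchydata.com/role=master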
I tried again with sync replication enabled and was still unable to replicate the issue. Do you have any logs that might point at the issue? How are you taking the node offline?
We are using this to shut down:

kubectl patch $(k get postgrescluster -o name) --type merge --patch '{"spec":{"shutdown": true}}'
No logs to speak of. For taking the node offline, we shut it down in vSphere.
@David-Angel I did some testing for this scenario as well and wasn't able to replicate your error either. It might be helpful to have the kubectl describe output from any pending objects (such as StatefulSets, PVCs, etc.). Without more to go on, I don't think we'll be able to properly identify the source of the error. If you're able to provide more information, I'd be happy to take a closer look.
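For example, something along these lines (the object names are placeholders for whatever is stuck in your namespace):

kubectl describe pod <pending-primary-pod> -n default
kubectl describe statefulset <instance-statefulset> -n default
kubectl describe pvc <instance-pvc> -n default
kubectl get events -n default --sort-by=.lastTimestamp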
Since we haven't heard back on this issue for some time, I am closing it. If you need further assistance, feel free to re-open the issue or ask a question in our Discord server.
Overview
After performing a cluster shutdown, the cluster will never come back up if the node the leader was on is removed or cannot be brought back online.
Environment
Kubernetes, a cluster of 3 or more nodes using sync replication
Steps to Reproduce
REPRO
1. Follow the instructions for shutdown: kubectl patch postgrescluster/hippo -n postgres-operator --type merge --patch '{"spec":{"shutdown": true}}'
2. Remove or disable the node the leader was on. (This is to verify we can recover the cluster in a disaster scenario.)
3. Start the cluster back up: kubectl patch postgrescluster/hippo -n postgres-operator --type merge --patch '{"spec":{"shutdown": false}}'

The leader pod will remain in the Pending state forever. Performing a failover or switchover fails because the target replica must be online (a rough sketch of how we attempt the switchover is included below).
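For reference, this is roughly how we attempt the switchover (based on my understanding of the PGO v5 switchover API; double-check the field names against your operator version):

kubectl patch $(kubectl get postgrescluster -o name) --type merge --patch '{"spec":{"patroni":{"switchover":{"enabled": true}}}}'
kubectl annotate $(kubectl get postgrescluster -o name) postgres-operator.crunchydata.com/trigger-switchover="$(date)"

In this scenario it fails because the switchover target has to be an online replica.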
The workaround I have come up with for this is:
It would be very nice to have a simpler method of resolving this, please.