apecloud / kubeblocks

KubeBlocks is an open-source control plane software that runs and manages databases, message queues and other stateful applications on K8s.
https://kubeblocks.io
GNU Affero General Public License v3.0

[BUG] pg/redis cluster role is none after kubeblocks upgrade from 0.5 to 0.6 #4796

Closed JashBook closed 1 year ago

JashBook commented 1 year ago

Describe the bug
The pg/redis cluster role is none after KubeBlocks is upgraded from 0.5 to 0.6, an hscale is performed, and then another restart ops (e.g. vscale) is run.
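The symptom can be confirmed from the pod labels. Assuming KubeBlocks publishes the replication role via the kubeblocks.io/role pod label and labels workload pods with app.kubernetes.io/instance=&lt;cluster-name&gt; (both label names are assumptions based on common KubeBlocks conventions, not taken from this report), a quick check looks like:

    # Show a ROLE column for every pod of the two clusters;
    # affected pods are expected to show an empty value instead of primary/secondary.
    kubectl get pods -l app.kubernetes.io/instance=postgres-lyppow -L kubeblocks.io/role
    kubectl get pods -l app.kubernetes.io/instance=redis-jnhqlr -L kubeblocks.io/role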

To Reproduce
Steps to reproduce the behavior:

  1. install kubeblocks 0.5.1
    kbcli kubeblocks install --version 0.5.1 --namespace default  
  2. create pg/redis cluster
kbcli cluster create postgres-lyppow --termination-policy=WipeOut --cluster-definition=postgresql --cluster-version=postgresql-14.7.2 --set-file=-<<EOF
- name: postgresql
  componentDefRef: postgresql
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 0.5Gi
  volumeClaimTemplates:
    - name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        storageClassName: alicloud-disk-essd
EOF
kbcli cluster create redis-jnhqlr --termination-policy=WipeOut --cluster-definition=redis --set-file=-<<EOF
- name: redis
  componentDefRef: redis
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 0.5Gi
  volumeClaimTemplates:
    - name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        storageClassName: alicloud-disk-essd
- name: redis-sentinel
  componentDefRef: redis-sentinel
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 0.5Gi
  volumeClaimTemplates:
    - name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        storageClassName: alicloud-disk-essd
EOF
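Before the upgrade it is worth confirming that both clusters reach Running and that every pod already carries a role label; a minimal pre-upgrade check (same assumed label name as above) could be:

    kbcli cluster list
    kubectl get pods -L kubeblocks.io/role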
  3. upgrade kubeblocks to 0.6
    kbcli kubeblocks upgrade --version 0.6.0-beta.44  --namespace default
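After the upgrade, the operator version can be confirmed before continuing, for example with kbcli kubeblocks status (assuming this subcommand is available in the installed kbcli) or a plain deployment listing:

    kbcli kubeblocks status
    kubectl get deployments -n default | grep kubeblocks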
  4. hscale cluster pg/redis
    
    kbcli cluster hscale postgres-lyppow --replicas 3 --components postgresql --auto-approve 

kbcli cluster hscale redis-jnhqlr --auto-approve --components redis-sentinel --replicas 3 --namespace default

kbcli cluster hscale redis-jnhqlr --components redis --replicas 4 --auto-approve
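The three hscale OpsRequests can be watched until they reach Succeed; ops is the short resource name also used later in this report, and kbcli cluster list-ops is assumed to be available in this kbcli version:

    kubectl get ops -w
    kbcli cluster list-ops postgres-lyppow
    kbcli cluster list-ops redis-jnhqlr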

5. vscale cluster pg/redis (or run another restart ops)

kbcli cluster vscale postgres-lyppow --components postgresql --cpu 200m --memory 0.6Gi --auto-approve

kbcli cluster vscale redis-jnhqlr --components redis --cpu 200m --memory 0.6Gi --auto-approve

6. see error

kubectl get pod
NAME                            READY   STATUS             RESTARTS      AGE
postgres-lyppow-postgresql-0    5/5     Running            0             14m
postgres-lyppow-postgresql-1    5/5     Running            0             17m
postgres-lyppow-postgresql-2    4/5     Running            0             16m
redis-jnhqlr-redis-0            2/3     CrashLoopBackOff   7 (57s ago)   17m
redis-jnhqlr-redis-1            3/3     Running            0             18m
redis-jnhqlr-redis-2            3/3     Running            0             18m
redis-jnhqlr-redis-3            3/3     Running            0             17m
redis-jnhqlr-redis-sentinel-0   1/1     Running            0             44m
redis-jnhqlr-redis-sentinel-1   1/1     Running            0             36m
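For redis-jnhqlr-redis-0, which is stuck in CrashLoopBackOff, the pod events and the logs of the previous container instance are the first things to collect (the main container name redis is an assumption):

    kubectl describe pod redis-jnhqlr-redis-0
    kubectl logs redis-jnhqlr-redis-0 -c redis --previous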

kubectl get cluster
NAME              CLUSTER-DEFINITION   VERSION             TERMINATION-POLICY   STATUS     AGE
postgres-lyppow   postgresql           postgresql-14.7.2   Delete               Updating   55m
redis-jnhqlr      redis                redis-7.0.6         DoNotTerminate       Updating   55m

kubectl get sts
NAME                          READY   AGE
postgres-lyppow-postgresql    2/3     55m
redis-jnhqlr-redis            3/4     55m
redis-jnhqlr-redis-sentinel   3/3     55m

kubectl get ops
NAME                                      TYPE                CLUSTER           STATUS    PROGRESS   AGE
postgres-lyppow-horizontalscaling-jtsz2   HorizontalScaling   postgres-lyppow   Succeed   2/2        37m
postgres-lyppow-verticalscaling-7f6pk     VerticalScaling     postgres-lyppow   Running   2/3        18m
redis-jnhqlr-horizontalscaling-2trsw      HorizontalScaling   redis-jnhqlr      Succeed   3/3        38m
redis-jnhqlr-horizontalscaling-8d8mq      HorizontalScaling   redis-jnhqlr      Succeed   2/2        36m
redis-jnhqlr-verticalscaling-mgj9k        VerticalScaling     redis-jnhqlr      Running   3/4        19m

7. describe pg

kubectl describe cluster postgres-lyppow Name: postgres-lyppow Namespace: default Labels: clusterdefinition.kubeblocks.io/name=postgresql clusterversion.kubeblocks.io/name=postgresql-14.7.2 Annotations: kubeblocks.io/ops-request: [{"name":"postgres-lyppow-verticalscaling-7f6pk","type":"VerticalScaling"}] kubeblocks.io/reconcile: 2023-08-18T02:55:21.10250995Z API Version: apps.kubeblocks.io/v1alpha1 Kind: Cluster Metadata: Creation Timestamp: 2023-08-18T02:01:09Z Finalizers: cluster.kubeblocks.io/finalizer Generation: 4 Resource Version: 2610689 UID: 549d46e3-e147-48dd-9fc4-9399f34af341 Spec: Affinity: Pod Anti Affinity: Preferred Tenancy: SharedNode Cluster Definition Ref: postgresql Cluster Version Ref: postgresql-14.7.2 Component Specs: Class Def Ref: Class:
Component Def Ref: postgresql Monitor: false Name: postgresql No Create PDB: false Replicas: 3 Resources: Limits: Cpu: 200m Memory: 644245094400m Requests: Cpu: 200m Memory: 644245094400m Service Account Name: kb-sa-postgres-lyppow Volume Claim Templates: Name: data Spec: Access Modes: ReadWriteOnce Resources: Requests: Storage: 20Gi Storage Class Name: alicloud-disk-essd Monitor: Resources: Cpu: 0 Memory: 0 Storage: Size: 0 Termination Policy: Delete Status: Cluster Def Generation: 3 Components: Postgresql: Phase: Updating Pods Ready: false Replication Set Status: Primary: Pod: postgres-lyppow-postgresql-0 Secondaries: Pod: postgres-lyppow-postgresql-2 Pod: postgres-lyppow-postgresql-1 Conditions: Last Transition Time: 2023-08-18T02:38:10Z Message: VerticalScaling opsRequest: postgres-lyppow-verticalscaling-7f6pk is processing Reason: VerticalScaling Status: False Type: LatestOpsRequestProcessed Last Transition Time: 2023-08-18T02:01:09Z Message: The operator has started the provisioning of Cluster: postgres-lyppow Observed Generation: 4 Reason: PreCheckSucceed Status: True Type: ProvisioningStarted Last Transition Time: 2023-08-18T02:41:22Z Message: the number of primary pod is not equal to 1, primary pods: [], emptyRole pods: [postgres-lyppow-postgresql-0] Reason: ApplyResourcesFailed Status: False Type: ApplyResources Last Transition Time: 2023-08-18T02:38:12Z Message: pods are not ready in Components: [postgresql], refer to related component message in Cluster.status.components Reason: ReplicasNotReady Status: False Type: ReplicasReady Last Transition Time: 2023-08-18T02:38:12Z Message: pods are unavailable in Components: [postgresql], refer to related component message in Cluster.status.components Reason: ComponentsNotReady Status: False Type: Ready Observed Generation: 4 Phase: Updating Events: Type Reason Age From Message


Normal PreCheckSucceed 58m cluster-controller The operator has started the provisioning of Cluster: postgres-lyppow Normal ApplyResourcesSucceed 58m cluster-controller Successfully applied for resources Normal AllReplicasReady 56m cluster-controller all pods of components are ready, waiting for the probe detection successful Normal ClusterReady 56m cluster-controller Cluster: postgres-lyppow is ready, current phase is Running Normal Running 56m cluster-controller Cluster: postgres-lyppow is ready, current phase is Running Normal SysAcctCreate 56m system-account-controller Created Accounts for cluster: postgres-lyppow, component: postgresql, accounts: kbadmin Normal SysAcctCreate 56m system-account-controller Created Accounts for cluster: postgres-lyppow, component: postgresql, accounts: kbprobe Normal SysAcctCreate 55m system-account-controller Created Accounts for cluster: postgres-lyppow, component: postgresql, accounts: kbreplicator Normal SysAcctCreate 55m system-account-controller Created Accounts for cluster: postgres-lyppow, component: postgresql, accounts: kbdataprotection Normal SysAcctCreate 55m system-account-controller Created Accounts for cluster: postgres-lyppow, component: postgresql, accounts: kbmonitoring Normal ComponentPhaseTransition 50m cluster-controller Running: false, PodsReady: false, PodsTimedout: false Warning Failed 50m cluster-controller Cluster: postgres-lyppow is Failed, check according to the components message Normal HorizontalScaling 40m ops-request-controller Start to process the HorizontalScaling opsRequest "postgres-lyppow-horizontalscaling-jtsz2" in Cluster: postgres-lyppow Warning ReplicasNotReady 40m (x2 over 50m) cluster-controller pods are not ready in Components: [postgresql], refer to related component message in Cluster.status.components Warning ComponentsNotReady 40m (x2 over 50m) cluster-controller pods are unavailable in Components: [postgresql], refer to related component message in Cluster.status.components Normal HorizontalScale 40m (x2 over 40m) cluster-controller start horizontal scale component postgresql of cluster postgres-lyppow from 1 to 3 Warning ApplyResourcesFailed 39m cluster-controller Operation cannot be fulfilled on statefulsets.apps "postgres-lyppow-postgresql": the object has been modified; please apply your changes to the latest version and try again Normal ClusterReady 38m (x2 over 48m) cluster-controller Cluster: postgres-lyppow is ready, current phase is Running Normal Running 38m (x2 over 48m) cluster-controller Cluster: postgres-lyppow is ready, current phase is Running Normal AllReplicasReady 38m (x2 over 48m) cluster-controller all pods of components are ready, waiting for the probe detection successful Normal Processed 38m cluster-controller HorizontalScaling opsRequest: postgres-lyppow-horizontalscaling-jtsz2 has been processed Normal ComponentPhaseTransition 38m (x2 over 48m) cluster-controller Running: true, PodsReady: true, PodsTimedout: false Normal ApplyResourcesSucceed 21m (x3 over 40m) cluster-controller Successfully applied for resources Normal PreCheckSucceed 21m (x2 over 40m) cluster-controller The operator has started the provisioning of Cluster: postgres-lyppow Normal ComponentPhaseTransition 21m (x2 over 40m) cluster-controller Component workload updated Normal VerticalScaling 21m ops-request-controller Start to process the VerticalScaling opsRequest "postgres-lyppow-verticalscaling-7f6pk" in Cluster: postgres-lyppow Warning VerticalScaling 21m cluster-controller VerticalScaling opsRequest: 
postgres-lyppow-verticalscaling-7f6pk is processing Normal WaitingForProbeSuccess 21m (x4 over 50m) cluster-controller Waiting for probe success Warning Unhealthy 15m (x2 over 17m) event-controller Pod postgres-lyppow-postgresql-1: Readiness probe failed: 127.0.0.1:5432 - no response Warning Unhealthy 15m (x2 over 15m) event-controller Pod postgres-lyppow-postgresql-1: Readiness probe failed: 127.0.0.1:5432 - rejecting connections Warning Unhealthy 4m4s (x4 over 16m) event-controller Pod postgres-lyppow-postgresql-2: Readiness probe failed: 127.0.0.1:5432 - no response Warning ApplyResourcesFailed 17s (x222 over 18m) cluster-controller the number of primary pod is not equal to 1, primary pods: [], emptyRole pods: [postgres-lyppow-postgresql-0]
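The ApplyResourcesFailed condition names postgres-lyppow-postgresql-0 as an emptyRole pod, so its labels can be dumped directly to verify that the role label is missing (assuming kubeblocks.io/role is the label the controller checks):

    kubectl get pod postgres-lyppow-postgresql-0 --show-labels
    kubectl get pod postgres-lyppow-postgresql-0 -o jsonpath='{.metadata.labels}'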

logs pg
pod0

kubectl logs postgres-lyppow-postgresql-0 Defaulted container "postgresql" out of: postgresql, pgbouncer, metrics, kb-checkrole, config-manager, pg-init-container (init) 2023-08-18 02:41:59,051 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?) 2023-08-18 02:42:01,055 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup 2023-08-18 02:42:01,055 - bootstrapping - INFO - No meta-data available for this provider 2023-08-18 02:42:01,055 - bootstrapping - INFO - Looks like you are running local 2023-08-18 02:42:01,071 - bootstrapping - INFO - kubeblocks generate local configuration: bootstrap: dcs: postgresql: parameters: archive_command: /bin/true archive_mode: 'on' autovacuum_analyze_scale_factor: '0.05' autovacuum_max_workers: '1' autovacuum_vacuum_scale_factor: '0.1' checkpoint_completion_target: '0.95' hot_standby: 'on' log_autovacuum_min_duration: 1s log_checkpoints: 'True' log_lock_waits: 'True' log_min_duration_statement: '100' log_statement: ddl max_connections: '10000' max_locks_per_transaction: '128' max_prepared_transactions: '0' max_replication_slots: '16' max_wal_senders: '24' max_worker_processes: '8' track_commit_timestamp: 'True' track_functions: all wal_keep_size: 20GB wal_level: replica wal_log_hints: 'on' initdb:

2023-08-18 02:42:01,156 - bootstrapping - INFO - Configuring patroni 2023-08-18 02:42:01,163 - bootstrapping - INFO - Writing to file /run/postgres.yml 2023-08-18 02:42:01,163 - bootstrapping - INFO - Configuring pgqd 2023-08-18 02:42:01,163 - bootstrapping - INFO - Configuring pam-oauth2 2023-08-18 02:42:01,163 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping 2023-08-18 02:42:01,163 - bootstrapping - INFO - Configuring certificate 2023-08-18 02:42:01,163 - bootstrapping - INFO - Generating ssl self-signed certificate 2023-08-18 02:42:02,163 - bootstrapping - INFO - Configuring log 2023-08-18 02:42:02,163 - bootstrapping - INFO - Configuring standby-cluster 2023-08-18 02:42:02,163 - bootstrapping - INFO - Configuring crontab 2023-08-18 02:42:02,163 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability 2023-08-18 02:42:02,164 - bootstrapping - INFO - Configuring wal-e 2023-08-18 02:42:02,164 - bootstrapping - INFO - Configuring bootstrap 2023-08-18 02:42:02,164 - bootstrapping - INFO - Configuring pgbouncer 2023-08-18 02:42:02,164 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping 2023-08-18 02:42:03,263 INFO: Selected new K8s API server endpoint https://192.168.0.136:6443 2023-08-18 02:42:03,367 INFO: No PostgreSQL configuration items changed, nothing to reload. 2023-08-18 02:42:03,371 WARNING: Postgresql is not running. 2023-08-18 02:42:03,371 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-0 2023-08-18 02:42:03,372 INFO: pg_controldata: pg_control version number: 1300 Catalog version number: 202107181 Database system identifier: 7268476893860360275 Database cluster state: shut down pg_control last modified: Fri Aug 18 02:40:51 2023 Latest checkpoint location: 0/7000028 Latest checkpoint's REDO location: 0/7000028 Latest checkpoint's REDO WAL file: 000000030000000000000007 Latest checkpoint's TimeLineID: 3 Latest checkpoint's PrevTimeLineID: 3 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:1434 Latest checkpoint's NextOID: 16907 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 727 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 0 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid: 734 Latest checkpoint's newestCommitTsXid: 1433 Time of latest checkpoint: Fri Aug 18 02:40:51 2023 Fake LSN counter for unlogged rels: 0/3E8 Minimum recovery ending location: 0/0 Min recovery ending loc's timeline: 0 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: replica wal_log_hints setting: on max_connections setting: 10000 max_worker_processes setting: 8 max_wal_senders setting: 24 max_prepared_xacts setting: 0 max_locks_per_xact setting: 128 track_commit_timestamp setting: on Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float8 argument passing: by value Data page checksum version: 0 Mock authentication nonce: 5b8a0aa36f561fe90d3d3fb50a91ab4fdc066497295008544cfddc95741fe3f8

2023-08-18 02:42:03,373 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-0 2023-08-18 02:42:03,456 INFO: Local timeline=3 lsn=0/7000028 2023-08-18 02:42:03,590 INFO: primary_timeline=5 2023-08-18 02:42:03,593 INFO: primary: history=1 0/20000A0 no recovery target specified 2 0/30000A0 no recovery target specified 3 0/70000A0 no recovery target specified 4 0/701FFD8 no recovery target specified 2023-08-18 02:42:03,593 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-0 2023-08-18 02:42:03,658 INFO: starting as a secondary 2023-08-18 02:42:04 GMT [106]: [1-1] 64deda7c.6a 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter... 2023-08-18 02:42:04 GMT [106]: [2-1] 64deda7c.6a 0 LOG: pg_stat_kcache.linux_hz is set to 1000000 2023-08-18 02:42:04,657 INFO: postmaster pid=106 /var/run/postgresql:5432 - no response /var/run/postgresql:5432 - no response 2023-08-18 02:42:06 GMT [106]: [3-1] 64deda7c.6a 0 LOG: redirecting log output to logging collector process 2023-08-18 02:42:06 GMT [106]: [4-1] 64deda7c.6a 0 HINT: Future log output will appear in directory "../pg_log". /var/run/postgresql:5432 - accepting connections /var/run/postgresql:5432 - accepting connections 2023-08-18 02:42:06,951 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-0 2023-08-18 02:42:06,951 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:42:06,991 INFO: Dropped unknown replication slot 'postgres_lyppow_postgresql_2' 2023-08-18 02:42:07,051 INFO: Dropped unknown replication slot 'postgres_lyppow_postgresql_1' 2023-08-18 02:42:07,152 INFO: no action. I am (postgres-lyppow-postgresql-0), a secondary, and following a leader (postgres-lyppow-postgresql-1) 2023-08-18 02:42:08,450 INFO: no action. I am (postgres-lyppow-postgresql-0), a secondary, and following a leader (postgres-lyppow-postgresql-1) 2023-08-18 02:42:18,457 INFO: no action. I am (postgres-lyppow-postgresql-0), a secondary, and following a leader (postgres-lyppow-postgresql-1) 2023-08-18 02:42:28,539 INFO: no action. I am (postgres-lyppow-postgresql-0), a secondary, and following a leader (postgres-lyppow-postgresql-1) 2023-08-18 02:42:38,524 INFO: no action. I am (postgres-lyppow-postgresql-0), a secondary, and following a leader (postgres-lyppow-postgresql-1) 2023-08-18 02:42:48,526 INFO: no action. I am (postgres-lyppow-postgresql-0), a secondary, and following a leader (postgres-lyppow-postgresql-1) 2023-08-18 02:42:58,454 INFO: no action. I am (postgres-lyppow-postgresql-0), a secondary, and following a leader (postgres-lyppow-postgresql-1) 2023-08-18 02:43:08,608 INFO: no action. I am (postgres-lyppow-postgresql-0), a secondary, and following a leader (postgres-lyppow-postgresql-1)

pod1

kubectl logs postgres-lyppow-postgresql-1 Defaulted container "postgresql" out of: postgresql, pgbouncer, metrics, kb-checkrole, config-manager, pg-init-container (init) postgres-lyppow-postgresql-0.postgres-lyppow-postgresql-headless.default.svc:5432 - accepting connections 2023-08-18 02:38:52,400 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?) 2023-08-18 02:38:54,404 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup 2023-08-18 02:38:54,405 - bootstrapping - INFO - No meta-data available for this provider 2023-08-18 02:38:54,405 - bootstrapping - INFO - Looks like you are running local 2023-08-18 02:38:54,421 - bootstrapping - INFO - kubeblocks generate local configuration: bootstrap: dcs: postgresql: parameters: archive_command: /bin/true archive_mode: 'on' autovacuum_analyze_scale_factor: '0.05' autovacuum_max_workers: '1' autovacuum_vacuum_scale_factor: '0.1' checkpoint_completion_target: '0.95' hot_standby: 'on' log_autovacuum_min_duration: 1s log_checkpoints: 'True' log_lock_waits: 'True' log_min_duration_statement: '100' log_statement: ddl max_connections: '10000' max_locks_per_transaction: '128' max_prepared_transactions: '0' max_replication_slots: '16' max_wal_senders: '24' max_worker_processes: '8' track_commit_timestamp: 'True' track_functions: all wal_keep_size: 20GB wal_level: replica wal_log_hints: 'on' initdb:

2023-08-18 02:38:54,496 - bootstrapping - INFO - Configuring crontab 2023-08-18 02:38:54,496 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability 2023-08-18 02:38:54,497 - bootstrapping - INFO - Configuring certificate 2023-08-18 02:38:54,497 - bootstrapping - INFO - Generating ssl self-signed certificate 2023-08-18 02:38:55,696 - bootstrapping - INFO - Configuring patroni 2023-08-18 02:38:55,703 - bootstrapping - INFO - Writing to file /run/postgres.yml 2023-08-18 02:38:55,703 - bootstrapping - INFO - Configuring pgqd 2023-08-18 02:38:55,703 - bootstrapping - INFO - Configuring bootstrap 2023-08-18 02:38:55,704 - bootstrapping - INFO - Configuring wal-e 2023-08-18 02:38:55,704 - bootstrapping - INFO - Configuring pam-oauth2 2023-08-18 02:38:55,704 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping 2023-08-18 02:38:55,704 - bootstrapping - INFO - Configuring log 2023-08-18 02:38:55,704 - bootstrapping - INFO - Configuring standby-cluster 2023-08-18 02:38:55,704 - bootstrapping - INFO - Configuring pgbouncer 2023-08-18 02:38:55,704 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping 2023-08-18 02:38:57,725 INFO: Selected new K8s API server endpoint https://192.168.0.135:6443 2023-08-18 02:38:57,806 INFO: No PostgreSQL configuration items changed, nothing to reload. 2023-08-18 02:38:57,811 WARNING: Postgresql is not running. 2023-08-18 02:38:57,811 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-1 2023-08-18 02:38:57,812 INFO: pg_controldata: pg_control version number: 1300 Catalog version number: 202107181 Database system identifier: 7268476893860360275 Database cluster state: shut down in recovery pg_control last modified: Fri Aug 18 02:38:11 2023 Latest checkpoint location: 0/6000098 Latest checkpoint's REDO location: 0/6000060 Latest checkpoint's REDO WAL file: 000000030000000000000006 Latest checkpoint's TimeLineID: 3 Latest checkpoint's PrevTimeLineID: 3 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:1434 Latest checkpoint's NextOID: 24999 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 727 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 1434 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid: 734 Latest checkpoint's newestCommitTsXid: 1433 Time of latest checkpoint: Fri Aug 18 02:30:00 2023 Fake LSN counter for unlogged rels: 0/3E8 Minimum recovery ending location: 0/6000148 Min recovery ending loc's timeline: 3 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: replica wal_log_hints setting: on max_connections setting: 10000 max_worker_processes setting: 8 max_wal_senders setting: 24 max_prepared_xacts setting: 0 max_locks_per_xact setting: 128 track_commit_timestamp setting: on Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float8 argument passing: by value Data page checksum version: 0 Mock authentication nonce: 5b8a0aa36f561fe90d3d3fb50a91ab4fdc066497295008544cfddc95741fe3f8

2023-08-18 02:38:57,813 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-1 2023-08-18 02:38:57,897 INFO: Local timeline=3 lsn=0/6000148 2023-08-18 02:38:57,914 INFO: primary_timeline=3 2023-08-18 02:38:57,914 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-1 2023-08-18 02:38:57,996 INFO: starting as a secondary 2023-08-18 02:38:59 GMT [64]: [1-1] 64ded9c3.40 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter... 2023-08-18 02:38:59 GMT [64]: [2-1] 64ded9c3.40 0 LOG: pg_stat_kcache.linux_hz is set to 1000000 2023-08-18 02:38:59,197 INFO: postmaster pid=64 /var/run/postgresql:5432 - no response /var/run/postgresql:5432 - no response /var/run/postgresql:5432 - no response 2023-08-18 02:39:01 GMT [64]: [3-1] 64ded9c3.40 0 LOG: redirecting log output to logging collector process 2023-08-18 02:39:01 GMT [64]: [4-1] 64ded9c3.40 0 HINT: Future log output will appear in directory "../pg_log". 2023-08-18 02:39:02,279 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-1 2023-08-18 02:39:02,354 INFO: restarting after failure in progress /var/run/postgresql:5432 - accepting connections /var/run/postgresql:5432 - accepting connections 2023-08-18 02:39:02,692 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-1 2023-08-18 02:39:02,692 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:39:02,893 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:39:12,288 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:39:22,286 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:39:32,286 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:39:42,285 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:39:52,284 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:02,287 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:12,285 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:22,286 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:32,286 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:42,285 INFO: no action. 
I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:52,338 INFO: Got response from postgres-lyppow-postgresql-2 http://192.168.0.251:8008/patroni: {"state": "running", "postmaster_start_time": "2023-08-18 02:40:01.804025+00:00", "role": "replica", "server_version": 140008, "xlog": {"received_location": 117440672, "replayed_location": 117440672, "replayed_timestamp": null, "paused": false}, "timeline": 3, "dcs_last_seen": 1692326452, "database_system_identifier": "7268476893860360275", "patroni": {"version": "3.0.2", "scope": "postgres-lyppow-postgresql-patronif34af341"}} 2023-08-18 02:40:52,399 WARNING: Request failed to postgres-lyppow-postgresql-0: GET http://192.168.0.4:8008/patroni (HTTPConnectionPool(host='192.168.0.4', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))) 2023-08-18 02:40:52,447 INFO: Could not take out TTL lock 2023-08-18 02:40:52,448 ERROR: ObjectCache.run ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)) 2023-08-18 02:40:52,449 ERROR: ObjectCache.run ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)) server signaled 2023-08-18 02:40:52,599 INFO: following new leader after trying and failing to obtain lock 2023-08-18 02:40:52,696 INFO: Lock owner: postgres-lyppow-postgresql-2; I am postgres-lyppow-postgresql-1 2023-08-18 02:40:52,702 INFO: Local timeline=3 lsn=0/70000A0 2023-08-18 02:40:52,791 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-2) 2023-08-18 02:40:54,244 INFO: Lock owner: postgres-lyppow-postgresql-2; I am postgres-lyppow-postgresql-1 2023-08-18 02:40:54,246 INFO: Local timeline=3 lsn=0/70000A0 2023-08-18 02:40:54,247 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-2) 2023-08-18 02:41:04,749 INFO: Lock owner: postgres-lyppow-postgresql-2; I am postgres-lyppow-postgresql-1 2023-08-18 02:41:04,749 INFO: Local timeline=4 lsn=0/70000A0 2023-08-18 02:41:07,217 ERROR: Exception when working with leader Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/rewind.py", line 69, in check_leader_is_not_in_recovery with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', conn_kwargs) as cur: File "/usr/lib/python3.10/contextlib.py", line 135, in enter return next(self.gen) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor conn = psycopg.connect(kwargs) File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 42, in connect ret = _connect(*args, kwargs) File "/usr/lib/python3/dist-packages/psycopg2/init.py", line 122, in connect conn = _connect(dsn, connection_factory=connection_factory, kwasync) psycopg2.OperationalError: connection to server at "192.168.0.251", port 5432 failed: timeout expired

2023-08-18 02:41:07,220 INFO: Selected new K8s API server endpoint https://192.168.0.136:6443 2023-08-18 02:41:07,221 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-2) 2023-08-18 02:41:14,756 INFO: Lock owner: postgres-lyppow-postgresql-2; I am postgres-lyppow-postgresql-1 2023-08-18 02:41:14,756 INFO: Local timeline=4 lsn=0/701FFD8 2023-08-18 02:41:17,790 ERROR: Exception when working with leader Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/rewind.py", line 69, in check_leader_is_not_in_recovery with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', conn_kwargs) as cur: File "/usr/lib/python3.10/contextlib.py", line 135, in enter return next(self.gen) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor conn = psycopg.connect(kwargs) File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 42, in connect ret = _connect(*args, kwargs) File "/usr/lib/python3/dist-packages/psycopg2/init.py", line 122, in connect conn = _connect(dsn, connection_factory=connection_factory, kwasync) psycopg2.OperationalError: connection to server at "192.168.0.251", port 5432 failed: timeout expired

2023-08-18 02:41:17,997 INFO: no action. I am (postgres-lyppow-postgresql-1), a secondary, and following a leader (postgres-lyppow-postgresql-2) 2023-08-18 02:41:26,752 WARNING: Request failed to postgres-lyppow-postgresql-2: GET http://192.168.0.251:8008/patroni (HTTPConnectionPool(host='192.168.0.251', port=8008): Max retries exceeded with url: /patroni (Caused by ReadTimeoutError("HTTPConnectionPool(host='192.168.0.251', port=8008): Read timed out. (read timeout=2)"))) 2023-08-18 02:41:26,800 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'" 2023-08-18 02:41:26,877 INFO: promoted self to leader by acquiring session lock server promoting 2023-08-18 02:41:26,881 INFO: cleared rewind state after becoming the leader 2023-08-18 02:41:26,878 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-1 2023-08-18 02:41:27,197 INFO: updated leader lock during promote 2023-08-18 02:41:27,895 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-1 2023-08-18 02:41:28,691 ERROR: Failed to create physical replication slot 'postgres_lyppow_postgresql_0' Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 295, in _query cursor.execute(sql, params or None) psycopg2.OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/slots.py", line 219, in _ensure_physical_slots self._query(("SELECT pg_catalog.pg_create_physical_replication_slot(%s{0})" + File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/slots.py", line 115, in _query return self._postgresql.query(sql, params, retry=False) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 313, in query return self._query(sql, args) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 309, in _query raise PostgresConnectionException('connection problems') patroni.exceptions.PostgresConnectionException: 'connection problems' WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. 2023-08-18 02:41:28,698 ERROR: Exception during CHECKPOINT Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 646, in checkpoint with get_connection_cursor(connect_kwargs) as cur: File "/usr/lib/python3.10/contextlib.py", line 135, in enter return next(self.gen) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor conn = psycopg.connect(kwargs) File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 42, in connect ret = _connect(*args, kwargs) File "/usr/lib/python3/dist-packages/psycopg2/init.py", line 122, in connect conn = _connect(dsn, connection_factory=connection_factory, kwasync) psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

2023-08-18 02:41:28,895 INFO: no action. I am (postgres-lyppow-postgresql-1), the leader with the lock psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is in recovery mode psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: the database system is in recovery mode 2023-08-18 02:41:38,400 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-1 2023-08-18 02:41:38,400 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:41:38,536 INFO: no action. I am (postgres-lyppow-postgresql-1), the leader with the lock 2023-08-18 02:41:48,692 INFO: no action. I am (postgres-lyppow-postgresql-1), the leader with the lock 2023-08-18 02:41:56.926 UTC [32] LOG Starting pgqd 3.5 2023-08-18 02:41:56.926 UTC [32] LOG auto-detecting dbs ... 2023-08-18 02:41:58,447 INFO: no action. I am (postgres-lyppow-postgresql-1), the leader with the lock 2023-08-18 02:42:08,449 INFO: no action. I am (postgres-lyppow-postgresql-1), the leader with the lock 2023-08-18 02:42:18,400 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-1 2023-08-18 02:42:18,400 ERROR: Exception when called state_handler.last_operation() Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 243, in update_lock last_lsn = self.state_handler.last_operation() File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 1070, in last_operation return self._wal_position(self.is_leader(), self._cluster_info_state_get('wal_position'), File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 419, in is_leader return bool(self._cluster_info_state_get('timeline')) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 371, in _cluster_info_state_get result = self._is_leader_retry(self._query, self.cluster_info_query).fetchone() File "/usr/local/lib/python3.10/dist-packages/patroni/utils.py", line 334, in call return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 306, in _query raise e File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 295, in _query cursor.execute(sql, params or None) psycopg2.DatabaseError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

2023-08-18 02:42:18,880 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:42:20,199 WARNING: Retry got exception: 'connection problems' /var/run/postgresql:5432 - rejecting connections 2023-08-18 02:42:20,490 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 2023-08-18 02:42:20,699 INFO: Error communicating with PostgreSQL. Will try again later 2023-08-18 02:42:26.956 UTC [32] LOG {ticks: 0, maint: 0, retry: 0} /var/run/postgresql:5432 - accepting connections 2023-08-18 02:42:28,407 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-1 2023-08-18 02:42:28,407 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:42:28,597 INFO: no action. I am (postgres-lyppow-postgresql-1), the leader with the lock 2023-08-18 02:42:38,400 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-1 2023-08-18 02:42:38,400 ERROR: Exception when called state_handler.last_operation() Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 243, in update_lock last_lsn = self.state_handler.last_operation() File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 1070, in last_operation return self._wal_position(self.is_leader(), self._cluster_info_state_get('wal_position'), File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 419, in is_leader return bool(self._cluster_info_state_get('timeline')) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 371, in _cluster_info_state_get result = self._is_leader_retry(self._query, self.cluster_info_query).fetchone() File "/usr/local/lib/python3.10/dist-packages/patroni/utils.py", line 334, in call return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 306, in _query raise e File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/init.py", line 295, in _query cursor.execute(sql, params or None) psycopg2.DatabaseError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

pod2

kubectl logs postgres-lyppow-postgresql-2 Defaulted container "postgresql" out of: postgresql, pgbouncer, metrics, kb-checkrole, config-manager, pg-init-container (init) postgres-lyppow-postgresql-0.postgres-lyppow-postgresql-headless.default.svc:5432 - accepting connections 2023-08-18 02:39:53,395 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?) 2023-08-18 02:39:55,400 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup 2023-08-18 02:39:55,400 - bootstrapping - INFO - No meta-data available for this provider 2023-08-18 02:39:55,401 - bootstrapping - INFO - Looks like you are running local 2023-08-18 02:39:55,417 - bootstrapping - INFO - kubeblocks generate local configuration: bootstrap: dcs: postgresql: parameters: archive_command: /bin/true archive_mode: 'on' autovacuum_analyze_scale_factor: '0.05' autovacuum_max_workers: '1' autovacuum_vacuum_scale_factor: '0.1' checkpoint_completion_target: '0.95' hot_standby: 'on' log_autovacuum_min_duration: 1s log_checkpoints: 'True' log_lock_waits: 'True' log_min_duration_statement: '100' log_statement: ddl max_connections: '10000' max_locks_per_transaction: '128' max_prepared_transactions: '0' max_replication_slots: '16' max_wal_senders: '24' max_worker_processes: '8' track_commit_timestamp: 'True' track_functions: all wal_keep_size: 20GB wal_level: replica wal_log_hints: 'on' initdb:

2023-08-18 02:39:55,508 - bootstrapping - INFO - Configuring standby-cluster 2023-08-18 02:39:55,508 - bootstrapping - INFO - Configuring bootstrap 2023-08-18 02:39:55,508 - bootstrapping - INFO - Configuring wal-e 2023-08-18 02:39:55,508 - bootstrapping - INFO - Configuring pam-oauth2 2023-08-18 02:39:55,508 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping 2023-08-18 02:39:55,508 - bootstrapping - INFO - Configuring pgbouncer 2023-08-18 02:39:55,508 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping 2023-08-18 02:39:55,508 - bootstrapping - INFO - Configuring pgqd 2023-08-18 02:39:55,509 - bootstrapping - INFO - Configuring certificate 2023-08-18 02:39:55,509 - bootstrapping - INFO - Generating ssl self-signed certificate 2023-08-18 02:39:56,585 - bootstrapping - INFO - Configuring crontab 2023-08-18 02:39:56,586 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability 2023-08-18 02:39:56,586 - bootstrapping - INFO - Configuring log 2023-08-18 02:39:56,586 - bootstrapping - INFO - Configuring patroni 2023-08-18 02:39:56,593 - bootstrapping - INFO - Writing to file /run/postgres.yml 2023-08-18 02:39:58,602 INFO: Selected new K8s API server endpoint https://192.168.0.136:6443 2023-08-18 02:39:58,784 INFO: No PostgreSQL configuration items changed, nothing to reload. 2023-08-18 02:39:58,789 WARNING: Postgresql is not running. 2023-08-18 02:39:58,789 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-2 2023-08-18 02:39:58,790 INFO: pg_controldata: pg_control version number: 1300 Catalog version number: 202107181 Database system identifier: 7268476893860360275 Database cluster state: shut down in recovery pg_control last modified: Fri Aug 18 02:39:04 2023 Latest checkpoint location: 0/6000098 Latest checkpoint's REDO location: 0/6000060 Latest checkpoint's REDO WAL file: 000000030000000000000006 Latest checkpoint's TimeLineID: 3 Latest checkpoint's PrevTimeLineID: 3 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:1434 Latest checkpoint's NextOID: 24999 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 727 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 1434 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid: 734 Latest checkpoint's newestCommitTsXid: 1433 Time of latest checkpoint: Fri Aug 18 02:30:00 2023 Fake LSN counter for unlogged rels: 0/3E8 Minimum recovery ending location: 0/6000148 Min recovery ending loc's timeline: 3 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: replica wal_log_hints setting: on max_connections setting: 10000 max_worker_processes setting: 8 max_wal_senders setting: 24 max_prepared_xacts setting: 0 max_locks_per_xact setting: 128 track_commit_timestamp setting: on Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float8 argument passing: by value Data page checksum version: 0 Mock authentication nonce: 5b8a0aa36f561fe90d3d3fb50a91ab4fdc066497295008544cfddc95741fe3f8

2023-08-18 02:39:58,791 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-2 2023-08-18 02:39:58,798 INFO: Local timeline=3 lsn=0/6000148 2023-08-18 02:39:58,894 INFO: primary_timeline=3 2023-08-18 02:39:58,895 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-2 2023-08-18 02:39:58,936 INFO: starting as a secondary 2023-08-18 02:39:59 GMT [56]: [1-1] 64ded9ff.38 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter... 2023-08-18 02:39:59 GMT [56]: [2-1] 64ded9ff.38 0 LOG: pg_stat_kcache.linux_hz is set to 1000000 2023-08-18 02:39:59,979 INFO: postmaster pid=56 /var/run/postgresql:5432 - no response /var/run/postgresql:5432 - no response 2023-08-18 02:40:01 GMT [56]: [3-1] 64ded9ff.38 0 LOG: redirecting log output to logging collector process 2023-08-18 02:40:01 GMT [56]: [4-1] 64ded9ff.38 0 HINT: Future log output will appear in directory "../pg_log". /var/run/postgresql:5432 - accepting connections /var/run/postgresql:5432 - accepting connections 2023-08-18 02:40:02,093 INFO: Lock owner: postgres-lyppow-postgresql-0; I am postgres-lyppow-postgresql-2 2023-08-18 02:40:02,093 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:40:02,280 INFO: no action. I am (postgres-lyppow-postgresql-2), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:02,284 INFO: no action. I am (postgres-lyppow-postgresql-2), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:12,287 INFO: no action. I am (postgres-lyppow-postgresql-2), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:22,306 INFO: no action. I am (postgres-lyppow-postgresql-2), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:32,286 INFO: no action. I am (postgres-lyppow-postgresql-2), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:42,298 INFO: no action. I am (postgres-lyppow-postgresql-2), a secondary, and following a leader (postgres-lyppow-postgresql-0) 2023-08-18 02:40:52,344 INFO: Got response from postgres-lyppow-postgresql-1 http://192.168.0.22:8008/patroni: {"state": "running", "postmaster_start_time": "2023-08-18 02:39:01.706566+00:00", "role": "replica", "server_version": 140008, "xlog": {"received_location": 117440672, "replayed_location": 117440672, "replayed_timestamp": null, "paused": false}, "timeline": 3, "cluster_unlocked": true, "dcs_last_seen": 1692326452, "database_system_identifier": "7268476893860360275", "patroni": {"version": "3.0.2", "scope": "postgres-lyppow-postgresql-patronif34af341"}} 2023-08-18 02:40:52,399 WARNING: Request failed to postgres-lyppow-postgresql-0: GET http://192.168.0.4:8008/patroni (HTTPConnectionPool(host='192.168.0.4', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))) 2023-08-18 02:40:52,452 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'" 2023-08-18 02:40:52,529 INFO: promoted self to leader by acquiring session lock server promoting 2023-08-18 02:40:52,582 INFO: cleared rewind state after becoming the leader 2023-08-18 02:40:52,534 INFO: Lock owner: postgres-lyppow-postgresql-2; I am postgres-lyppow-postgresql-2 2023-08-18 02:40:52,783 INFO: updated leader lock during promote 2023-08-18 02:40:55,582 INFO: no action. 
I am (postgres-lyppow-postgresql-2), the leader with the lock DO NOTICE: role "admin" is already a member of role "cron_admin" GRANT ROLE DO WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. connection to server was lost 2023-08-18 02:42:07,687 ERROR: get_postgresql_status Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0] File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 657, in query return retry(self.server.query, sql, params) File "/usr/local/lib/python3.10/dist-packages/patroni/utils.py", line 334, in call return func(args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 749, in query raise e File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query cursor.execute(sql, params) psycopg2.DatabaseError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

2023-08-18 02:42:07,882 ERROR: get_postgresql_status Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 745, in query cursor.execute(sql, params) psycopg2.OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 673, in get_postgresql_status row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name), retry=retry)[0] File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 655, in query return self.server.query(sql, *params) File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 750, in query raise PostgresConnectionException('connection problems') patroni.exceptions.PostgresConnectionException: 'connection problems' 2023-08-18 02:42:07,882 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-2 2023-08-18 02:42:07,979 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:42:11,102 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:42:11,180 WARNING: Retry got exception: 'connection problems' /var/run/postgresql:5432 - rejecting connections 2023-08-18 02:42:11,283 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 2023-08-18 02:42:11,283 INFO: Demoting self (immediate-nolock) 2023-08-18 02:42:11,884 INFO: demoting self because I do not have the lock and I was a leader 2023-08-18 02:42:11,889 WARNING: Loop time exceeded, rescheduling immediately. 2023-08-18 02:42:11,889 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-2 2023-08-18 02:42:11,889 INFO: starting after demotion in progress 2023-08-18 02:42:11,890 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-2 2023-08-18 02:42:11,890 INFO: starting after demotion in progress 2023-08-18 02:42:12,880 INFO: postmaster pid=275 /var/run/postgresql:5432 - no response 2023-08-18 02:42:12 GMT [275]: [1-1] 64deda84.113 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter... 2023-08-18 02:42:12 GMT [275]: [2-1] 64deda84.113 0 LOG: pg_stat_kcache.linux_hz is set to 1000000 /var/run/postgresql:5432 - no response 2023-08-18 02:42:14 GMT [275]: [3-1] 64deda84.113 0 LOG: redirecting log output to logging collector process 2023-08-18 02:42:14 GMT [275]: [4-1] 64deda84.113 0 HINT: Future log output will appear in directory "../pg_log". 
/var/run/postgresql:5432 - rejecting connections /var/run/postgresql:5432 - rejecting connections /var/run/postgresql:5432 - accepting connections 2023-08-18 02:42:16,010 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-2 2023-08-18 02:42:16,010 INFO: establishing a new patroni connection to the postgres cluster 2023-08-18 02:42:16,091 INFO: Local timeline=4 lsn=0/705B390 2023-08-18 02:42:16,107 INFO: primary_timeline=5 2023-08-18 02:42:16,108 INFO: primary: history=1 0/20000A0 no recovery target specified 2 0/30000A0 no recovery target specified 3 0/70000A0 no recovery target specified 4 0/701FFD8 no recovery target specified 2023-08-18 02:42:16,188 INFO: running pg_rewind from postgres-lyppow-postgresql-1 2023-08-18 02:42:16,795 INFO: running pg_rewind from dbname=postgres user=postgres host=192.168.0.22 port=5432 target_session_attrs=read-write 2023-08-18 02:42:17,894 INFO: pg_rewind exit code=1 2023-08-18 02:42:17,894 INFO: stdout= 2023-08-18 02:42:17,894 INFO: stderr=pg_rewind: servers diverged at WAL location 0/701FFD8 on timeline 4 pg_rewind: rewinding from last common checkpoint at 0/7000140 on timeline 4 WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. pg_rewind: fatal: could not fetch file list: SSL SYSCALL error: EOF detected

2023-08-18 02:42:18,453 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-2 2023-08-18 02:42:18,524 INFO: running pg_rewind from postgres-lyppow-postgresql-1 in progress 2023-08-18 02:42:20,200 ERROR: Exception when working with leader Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/rewind.py", line 69, in check_leader_is_not_in_recovery with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', conn_kwargs) as cur: File "/usr/lib/python3.10/contextlib.py", line 135, in enter return next(self.gen) File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor conn = psycopg.connect(kwargs) File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 42, in connect ret = _connect(*args, kwargs) File "/usr/lib/python3/dist-packages/psycopg2/init.py", line 122, in connect conn = _connect(dsn, connection_factory=connection_factory, kwasync) psycopg2.OperationalError: connection to server at "192.168.0.22", port 5432 failed: timeout expired

2023-08-18 02:42:20,201 WARNING: Failed to rewind because primary postgres-lyppow-postgresql-1 become unreachable 2023-08-18 02:42:20,207 WARNING: Postgresql is not running. 2023-08-18 02:42:20,207 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-2 2023-08-18 02:42:20,208 INFO: pg_controldata: pg_control version number: 1300 Catalog version number: 202107181 Database system identifier: 7268476893860360275 Database cluster state: shut down in recovery pg_control last modified: Fri Aug 18 02:42:16 2023 Latest checkpoint location: 0/7000140 Latest checkpoint's REDO location: 0/70000D0 Latest checkpoint's REDO WAL file: 000000040000000000000007 Latest checkpoint's TimeLineID: 4 Latest checkpoint's PrevTimeLineID: 4 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:1434 Latest checkpoint's NextOID: 16907 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 727 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 1434 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid: 734 Latest checkpoint's newestCommitTsXid: 1433 Time of latest checkpoint: Fri Aug 18 02:40:52 2023 Fake LSN counter for unlogged rels: 0/3E8 Minimum recovery ending location: 0/705B390 Min recovery ending loc's timeline: 4 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: replica wal_log_hints setting: on max_connections setting: 10000 max_worker_processes setting: 8 max_wal_senders setting: 24 max_prepared_xacts setting: 0 max_locks_per_xact setting: 128 track_commit_timestamp setting: on Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float8 argument passing: by value Data page checksum version: 0 Mock authentication nonce: 5b8a0aa36f561fe90d3d3fb50a91ab4fdc066497295008544cfddc95741fe3f8

2023-08-18 02:42:20,209 INFO: Lock owner: postgres-lyppow-postgresql-1; I am postgres-lyppow-postgresql-2
2023-08-18 02:42:20,218 INFO: running pg_rewind from postgres-lyppow-postgresql-1
2023-08-18 02:42:20,397 ERROR: Exception when working with leader
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/rewind.py", line 69, in check_leader_is_not_in_recovery
    with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 42, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "192.168.0.22", port 5432 failed: FATAL: the database system is in recovery mode
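For reference, the Patroni state above can also be inspected directly from one of the pods; the following is a minimal sketch, assuming the PostgreSQL container inside the pod is named `postgresql` (the container name is not shown in the logs, so adjust it as needed):

```
# Show Patroni members, their roles, timelines and lag for this cluster.
# The container name "postgresql" is an assumption; adjust it to the real one.
kubectl exec -it postgres-lyppow-postgresql-1 -c postgresql -- patronictl list
```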


8. describe redis cluster

kubectl describe cluster redis-jnhqlr
Name:         redis-jnhqlr
Namespace:    default
Labels:       clusterdefinition.kubeblocks.io/name=redis
              clusterversion.kubeblocks.io/name=redis-7.0.6
Annotations:  kubeblocks.io/ops-request: [{"name":"redis-jnhqlr-verticalscaling-mgj9k","type":"VerticalScaling"}]
              kubeblocks.io/reconcile: 2023-08-18T02:59:02.14415814Z
API Version:  apps.kubeblocks.io/v1alpha1
Kind:         Cluster
Metadata:
  Creation Timestamp:  2023-08-18T02:01:09Z
  Finalizers:
    cluster.kubeblocks.io/finalizer
  Generation:        5
  Resource Version:  2612001
  UID:               58715f4e-4eb6-42d2-a8fa-cde34a6b8bf7
Spec:
  Affinity:
    Pod Anti Affinity:  Preferred
    Tenancy:            SharedNode
  Cluster Definition Ref:  redis
  Cluster Version Ref:     redis-7.0.6
  Component Specs:
    Class Def Ref:
      Class:
    Component Def Ref:  redis
    Monitor:            false
    Name:               redis
    No Create PDB:      false
    Replicas:           4
    Resources:
      Limits:
        Cpu:     200m
        Memory:  644245094400m
      Requests:
        Cpu:     200m
        Memory:  644245094400m
    Service Account Name:  kb-sa-redis-jnhqlr
    Volume Claim Templates:
      Name:  data
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         20Gi
        Storage Class Name:  alicloud-disk-essd
    Component Def Ref:  redis-sentinel
    Monitor:            false
    Name:               redis-sentinel
    No Create PDB:      false
    Replicas:           3
    Resources:
      Requests:
        Cpu:     100m
        Memory:  512Mi
    Service Account Name:  kb-sa-redis-jnhqlr
    Volume Claim Templates:
      Name:  data
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         20Gi
        Storage Class Name:  alicloud-disk-essd
  Monitor:
  Resources:
    Cpu:     0
    Memory:  0
  Storage:
    Size:  0
  Termination Policy:  DoNotTerminate
Status:
  Cluster Def Generation:  3
  Components:
    Redis:
      Phase:       Updating
      Pods Ready:  false
      Replication Set Status:
        Primary:
          Pod:  redis-jnhqlr-redis-0
        Secondaries:
          Pod:  redis-jnhqlr-redis-3
          Pod:  redis-jnhqlr-redis-1
          Pod:  redis-jnhqlr-redis-2
    Redis - Sentinel:
      Phase:            Running
      Pods Ready:       true
      Pods Ready Time:  2023-08-18T02:20:27Z
  Conditions:
    Last Transition Time:  2023-08-18T02:37:27Z
    Message:               VerticalScaling opsRequest: redis-jnhqlr-verticalscaling-mgj9k is processing
    Reason:                VerticalScaling
    Status:                False
    Type:                  LatestOpsRequestProcessed
    Last Transition Time:  2023-08-18T02:01:09Z
    Message:               The operator has started the provisioning of Cluster: redis-jnhqlr
    Observed Generation:   5
    Reason:                PreCheckSucceed
    Status:                True
    Type:                  ProvisioningStarted
    Last Transition Time:  2023-08-18T02:38:44Z
    Message:               the number of primary pod is not equal to 1, primary pods: [], emptyRole pods: [redis-jnhqlr-redis-0]
    Reason:                ApplyResourcesFailed
    Status:                False
    Type:                  ApplyResources
    Last Transition Time:  2023-08-18T02:37:30Z
    Message:               pods are not ready in Components: [redis], refer to related component message in Cluster.status.components
    Reason:                ReplicasNotReady
    Status:                False
    Type:                  ReplicasReady
    Last Transition Time:  2023-08-18T02:37:30Z
    Message:               pods are unavailable in Components: [redis], refer to related component message in Cluster.status.components
    Reason:                ComponentsNotReady
    Status:                False
    Type:                  Ready
  Observed Generation:  5
  Phase:                Updating
Events:
  Type     Reason     Age    From     Message
  Normal   PreCheckSucceed           58m                   cluster-controller         The operator has started the provisioning of Cluster: redis-jnhqlr
  Normal   ApplyResourcesSucceed     58m                   cluster-controller         Successfully applied for resources
  Normal   AllReplicasReady          57m                   cluster-controller         all pods of components are ready, waiting for the probe detection successful
  Normal   ClusterReady              57m                   cluster-controller         Cluster: redis-jnhqlr is ready, current phase is Running
  Normal   Running                   57m                   cluster-controller         Cluster: redis-jnhqlr is ready, current phase is Running
  Normal   SysAcctCreate             56m                   system-account-controller  Created Accounts for cluster: redis-jnhqlr, component: redis, accounts: kbprobe
  Normal   SysAcctCreate             56m                   system-account-controller  Created Accounts for cluster: redis-jnhqlr, component: redis, accounts: kbmonitoring
  Normal   SysAcctCreate             56m                   system-account-controller  Created Accounts for cluster: redis-jnhqlr, component: redis, accounts: kbadmin
  Normal   SysAcctCreate             56m                   system-account-controller  Created Accounts for cluster: redis-jnhqlr, component: redis, accounts: kbreplicator
  Normal   SysAcctCreate             56m                   system-account-controller  Created Accounts for cluster: redis-jnhqlr, component: redis, accounts: kbdataprotection
  Warning  ApplyResourcesFailed      50m                   cluster-controller         Operation cannot be fulfilled on statefulsets.apps "redis-jnhqlr-redis": the object has been modified; please apply your changes to the latest version and try again
  Normal   WaitingForProbeSuccess    50m (x3 over 50m)     cluster-controller         Waiting for probe success
  Warning  Failed                    50m                   cluster-controller         Cluster: redis-jnhqlr is Failed, check according to the components message
  Warning  ComponentsNotReady        50m                   cluster-controller         pods are unavailable in Components: [redis redis-sentinel], refer to related component message in Cluster.status.components
  Normal   ComponentPhaseTransition  50m (x2 over 50m)     cluster-controller         Running: false, PodsReady: false, PodsTimedout: false
  Warning  ReplicasNotReady          50m                   cluster-controller         pods are not ready in Components: [redis redis-sentinel], refer to related component message in Cluster.status.components
  Warning  ReplicasNotReady          49m (x2 over 50m)     cluster-controller         pods are not ready in Components: [redis-sentinel], refer to related component message in Cluster.status.components
  Warning  ComponentsNotReady        49m (x2 over 50m)     cluster-controller         pods are unavailable in Components: [redis-sentinel], refer to related component message in Cluster.status.components
  Normal   ClusterReady              47m                   cluster-controller         Cluster: redis-jnhqlr is ready, current phase is Running
  Normal   Running                   47m                   cluster-controller         Cluster: redis-jnhqlr is ready, current phase is Running
  Normal   AllReplicasReady          47m                   cluster-controller         all pods of components are ready, waiting for the probe detection successful
  Normal   HorizontalScale           41m (x2 over 41m)     cluster-controller         start horizontal scale component redis of cluster redis-jnhqlr from 1 to 4
  Normal   ApplyResourcesSucceed     41m (x2 over 50m)     cluster-controller         Successfully applied for resources
  Normal   HorizontalScaling         41m                   ops-request-controller     Start to process the HorizontalScaling opsRequest "redis-jnhqlr-horizontalscaling-2trsw" in Cluster: redis-jnhqlr
  Normal   PreCheckSucceed           41m                   cluster-controller         The operator has started the provisioning of Cluster: redis-jnhqlr
  Normal   ComponentPhaseTransition  40m (x4 over 49m)     cluster-controller         Running: true, PodsReady: true, PodsTimedout: false
  Normal   HorizontalScaling         39m                   ops-request-controller     Start to process the HorizontalScaling opsRequest "redis-jnhqlr-horizontalscaling-8d8mq" in Cluster: redis-jnhqlr
  Normal   ComponentPhaseTransition  22m (x6 over 41m)     cluster-controller         Component workload updated
  Normal   VerticalScaling           22m                   ops-request-controller     Start to process the VerticalScaling opsRequest "redis-jnhqlr-verticalscaling-mgj9k" in Cluster: redis-jnhqlr
  Warning  VerticalScaling           22m                   cluster-controller         VerticalScaling opsRequest: redis-jnhqlr-verticalscaling-mgj9k is processing
  Warning  ApplyResourcesFailed      4m26s (x52 over 20m)  cluster-controller         the number of primary pod is not equal to 1, primary pods: [], emptyRole pods: [redis-jnhqlr-redis-0]
  Warning  BackOff                   35s (x9 over 15m)     event-controller           Pod redis-jnhqlr-redis-0: Back-off restarting failed container redis in pod redis-jnhqlr-redis-0_default(cbb7563b-331f-4538-a3e5-b30701543bea)
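The ApplyResourcesFailed condition above reports empty-role pods, which matches the "role is none" symptom in the title. A quick way to check the label directly is sketched below, assuming the role is stored in the `kubeblocks.io/role` pod label and the pods carry the `app.kubernetes.io/instance` label (both keys may differ between KubeBlocks versions):

```
# Print the role label next to each pod of the redis-jnhqlr cluster.
# Label keys are assumptions; adjust them if your KubeBlocks version uses different ones.
kubectl get pods -l app.kubernetes.io/instance=redis-jnhqlr -L kubeblocks.io/role
```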

pod logs
pod 0

kubectl logs redis-jnhqlr-redis-0
Defaulted container "redis" out of: redis, metrics, kb-checkrole
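Since the role is detected by the `kb-checkrole` sidecar listed in the output above, its logs are also relevant; for example:

```
# The default container is "redis"; the role-checking probe runs in the kb-checkrole sidecar
# (container names taken from the "Defaulted container" message above).
kubectl logs redis-jnhqlr-redis-0 -c kb-checkrole --tail=100
```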

Expected behavior
pg/redis cluster works normally after kubeblocks upgrade from 0.5 to 0.6.


Y-Rookie commented 1 year ago

The cluster spec has no switchPolicy=Noop, so the role label won't be patched; see the sketch after the log lines below.

2023-08-18T02:20:02.351Z        INFO    receive role change event       {"controller": "event", "controllerGroup": "", "controllerKind": "Event", "Event": {"name":"postgres-lyppow-postgresql-1.cksgl9tsjs7jk2bh","namespace":"default"}, "namespace": "default", "name": "postgres-lyppow-postgresql-1.cksgl9tsjs7jk2bh", "reconcileID": "647dec39-1ad3-4ca7-96c2-0577f704b960", "event": "default/postgres-lyppow-postgresql-1.cksgl9tsjs7jk2bh", "podName": "postgres-lyppow-postgresql-1", "current pod role label": "secondary", "new role": "secondary"}
2023-08-18T02:20:02.351Z        INFO    cluster switchPolicy is not Noop, does not support handling role change event   {"controller": "event", "controllerGroup": "", "controllerKind": "Event", "Event": {"name":"postgres-lyppow-postgresql-1.cksgl9tsjs7jk2bh","namespace":"default"}, "namespace": "default", "name": "postgres-lyppow-postgresql-1.cksgl9tsjs7jk2bh", "reconcileID": "647dec39-1ad3-4ca7-96c2-0577f704b960", "event": "default/postgres-lyppow-postgresql-1.cksgl9tsjs7jk2bh", "cluster": "postgres-lyppow"}
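If that diagnosis is right, one workaround is to set the component's switchPolicy to Noop so the controller handles the role-change events and patches the label. A minimal sketch, assuming the `apps.kubeblocks.io/v1alpha1` component spec exposes a `switchPolicy.type` field and that `postgresql` is the first entry in `spec.componentSpecs` (verify both before applying):

```
# Add switchPolicy: {type: Noop} to the postgresql component spec.
# The field path and the list index 0 are assumptions; check with
#   kubectl explain cluster.spec.componentSpecs.switchPolicy
kubectl patch cluster postgres-lyppow --type json \
  -p '[{"op": "add", "path": "/spec/componentSpecs/0/switchPolicy", "value": {"type": "Noop"}}]'
```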
ldming commented 1 year ago

The primary of the cluster is OOM-killed: its memory limit is only 0.5Gi, but shared_buffers is 1GB, which is not expected.

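One way to confirm that mismatch is to compare the container memory limit with the configured shared_buffers; a minimal sketch, assuming the PostgreSQL container is named `postgresql` and that `psql` can authenticate locally inside the pod (otherwise the password from the cluster's connection secret is needed):

```
# Memory limit of the PostgreSQL container (the container name is an assumption).
kubectl get pod postgres-lyppow-postgresql-0 \
  -o jsonpath='{.spec.containers[?(@.name=="postgresql")].resources.limits.memory}{"\n"}'
# Effective shared_buffers as seen by the server.
kubectl exec -it postgres-lyppow-postgresql-0 -c postgresql -- psql -U postgres -c 'SHOW shared_buffers;'
```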

JashBook commented 1 year ago

The cluster was created with the cpu/memory set explicitly (see the reproduce steps above).