apecloud / kubeblocks

KubeBlocks is an open-source control plane software that runs and manages databases, message queues and other stateful applications on K8s.
https://kubeblocks.io
GNU Affero General Public License v3.0

[BUG] PG cluster status is stuck in Creating because storage is too small #3532

Closed: ahjing99 closed this 1 year ago

ahjing99 commented 1 year ago

➜  ~ kbcli version
Kubernetes: v1.25.8-gke.500
KubeBlocks: 0.6.0-alpha.9
kbcli: 0.6.0-alpha.9


➜  ~ kbcli cluster create pgcluster1 --termination-policy=Delete --monitor=false --enable-all-logs=false --cluster-definition=postgresql --cluster-version=postgresql-12.14.1 --set cpu=100m,memory=0.5Gi,replicas=1,storage=1Gi --namespace kb

Cluster pgcluster1 created

➜  ~ kbcli cluster list -n kb
NAME            NAMESPACE   CLUSTER-DEFINITION   VERSION              TERMINATION-POLICY   STATUS       CREATED-TIME
mongoc          kb          mongodb              mongodb-5.0.14       Halt                 Restarting   Jun 01,2023 12:04 UTC+0800
pgcluster       kb          postgresql           postgresql-14.7.2    Halt                 Creating     Jun 01,2023 12:04 UTC+0800
pgcluster1      kb          postgresql           postgresql-12.14.1   Delete               Creating     Jun 01,2023 12:15 UTC+0800

➜  ~ k describe cluster pgcluster1 -n kb
Name:         pgcluster1
Namespace:    kb
Labels:       clusterdefinition.kubeblocks.io/name=postgresql
              clusterversion.kubeblocks.io/name=postgresql-12.14.1
Annotations:  kubeblocks.io/reconcile: 2023-06-01T04:17:15.004187769Z
API Version:  apps.kubeblocks.io/v1alpha1
Kind:         Cluster
Metadata:
  Creation Timestamp:  2023-06-01T04:15:59Z
  Finalizers:
    cluster.kubeblocks.io/finalizer
  Generation:  1
  Managed Fields:
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:affinity:
          .:
          f:nodeLabels:
          f:podAntiAffinity:
          f:tenancy:
          f:topologyKeys:
        f:clusterDefinitionRef:
        f:clusterVersionRef:
        f:componentSpecs:
          .:
          k:{"name":"postgresql"}:
            .:
            f:componentDefRef:
            f:monitor:
            f:name:
            f:noCreatePDB:
            f:replicas:
            f:resources:
              .:
              f:limits:
                .:
                f:cpu:
                f:memory:
              f:requests:
                .:
                f:cpu:
                f:memory:
            f:serviceAccountName:
            f:switchPolicy:
              .:
              f:type:
            f:volumeClaimTemplates:
        f:terminationPolicy:
        f:tolerations:
    Manager:      kbcli
    Operation:    Update
    Time:         2023-06-01T04:15:59Z
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:clusterDefGeneration:
        f:components:
          .:
          f:postgresql:
            .:
            f:phase:
            f:podsReady:
            f:replicationSetStatus:
              .:
              f:primary:
                .:
                f:pod:
        f:conditions:
        f:observedGeneration:
        f:phase:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2023-06-01T04:16:00Z
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubeblocks.io/reconcile:
        f:finalizers:
          .:
          v:"cluster.kubeblocks.io/finalizer":
        f:labels:
          .:
          f:clusterdefinition.kubeblocks.io/name:
          f:clusterversion.kubeblocks.io/name:
    Manager:         manager
    Operation:       Update
    Time:            2023-06-01T04:17:15Z
  Resource Version:  2829747
  UID:               d31a3abc-7c6b-419a-ae74-49b0a79e7ae2
Spec:
  Affinity:
    Node Labels:
    Pod Anti Affinity:  Preferred
    Tenancy:            SharedNode
    Topology Keys:
  Cluster Definition Ref:  postgresql
  Cluster Version Ref:     postgresql-12.14.1
  Component Specs:
    Component Def Ref:  postgresql
    Monitor:            false
    Name:               postgresql
    No Create PDB:      false
    Replicas:           1
    Resources:
      Limits:
        Cpu:     100m
        Memory:  512Mi
      Requests:
        Cpu:               100m
        Memory:            512Mi
    Service Account Name:  kb-sa-pgcluster1
    Switch Policy:
      Type:  Noop
    Volume Claim Templates:
      Name:  data
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:   1Gi
  Termination Policy:  Delete
  Tolerations:
Status:
  Cluster Def Generation:  2
  Components:
    Postgresql:
      Phase:       Creating
      Pods Ready:  false
      Replication Set Status:
        Primary:
          Pod:  pgcluster1-postgresql-0
  Conditions:
    Last Transition Time:  2023-06-01T04:15:59Z
    Message:               The operator has started the provisioning of Cluster: pgcluster1
    Observed Generation:   1
    Reason:                PreCheckSucceed
    Status:                True
    Type:                  ProvisioningStarted
    Last Transition Time:  2023-06-01T04:15:59Z
    Message:               Successfully applied for resources
    Observed Generation:   1
    Reason:                ApplyResourcesSucceed
    Status:                True
    Type:                  ApplyResources
    Last Transition Time:  2023-06-01T04:16:00Z
    Message:               pods are not ready in Components: [postgresql], refer to related component message in Cluster.status.components
    Reason:                ReplicasNotReady
    Status:                False
    Type:                  ReplicasReady
    Last Transition Time:  2023-06-01T04:16:00Z
    Message:               pods are unavailable in Components: [postgresql], refer to related component message in Cluster.status.components
    Reason:                ComponentsNotReady
    Status:                False
    Type:                  Ready
  Observed Generation:     1
  Phase:                   Creating
Events:
  Type     Reason                    Age    From                Message
  ----     ------                    ----   ----                -------
  Normal   ComponentPhaseTransition  3m33s  cluster-controller  Create a new component
  Normal   PreCheckSucceed           3m33s  cluster-controller  The operator has started the provisioning of Cluster: pgcluster1
  Normal   ApplyResourcesSucceed     3m33s  cluster-controller  Successfully applied for resources
  Warning  Unhealthy                 2m17s  event-controller    Pod pgcluster1-postgresql-0: Readiness probe failed: 127.0.0.1:5432 - no response

➜  ~ kbcli cluster describe pgcluster1 -n kb
Name: pgcluster1     Created Time: Jun 01,2023 12:15 UTC+0800
NAMESPACE   CLUSTER-DEFINITION   VERSION              STATUS     TERMINATION-POLICY
kb          postgresql           postgresql-12.14.1   Creating   Delete

Endpoints:
COMPONENT    MODE        INTERNAL                                          EXTERNAL
postgresql   ReadWrite   pgcluster1-postgresql.kb.svc.cluster.local:5432   <none>
                         pgcluster1-postgresql.kb.svc.cluster.local:6432

Topology:
COMPONENT    INSTANCE                  ROLE      STATUS    AZ              NODE                                                  CREATED-TIME
postgresql   pgcluster1-postgresql-0   primary   Running   us-central1-c   gke-yjtest-default-pool-c2ee710b-xj1v/10.128.15.208   Jun 01,2023 12:15 UTC+0800

Resources Allocation:
COMPONENT    DEDICATED   CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE-SIZE   STORAGE-CLASS
postgresql   false       100m / 100m          512Mi / 512Mi           data:1Gi       standard-rwo

Images:
COMPONENT    TYPE         IMAGE
postgresql   postgresql   registry.cn-hangzhou.aliyuncs.com/apecloud/spilo:12.14.1

Data Protection:
AUTO-BACKUP   BACKUP-SCHEDULE   TYPE     BACKUP-TTL   LAST-SCHEDULE   RECOVERABLE-TIME
Disabled      <none>            <none>   7d           <none>          <none>

Show cluster events: kbcli cluster list-events -n kb pgcluster1

➜  ~ k get pod -n kb|grep pgcluster
pgcluster-postgresql-0           4/5     Running              0             16m
pgcluster1-postgresql-0          4/5     Running              0             4m58s

➜  ~ k logs pgcluster1-postgresql-0 -n kb
Defaulted container "postgresql" out of: postgresql, pgbouncer, metrics, kb-checkrole, config-manager, pg-init-container (init)
+ KB_PRIMARY_POD_NAME_PREFIX=pgcluster1-postgresql-0
+ '[' pgcluster1-postgresql-0 '!=' pgcluster1-postgresql-0 ']'
+ '[' -f /home/postgres/pgdata/kb_restore/kb_restore.signal ']'
+ python3 /kb-scripts/generate_patroni_yaml.py tmp_patroni.yaml
++ cat tmp_patroni.yaml
+ export 'SPILO_CONFIGURATION=bootstrap:
  initdb:
  - auth-host: md5
  - auth-local: trust
  - wal-segsize: '\''1024'\''
postgresql:
  config_dir: /home/postgres/pgdata/conf
  custom_conf: /home/postgres/conf/postgresql.conf
  pg_hba:
  - host     all             all             0.0.0.0/0                md5
  - host     all             all             ::/0                     md5
  - local    all             all                                     trust
  - host     all             all             127.0.0.1/32            trust
  - host     all             all             ::1/128                 trust
  - local     replication     all                                    trust
  - host      replication     all             0.0.0.0/0               md5
  - host      replication     all             ::/0                    md5'
+ SPILO_CONFIGURATION='bootstrap:
  initdb:
  - auth-host: md5
  - auth-local: trust
  - wal-segsize: '\''1024'\''
postgresql:
  config_dir: /home/postgres/pgdata/conf
  custom_conf: /home/postgres/conf/postgresql.conf
  pg_hba:
  - host     all             all             0.0.0.0/0                md5
  - host     all             all             ::/0                     md5
  - local    all             all                                     trust
  - host     all             all             127.0.0.1/32            trust
  - host     all             all             ::1/128                 trust
  - local     replication     all                                    trust
  - host      replication     all             0.0.0.0/0               md5
  - host      replication     all             ::/0                    md5'
+ exec /launch.sh init
2023-06-01 04:16:20,288 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2023-06-01 04:16:20,386 - bootstrapping - INFO - Looks like you are running google
2023-06-01 04:16:20,588 - bootstrapping - INFO - kubeblocks generate local configuration:
bootstrap:
  dcs:
    postgresql:
      parameters:
        archive_command: /bin/true
        archive_mode: 'on'
        autovacuum_analyze_scale_factor: '0.1'
        autovacuum_max_workers: '3'
        autovacuum_vacuum_scale_factor: '0.05'
        checkpoint_completion_target: '0.9'
        log_autovacuum_min_duration: '10000'
        log_checkpoints: 'True'
        log_connections: 'False'
        log_disconnections: 'False'
        log_min_duration_statement: '1000'
        log_statement: ddl
        log_temp_files: 128kB
        max_connections: '56'
        max_locks_per_transaction: '64'
        max_prepared_transactions: '100'
        max_replication_slots: '16'
        max_wal_senders: '64'
        max_worker_processes: '8'
        tcp_keepalives_idle: 45s
        tcp_keepalives_interval: 10s
        track_commit_timestamp: 'False'
        track_functions: pl
        wal_compression: 'True'
        wal_keep_segments: '4'
        wal_level: replica
        wal_log_hints: 'False'
  initdb:
  - auth-host: md5
  - auth-local: trust
  - wal-segsize: '1024'
postgresql:
  config_dir: /home/postgres/pgdata/conf
  custom_conf: /home/postgres/conf/postgresql.conf
  parameters:
    pg_stat_statements.track_utility: 'False'
    shared_buffers: 128MB
  pg_hba:
  - host     all             all             0.0.0.0/0                md5
  - host     all             all             ::/0                     md5
  - local    all             all                                     trust
  - host     all             all             127.0.0.1/32            trust
  - host     all             all             ::1/128                 trust
  - local     replication     all                                    trust
  - host      replication     all             0.0.0.0/0               md5
  - host      replication     all             ::/0                    md5

2023-06-01 04:16:21,078 - bootstrapping - INFO - Configuring bootstrap
2023-06-01 04:16:21,078 - bootstrapping - INFO - Configuring crontab
2023-06-01 04:16:21,078 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2023-06-01 04:16:21,079 - bootstrapping - INFO - Configuring certificate
2023-06-01 04:16:21,079 - bootstrapping - INFO - Generating ssl self-signed certificate
2023-06-01 04:16:25,781 - bootstrapping - INFO - Configuring pgbouncer
2023-06-01 04:16:25,781 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2023-06-01 04:16:25,782 - bootstrapping - INFO - Configuring standby-cluster
2023-06-01 04:16:25,782 - bootstrapping - INFO - Configuring pgqd
2023-06-01 04:16:25,782 - bootstrapping - INFO - Configuring log
2023-06-01 04:16:25,782 - bootstrapping - INFO - Configuring patroni
2023-06-01 04:16:25,885 - bootstrapping - INFO - Writing to file /run/postgres.yml
2023-06-01 04:16:25,885 - bootstrapping - INFO - Configuring pam-oauth2
2023-06-01 04:16:25,885 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping
2023-06-01 04:16:25,885 - bootstrapping - INFO - Configuring wal-e
2023-06-01 04:16:30,287 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2023-06-01 04:16:30,585 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-06-01 04:16:30,679 INFO: Lock owner: None; I am pgcluster1-postgresql-0
2023-06-01 04:16:30,978 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... 2023-06-01 04:16:41,179 INFO: Lock owner: None; I am pgcluster1-postgresql-0
2023-06-01 04:16:41,179 INFO: not healthy enough for leader race
2023-06-01 04:16:41,582 INFO: bootstrap in progress
2023-06-01 04:16:51,274 INFO: Lock owner: None; I am pgcluster1-postgresql-0
2023-06-01 04:16:51,274 INFO: not healthy enough for leader race
2023-06-01 04:16:51,274 INFO: bootstrap in progress
2023-06-01 04:16:52.680 UTC [211] FATAL:  could not write to file "pg_wal/xlogtemp.211": No space left on device
child process exited with exit code 1
initdb: removing contents of data directory "/home/postgres/pgdata/pgroot/data"
pg_ctl: database system initialization failed
2023-06-01 04:16:52,787 INFO: removing initialize key after failed attempt to bootstrap the cluster
2023-06-01 04:16:52,885 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data.failed
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 144, in main
    return patroni_main()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 136, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 181, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 106, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 126, in run
    self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 109, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1770, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1592, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1483, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1476, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 30 seconds
2023-06-01 04:17:27,190 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2023-06-01 04:17:27,483 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-06-01 04:17:27,578 INFO: Lock owner: None; I am pgcluster1-postgresql-0
2023-06-01 04:17:27,780 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... 2023-06-01 04:17:37,987 INFO: Lock owner: None; I am pgcluster1-postgresql-0
2023-06-01 04:17:37,987 INFO: not healthy enough for leader race
2023-06-01 04:17:38,279 INFO: bootstrap in progress
2023-06-01 04:17:43.079 UTC [620] FATAL:  could not write to file "pg_wal/xlogtemp.620": No space left on device
child process exited with exit code 1
initdb: removing data directory "/home/postgres/pgdata/pgroot/data"
pg_ctl: database system initialization failed
2023-06-01 04:17:43,190 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 144, in main
    return patroni_main()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 136, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 181, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 106, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 126, in run
    self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 109, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1770, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1592, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1483, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1476, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 60 seconds

ahjing99 commented 1 year ago

I tried again with 2Gi of storage, and the cluster still cannot be created. We need to validate the storage size the user requests and refuse to create the cluster if it is below the minimum requirement.

➜  ~ kbcli cluster create pgcluster2 --termination-policy=Halt --monitor=false --enable-all-logs=false --cluster-definition=postgresql --cluster-version=postgresql-12.14.1 --set cpu=100m,memory=0.5Gi,replicas=2,storage=2Gi --namespace kb

Cluster pgcluster2 created

log.txt

Y-Rookie commented 1 year ago

This happens because the PG kernel parameters were adjusted. With the current settings the storage needs to be at least 4G, and validation for this needs to be added later.
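
The validation asked for in this thread could look roughly like the sketch below. It is only an illustration in Python, not KubeBlocks or kbcli code: `parse_quantity`, `check_storage`, and `MIN_PG_STORAGE` are hypothetical names, the parser handles only a few Kubernetes quantity suffixes, and the 4Gi minimum is an assumption that reads the "at least 4G" above as 4Gi.

```python
import re

# Multipliers for the subset of Kubernetes quantity suffixes this sketch handles.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

_QUANTITY_RE = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?$")

# Assumption: "at least 4G" from the comment above, read as 4Gi.
MIN_PG_STORAGE = "4Gi"


def parse_quantity(text: str) -> int:
    """Convert a quantity string such as '1Gi' or '0.5Gi' to bytes."""
    match = _QUANTITY_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {text!r}")
    number, suffix = match.group(1), match.group(2) or ""
    return int(float(number) * _SUFFIXES[suffix])


def check_storage(requested: str, minimum: str = MIN_PG_STORAGE) -> None:
    """Raise ValueError if the requested storage is below the minimum."""
    if parse_quantity(requested) < parse_quantity(minimum):
        raise ValueError(
            f"storage {requested} is below the minimum of {minimum}"
        )


if __name__ == "__main__":
    for size in ("1Gi", "2Gi", "4Gi"):
        try:
            check_storage(size)
            print(f"{size}: ok")
        except ValueError as err:
            print(f"{size}: rejected ({err})")
```

Running the check before the Cluster object is created would reject the 1Gi and 2Gi requests from this issue up front, instead of leaving the cluster in Creating while initdb fails repeatedly.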

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open for 30 days with no activity

Y-Rookie commented 1 year ago

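> This happens because the PG kernel parameters were adjusted. With the current settings the storage needs to be at least 4G, and validation for this needs to be added later.

After reverting the WAL segment size (pg_wal_size, which appears in the log above as the initdb option wal-segsize set to 1024) to the default of 16 MB, cluster creation works normally.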
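
To make the arithmetic behind this concrete, here is a rough Python sketch of how many whole WAL segments fit on a volume for a given segment size. It assumes `wal-segsize` is expressed in megabytes, as initdb documents it. `wal_segments_that_fit` is a hypothetical helper, and the 64 MiB reserved for filesystem metadata and the initial data files is an illustrative guess, not a measured figure. It does not model how many segments initdb or the running server actually needs.

```python
MIB = 2**20
GIB = 2**30


def wal_segments_that_fit(volume_bytes: int, wal_segsize_mb: int,
                          reserved_bytes: int = 0) -> int:
    """Return how many complete WAL segments fit after reserved space."""
    usable = max(volume_bytes - reserved_bytes, 0)
    return usable // (wal_segsize_mb * MIB)


if __name__ == "__main__":
    reserved = 64 * MIB  # illustrative guess, see the note above
    print(wal_segments_that_fit(1 * GIB, 1024, reserved))  # 0
    print(wal_segments_that_fit(2 * GIB, 1024, reserved))  # 1
    print(wal_segments_that_fit(4 * GIB, 1024, reserved))  # 3
    print(wal_segments_that_fit(1 * GIB, 16, reserved))    # 60
```

Under these assumptions a 1Gi volume cannot hold even one 1024 MB segment, which is consistent with the "No space left on device" error on pg_wal/xlogtemp in the log, while the default 16 MB segment size leaves plenty of room on the same volume.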