apecloud / kubeblocks

KubeBlocks is an open-source control plane software that runs and manages databases, message queues and other stateful applications on K8s.
https://kubeblocks.io
GNU Affero General Public License v3.0
[BUG] orioledb pod crashes on 0.7/0.8 #6134

Closed: ahjing99 closed this issue 9 months ago

ahjing99 commented 10 months ago

➜ ~ kbcli version
Kubernetes: v1.25.6
KubeBlocks: 0.7.2-beta.29
kbcli: 0.7.2-beta.29

This does not fail on beta.28; it starts failing on beta.29.

https://github.com/apecloud/kubeblocks/actions/runs/7245751906/job/19737439205


`kbcli cluster create etcd-ndhbor --termination-policy=WipeOut --monitoring-interval=0 --cluster-definition=etcd --enable-all-logs=false --set cpu=100m,memory=0.5Gi,replicas=3,storage=1Gi --namespace default`

Info: --cluster-version is not specified, ClusterVersion etcd-v3.5.6 is applied by default
Cluster etcd-ndhbor created

`kbcli addon enable orioledb`

addon.extensions.kubeblocks.io/orioledb enabled

`kbcli cluster create orioledb-ndhbor --termination-policy=Halt --monitoring-interval=0 --cluster-definition=orioledb --enable-all-logs=false --cluster-version=orioledb-beta1 --set cpu=100m,memory=0.5Gi,replicas=1,storage=1Gi --service-reference name=etcdService,cluster=etcd-ndhbor --namespace default`

Cluster orioledb-ndhbor created

k get pod -n default
NAME                         READY   STATUS             RESTARTS      AGE
etcd-ndhbor-etcd-0           3/3     Running            0             9m49s
etcd-ndhbor-etcd-1           3/3     Running            0             9m49s
etcd-ndhbor-etcd-2           3/3     Running            0             9m49s
orioledb-ndhbor-orioledb-0   3/5     CrashLoopBackOff   6 (58s ago)   8m12s

➜  ~ k logs orioledb-ndhbor-orioledb-0 -n default -f
Defaulted container "postgresql" out of: postgresql, pgbouncer, metrics, kb-checkrole, config-manager
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
sh: locale: not found
2023-12-18 11:11:32.305 UTC [37] WARNING:  no usable system locales were found
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start.....2023-12-18 11:11:46.301 UTC [66] LOG:  OrioleDB public beta 1 started
2023-12-18 11:11:46.302 UTC [66] LOG:  starting PostgreSQL 14.7 OrioleDB public beta 1 PGTAG=patches14_14 alpine:3.17+clang build:2023-08-23T04:00:15+00:00 on x86_64-pc-linux-musl, compiled by Alpine clang version 15.0.7, 64-bit
2023-12-18 11:11:46.307 UTC [66] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-12-18 11:11:46.321 UTC [79] LOG:  database system was shut down at 2023-12-18 11:11:44 UTC
2023-12-18 11:11:46.322 UTC [80] LOG:  orioledb background writer started
2023-12-18 11:11:46.406 UTC [66] LOG:  database system is ready to accept connections
 done
server started

/usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/init.sql
CREATE EXTENSION

waiting for server to shut down....2023-12-18 11:11:47.705 UTC [66] LOG:  received fast shutdown request
2023-12-18 11:11:47.709 UTC [66] LOG:  aborting any active transactions
2023-12-18 11:11:47.709 UTC [80] LOG:  orioledb bgwriter is shut down
2023-12-18 11:11:47.799 UTC [66] LOG:  background worker "logical replication launcher" (PID 86) exited with exit code 1
2023-12-18 11:11:47.800 UTC [81] LOG:  shutting down
2023-12-18 11:11:47.804 UTC [81] LOG:  orioledb checkpoint 1 started
2023-12-18 11:11:48.111 UTC [81] LOG:  orioledb checkpoint 1 complete
2023-12-18 11:11:48.305 UTC [66] LOG:  database system is shut down
 done
server stopped

PostgreSQL init process complete; ready for start up.

2023-12-18 11:12:01,694 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:01,694 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:01,694 INFO: waiting on etcd
2023-12-18 11:12:06,858 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:06,859 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:06,859 INFO: waiting on etcd
2023-12-18 11:12:11,866 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:11,867 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:11,867 INFO: waiting on etcd
2023-12-18 11:12:16,875 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:16,876 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:16,876 INFO: waiting on etcd
2023-12-18 11:12:21,885 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:21,885 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:21,886 INFO: waiting on etcd
2023-12-18 11:12:26,895 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:26,895 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:26,895 INFO: waiting on etcd
2023-12-18 11:12:31,904 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:31,905 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:31,905 INFO: waiting on etcd
2023-12-18 11:12:37,054 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:37,055 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:37,055 INFO: waiting on etcd
2023-12-18 11:12:42,202 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:42,206 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:42,206 INFO: waiting on etcd
2023-12-18 11:12:47,214 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:47,215 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:47,215 INFO: waiting on etcd
2023-12-18 11:12:52,224 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:52,224 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:52,224 INFO: waiting on etcd
2023-12-18 11:12:57,233 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:12:57,234 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:12:57,234 INFO: waiting on etcd
2023-12-18 11:13:02,243 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:02,243 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:02,243 INFO: waiting on etcd
2023-12-18 11:13:07,252 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:07,253 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:07,253 INFO: waiting on etcd
2023-12-18 11:13:12,397 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:12,398 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:12,398 INFO: waiting on etcd
2023-12-18 11:13:17,558 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:17,558 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:17,558 INFO: waiting on etcd
2023-12-18 11:13:22,566 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:22,567 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:22,567 INFO: waiting on etcd
2023-12-18 11:13:27,574 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:27,575 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:27,575 INFO: waiting on etcd
2023-12-18 11:13:32,584 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:32,584 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:32,584 INFO: waiting on etcd
2023-12-18 11:13:37,590 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:37,591 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:37,591 INFO: waiting on etcd
2023-12-18 11:13:42,600 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:42,600 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:42,600 INFO: waiting on etcd
2023-12-18 11:13:47,768 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:47,769 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:47,769 INFO: waiting on etcd
2023-12-18 11:13:52,777 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:52,777 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:52,778 INFO: waiting on etcd
2023-12-18 11:13:57,786 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:13:57,787 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:13:57,787 INFO: waiting on etcd
2023-12-18 11:14:02,796 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:02,796 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:02,797 INFO: waiting on etcd
2023-12-18 11:14:07,802 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:07,803 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:07,803 INFO: waiting on etcd
2023-12-18 11:14:12,811 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:12,811 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:12,812 INFO: waiting on etcd
2023-12-18 11:14:17,978 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:17,979 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:17,979 INFO: waiting on etcd
2023-12-18 11:14:23,149 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:23,150 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:23,150 INFO: waiting on etcd
2023-12-18 11:14:28,158 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:28,159 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:28,159 INFO: waiting on etcd
2023-12-18 11:14:33,166 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:33,167 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:33,167 INFO: waiting on etcd
2023-12-18 11:14:38,176 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:38,176 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:38,176 INFO: waiting on etcd
2023-12-18 11:14:43,182 WARNING: failed to resolve host etcd-ndhbor-etcd.default.svc:2379: [Errno -2] Name does not resolve
2023-12-18 11:14:43,182 ERROR: Failed to get list of machines from http://[etcd-ndhbor-etcd.default.svc:2379]:2379/v3beta: LocationParseError('Failed to parse: http://[etcd-ndhbor-etcd.default.svc:2379]:2379/version')
2023-12-18 11:14:43,182 INFO: waiting on etcd
failed to create fsnotify watcher: too many open files
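
Separately from the final fsnotify error, the repeated `LocationParseError` lines show the etcd endpoint `etcd-ndhbor-etcd.default.svc:2379` being handled as if it were a bare host: it appears wrapped in brackets with a second `:2379` appended. The following is a minimal Python sketch of how a URL of that shape can arise and why the bracketed part is not a valid IPv6 literal. `build_url` and `bracketed_host_is_ipv6` are hypothetical helpers written for illustration; they are not code from KubeBlocks or from the client that emitted the log, and the bracket-wrapping rule is an assumption inferred from the log output.

```python
import ipaddress


def build_url(host: str, port: int, path: str = "/version") -> str:
    """Hypothetical helper: join host and port the way the logged URL looks.

    Assumption: a host containing ':' is taken for an IPv6 literal and
    wrapped in brackets before the port is appended.
    """
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"
    return f"http://{host}:{port}{path}"


def bracketed_host_is_ipv6(url: str) -> bool:
    """Return True only if the text between '[' and ']' parses as IPv6."""
    start, end = url.find("["), url.find("]")
    if start == -1 or end == -1 or end < start:
        return False
    try:
        ipaddress.IPv6Address(url[start + 1:end])
        return True
    except ValueError:
        return False


# Endpoint that already carries a port, as seen in the log.
bad = build_url("etcd-ndhbor-etcd.default.svc:2379", 2379)
# Same endpoint passed as a bare host name.
good = build_url("etcd-ndhbor-etcd.default.svc", 2379)

print(bad)
print(bracketed_host_is_ipv6(bad))
print(good)
```

If this is what happens here, supplying the host name and port separately would avoid the malformed URL. That is a reading of the log, not a confirmed cause.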

1aal commented 10 months ago

The key error message is `failed to create fsnotify watcher: too many open files`. I tried it in my local k3d cluster and it works fine. The crash seems to be caused by the limit set by the fs.inotify.max_user_instances parameter in the IDC. I will try it in the IDC.
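
To check this hypothesis on a node, the current inotify limits can be read from procfs. The following is a minimal Python sketch, assuming a Linux host that exposes these sysctls under `/proc/sys/fs/inotify`. `read_inotify_limits` is a hypothetical helper name, and its `root` parameter exists only so the function can be pointed at another directory.

```python
from pathlib import Path

INOTIFY_KEYS = ("max_user_instances", "max_user_watches", "max_queued_events")


def read_inotify_limits(root: str = "/proc/sys/fs/inotify") -> dict:
    """Return the inotify sysctl values found under `root`.

    Keys whose file is missing or unreadable map to None, so the
    function also runs on hosts without these sysctls.
    """
    limits = {}
    for key in INOTIFY_KEYS:
        try:
            limits[key] = int((Path(root) / key).read_text().strip())
        except (OSError, ValueError):
            limits[key] = None
    return limits


if __name__ == "__main__":
    for name, value in read_inotify_limits().items():
        print(f"fs.inotify.{name} = {value}")
```

A low `max_user_instances` value relative to the number of processes creating watchers on the node would be consistent with the `too many open files` error. The value can be compared between the k3d cluster and the IDC nodes before changing anything with `sysctl`.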

ahjing99 commented 9 months ago

0.8 has the same problem.