StackStorm / stackstorm-k8s

K8s Helm Chart that codifies a StackStorm (aka "IFTTT for Ops") High Availability fleet as a simple-to-use, reproducible infrastructure-as-code app
Apache License 2.0

Failed to deploy StackStorm HA with kubeadm #285

Open simonli866 opened 2 years ago

simonli866 commented 2 years ago


Two pods cannot be started and the web interface cannot be accessed, although the console output shows that the installation succeeded.

The error log is below:

[root@centos-master ~]# kubectl logs stackstorm-redis-node-0
error: a container name must be specified for pod stackstorm-redis-node-0, choose one of: [redis sentinel]
[root@centos-master ~]# kubectl logs stackstorm-redis-node-0 -c redis
I am master
redis 02:25:38.52 INFO  ==> ** Starting Redis **
1:C 18 Feb 2022 02:25:38.548 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 18 Feb 2022 02:25:38.548 # Redis version=6.0.9, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 18 Feb 2022 02:25:38.548 # Configuration loaded
1:M 18 Feb 2022 02:25:38.554 * Running mode=standalone, port=6379.
1:M 18 Feb 2022 02:25:38.554 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 18 Feb 2022 02:25:38.554 # Server initialized
1:M 18 Feb 2022 02:25:38.554 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
1:M 18 Feb 2022 02:25:38.560 * Reading RDB preamble from AOF file...
1:M 18 Feb 2022 02:25:38.560 * Loading RDB produced by version 6.0.9
1:M 18 Feb 2022 02:25:38.560 * RDB age 881 seconds
1:M 18 Feb 2022 02:25:38.560 * RDB memory usage when created 1.87 Mb
1:M 18 Feb 2022 02:25:38.560 * RDB has an AOF tail
1:M 18 Feb 2022 02:25:38.560 * Reading the remaining AOF tail...
1:M 18 Feb 2022 02:25:38.561 * DB loaded from append only file: 0.007 seconds
1:M 18 Feb 2022 02:25:38.561 * Ready to accept connections
1:M 18 Feb 2022 02:26:19.887 * Replica asks for synchronization
1:M 18 Feb 2022 02:26:19.887 * Full resync requested by replica
1:M 18 Feb 2022 02:26:19.887 * Replication backlog created, my new replication IDs are '356f8e0eaf71f966ffce779720a7be37b39e79f9' and '0000000000000000000000000000000000000000'
1:M 18 Feb 2022 02:26:19.887 * Starting BGSAVE for SYNC with target: disk
1:M 18 Feb 2022 02:26:19.888 * Background saving started by pid 58
58:C 18 Feb 2022 02:26:19.893 * DB saved on disk
58:C 18 Feb 2022 02:26:19.894 * RDB: 0 MB of memory used by copy-on-write
1:M 18 Feb 2022 02:26:19.984 * Background saving terminated with success
1:M 18 Feb 2022 02:26:19.985 * Synchronization with replica succeeded
1:M 18 Feb 2022 02:27:17.687 * Replica asks for synchronization
1:M 18 Feb 2022 02:27:17.687 * Full resync requested by replica
1:M 18 Feb 2022 02:27:17.687 * Starting BGSAVE for SYNC with target: disk
1:M 18 Feb 2022 02:27:17.688 * Background saving started by pid 169
169:C 18 Feb 2022 02:27:18.486 * DB saved on disk
169:C 18 Feb 2022 02:27:18.490 * RDB: 0 MB of memory used by copy-on-write
1:M 18 Feb 2022 02:27:18.566 * Background saving terminated with success
1:M 18 Feb 2022 02:27:18.576 * Synchronization with replica succeeded
[root@centos-master ~]# kubectl logs stackstorm-redis-node-0 -c sentinel
Could not connect to Redis at Connection refused

arm4b commented 2 years ago

Instead of kubectl logs stackstorm-redis-node-0 -c sentinel, use kubectl logs --previous stackstorm-redis-node-0 -c sentinel. I suspect the most important messages weren't included for the failing container.

It's interesting that redis-node-2 has finally reached its alive and up state, while the others are down. Can you compare them for any differences and anomalies, including logs from the other pods?

Also show the full kubectl describe for the failing pods. kubectl get pv,pvc,sc would help too.

Could you describe the resources (memory/cpu/storage) you have on that K8s cluster?
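
The diagnostics requested above could be gathered in one pass with something like the following. This is only a sketch: the function name collect_pod_diagnostics is hypothetical, it assumes kubectl is on the PATH and pointed at the affected cluster, and kubectl describe nodes is just one assumed way to show node capacity.

```shell
# Hypothetical helper: gather the information requested in this comment
# for a single pod and container. Both names are supplied by the caller.
collect_pod_diagnostics() {
  pod="$1"
  container="$2"

  # Logs from the previous (crashed) instance of the container
  kubectl logs --previous "$pod" -c "$container"

  # Full pod description, including recent events
  kubectl describe pod "$pod"

  # Persistent volumes, claims, and storage classes
  kubectl get pv,pvc,sc

  # Node capacity and allocatable memory/CPU/storage
  kubectl describe nodes
}
```

For the failing container in this issue, it would be called as collect_pod_diagnostics stackstorm-redis-node-0 sentinel.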

simonli866 commented 2 years ago

Sorry, this problem cannot be reproduced every time. I will update this issue when it happens again.

simonli866 commented 2 years ago

The Redis problem has occurred again.

arms11 commented 2 years ago

@ShimingLee is the behavior different or better when Redis is deployed directly from the Bitnami chart, with the bundled Redis disabled in the stackstorm-ha values.yaml? You may have to provide the connection string in st2.conf (via the ConfigMap).
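
A rough sketch of what that suggestion might look like follows. Everything named here is an assumption, not something confirmed in this thread: the function st2_external_redis_values and the file name external-redis.yaml are hypothetical, the redis.enabled and st2.config keys are assumed to match the stackstorm-ha values.yaml in use, and the [coordination] url setting is assumed to be where st2.conf takes the Redis connection string.

```shell
# Hypothetical helper: print a values override that disables the bundled
# Redis and points st2 coordination at an externally deployed Redis.
# The key names (redis.enabled, st2.config) are assumptions; check them
# against the values.yaml of the chart version you deploy.
st2_external_redis_values() {
  redis_url="$1"
  cat <<EOF
redis:
  enabled: false
st2:
  config: |
    [coordination]
    url = ${redis_url}
EOF
}

# Example workflow (not executed here; release names, the service host,
# and PASSWORD are placeholders):
#   helm repo add bitnami https://charts.bitnami.com/bitnami
#   helm install redis-test bitnami/redis
#   st2_external_redis_values "redis://:PASSWORD@redis-test-master:6379" > external-redis.yaml
#   helm upgrade --install stackstorm stackstorm/stackstorm-ha -f external-redis.yaml
```

Deploying Redis as its own release this way also lets it be restarted and inspected without touching the rest of the st2 cluster.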

simonli866 commented 2 years ago

@arms11 I use the Bitnami chart directly. Why not just use values.yaml? Why do I need to configure connection strings?

arm4b commented 2 years ago

Another advantage of what @arms11 suggested is that trying the Redis chart in isolation could help pinpoint the root cause of the issue, so you don't need to re-deploy the st2 cluster every time and can deal with the Redis issue only.

BTW could you provide more info about your K8s environment and resources?