emqx / emqx-operator

A Kubernetes Operator for EMQX
https://www.emqx.com
Apache License 2.0

Listener(load balancer) is losing - has not ready endpoints (addresses) after emqx pod restart #996

Closed anzerozman closed 10 months ago

anzerozman commented 10 months ago

Description of the bug: After upgrading the EMQX operator and the EMQX image to the newest version (from apiVersion apps.emqx.io/v2alpha1 to apps.emqx.io/v2beta1), pods are sometimes no longer bound to the LoadBalancer service. It happens when an EMQX pod is restarted for some reason after deployment. The EMQX pod is up and running after the restart, but the StatefulSet shows a pending status for that pod. The listeners (LoadBalancer) service no longer has an endpoint for that pod after the restart.

To Reproduce: This can easily be reproduced on minikube with emqx-operator version 2.2.x (tested on 2.2.4 - 2.2.10) and EMQX version 5.3.2 or later, for example by manually deleting an EMQX pod or by scaling the StatefulSet down and up. The pod is up and running, but the StatefulSet shows a pending status for that pod forever.

This is not the case in previous versions (apiVersion: apps.emqx.io/v2alpha1; operator version 2.1.2, EMQX version 5.0.24): pods are bound to the LoadBalancer without any issue if you, for example, manually delete one of the EMQX pods or scale the StatefulSet down and up. The listeners (LoadBalancer) service is bound to the pod correctly.

The only way to fix that issue is destroying/recreating emqx pods.

Environment details:

Thank you for your response, Anže.

Rory-Z commented 10 months ago

Hi @anzerozman, I'm sorry, I don't understand. You said that after manually deleting an EMQX pod or scaling the StatefulSet, the pod is up and running but the StatefulSet shows it as pending forever, and you also said that the only way to fix the issue is destroying/recreating the EMQX pods. This sounds contradictory.

I deployed an EMQX CR in my minikube cluster, manually deleted one pod, and waited for it to be recreated; everything looks good.

PS: I don't recommend scaling the StatefulSet directly. The EMQX operator will revert the change. You should modify the EMQX CR instead.
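
As a rough illustration of changing the replica count through the EMQX CR rather than the StatefulSet, the Python sketch below builds a JSON merge patch and the matching kubectl patch command line. It is only a sketch under assumptions: the field path spec.coreTemplate.spec.replicas is taken from the CR dump posted later in this thread, the CR name test-emqx is the one used there, and the helper names replicas_patch and patch_command are hypothetical.

```python
import json
import shlex


def replicas_patch(replicas: int) -> str:
    """Build a JSON merge patch that sets the core replica count on the EMQX CR."""
    # Field path assumed from the v2beta1 CR shown in this thread.
    patch = {"spec": {"coreTemplate": {"spec": {"replicas": replicas}}}}
    return json.dumps(patch, separators=(",", ":"))


def patch_command(name: str, replicas: int) -> str:
    """Return a shell-safe kubectl command line that applies the patch."""
    return "kubectl patch emqx {} --type merge -p {}".format(
        shlex.quote(name), shlex.quote(replicas_patch(replicas))
    )


# Prints the command; it does not contact any cluster.
print(patch_command("test-emqx", 3))
```

The command is printed rather than executed, so it can be reviewed before being run against a cluster.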

anzerozman commented 10 months ago

Yes, after deletion the EMQX pod is successfully recreated in both cases (in version 2.1.2 and also in, e.g., version 2.2.5). Everything looks fine from the pods' perspective. I have an example with 2 replicas. In version 2.1.2, both pods are correctly bound to the load balancer with endpoints after I delete one of them. But if you try to reproduce it with a newer version of the operator (v2beta1, 2.2.x), after deleting one pod, the pod is successfully recreated but the StatefulSet has only 1/2 pods ready. And if you look at the listeners (LoadBalancer) service, only the endpoint of the ready pod is associated (the recreated pod's endpoint is no longer exposed).

Rory-Z commented 10 months ago

In the EMQX pods, I set readiness gates, and the EMQX operator controller checks whether each pod has already joined the EMQX cluster; once a pod has joined, it becomes ready. So if a pod cannot become ready, I think there are two possibilities:

  1. The EMQX operator is not working, so pods cannot become ready.
  2. The EMQX node in the pod has not joined the EMQX cluster; you can run emqx ctl cluster status in this pod to check it.

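
To make the readiness-gate point concrete: Kubernetes marks a pod Ready only when its containers are ready and every readiness gate in the pod spec has a condition with status True. The Python sketch below only illustrates that rule and is not operator code. The function name and the hand-written condition dictionary are hypothetical, and the gate type apps.emqx.io/on-serving is the one that appears in the kubectl describe pod output later in this thread.

```python
from typing import Dict, List


def blocking_conditions(conditions: Dict[str, str], readiness_gates: List[str]) -> List[str]:
    """Return the condition types that keep a pod from being Ready.

    A pod is Ready only when ContainersReady is True and every readiness
    gate has a matching condition set to True. A gate with no matching
    condition counts as not satisfied.
    """
    required = ["ContainersReady"] + list(readiness_gates)
    return [ctype for ctype in required if conditions.get(ctype) != "True"]


# Condition values written by hand for a pod whose containers are ready
# but whose gate has not been set to True.
stuck_pod = {
    "apps.emqx.io/on-serving": "False",
    "Initialized": "True",
    "ContainersReady": "True",
    "PodScheduled": "True",
}
gates = ["apps.emqx.io/on-serving"]

print(blocking_conditions(stuck_pod, gates))  # ['apps.emqx.io/on-serving']
```

An empty result means nothing on the gate side is holding the pod back; a non-empty result names the condition that some controller still has to set.
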
anzerozman commented 10 months ago

I ran it from one of the pods:

emqx@test-emqx-core-57bcb74d8d-1:/opt/emqx$ emqx ctl cluster status
Cluster status: #{running_nodes =>
                      ['emqx@test-emqx-core-57bcb74d8d-0.test-emqx-headless.default.svc.cluster.local',
                       'emqx@test-emqx-core-57bcb74d8d-1.test-emqx-headless.default.svc.cluster.local'],
                  stopped_nodes => []}

but "kubectl get statefulsets" says:

NAME                        READY   AGE
test-emqx-core-57bcb74d8d   1/2     98m

And there is only one endpoint visible from:

kubectl describe service test-emqx-listeners
Name:                     test-emqx-listeners
Namespace:                default
Labels:                   apps.emqx.io/instance=test-emqx
                          apps.emqx.io/managed-by=emqx-operator
Annotations:              apps.emqx.io/last-applied: UEsDBBQACAAIAAAAAAAAAAAAAAAAAAAAAAAIAAAAb3JpZ2luYWykVFtv8zYM/S98tjzLyb6ketsVGLBLtmXF0GYPskwHQmTJleSubeD/PtCXxMnSYfj6JpM0eXh4yCPIRt+jD9pZEP...
                          service.beta.kubernetes.io/azure-load-balancer-resource-group: anze-test
Selector:                 apps.emqx.io/db-role=core,apps.emqx.io/instance=test-emqx,apps.emqx.io/managed-by=emqx-operator
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.108.147.230
IPs:                      10.108.147.230
IP:                       127.0.0.1
LoadBalancer Ingress:     127.0.0.1
Port:                     tcp-mqtt  1883/TCP
TargetPort:               1883/TCP
NodePort:                 tcp-mqtt  32472/TCP
Endpoints:                10.244.0.79:1883
Session Affinity:         None
External Traffic Policy:  Cluster
Events:

Br, Anze.
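
The single endpoint is consistent with how a Service normally works: by default it publishes only pods that match its selector and are Ready. The Python sketch below mimics that selection on hand-written data. The function name and the pod dictionaries are hypothetical; the selector, port, and pod IPs are copied from the kubectl output in this thread, and the ready flags reflect the Ready conditions shown in the pod descriptions further down.

```python
from typing import Dict, List


def ready_endpoints(pods: List[dict], selector: Dict[str, str], port: int) -> List[str]:
    """List "ip:port" for pods that match the selector and are Ready."""
    result = []
    for pod in pods:
        labels = pod["labels"]
        matches = all(labels.get(key) == value for key, value in selector.items())
        if matches and pod["ready"]:
            result.append(f"{pod['ip']}:{port}")
    return result


selector = {
    "apps.emqx.io/db-role": "core",
    "apps.emqx.io/instance": "test-emqx",
    "apps.emqx.io/managed-by": "emqx-operator",
}
# Both pods carry the same labels; only the Ready flag differs.
pods = [
    {"ip": "10.244.0.79", "ready": True, "labels": dict(selector)},
    {"ip": "10.244.0.81", "ready": False, "labels": dict(selector)},
]

print(ready_endpoints(pods, selector, 1883))  # ['10.244.0.79:1883']
```

In this model the recreated pod is left out only because of its Ready flag, which points the investigation at whatever sets the pod's Ready condition rather than at the Service itself.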

Rory-Z commented 10 months ago

Is the EMQX operator running? And could you please show the status of those two EMQX pods?

anzerozman commented 10 months ago

kubectl get pods

NAME                                                     READY   STATUS    RESTARTS   AGE
emqx5-emqx-operator-controller-manager-94b888f65-ppxrr   1/1     Running   0          158m
test-emqx-core-57bcb74d8d-0                              1/1     Running   0          155m
test-emqx-core-57bcb74d8d-1                              1/1     Running   0          153m

POD0:

kubectl describe pod test-emqx-core-57bcb74d8d-0
Name:         test-emqx-core-57bcb74d8d-0
Namespace:    default
Priority:     0
Node:         minikube/192.168.49.2
Start Time:   Tue, 09 Jan 2024 12:42:12 +0100
Labels:       apps.emqx.io/db-role=core
              apps.emqx.io/instance=test-emqx
              apps.emqx.io/managed-by=emqx-operator
              apps.emqx.io/pod-template-hash=57bcb74d8d
              controller-revision-hash=test-emqx-core-57bcb74d8d-86f944c4f
              statefulset.kubernetes.io/pod-name=test-emqx-core-57bcb74d8d-0
Annotations:  <none>
Status:       Running
IP:           10.244.0.79
IPs:
  IP:           10.244.0.79
Controlled By:  StatefulSet/test-emqx-core-57bcb74d8d
Containers:
  emqx:
    Container ID:   docker://fb861131f9c84be7670a8613ee05f6490d0f6fd9279276eb1757172c51191324
    Image:          emqx/emqx:5.3.2
    Image ID:       docker-pullable://emqx/emqx@sha256:858305e7b0b33b28abbc29bb0063b193e9b127224ffc284a36584f910cf699d0
    Port:           18083/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 09 Jan 2024 12:42:13 +0100
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:dashboard/status delay=60s timeout=1s period=30s #success=1 #failure=3
    Readiness:      http-get http://:dashboard/status delay=10s timeout=1s period=5s #success=1 #failure=12
    Environment:
      EMQX_DASHBOARD__LISTENERS__HTTP__BIND:                         18083
      POD_NAME:                                                      test-emqx-core-57bcb74d8d-0 (v1:metadata.name)
      EMQX_CLUSTER__DISCOVERY_STRATEGY:                              dns
      EMQX_CLUSTER__DNS__RECORD_TYPE:                                srv
      EMQX_CLUSTER__DNS__NAME:                                       test-emqx-headless.default.svc.cluster.local
      EMQX_HOST:                                                     $(POD_NAME).$(EMQX_CLUSTER__DNS__NAME)
      EMQX_NODE__DATA_DIR:                                           data
      EMQX_NODE__ROLE:                                               core
      EMQX_NODE__COOKIE:                                             <set to the key 'node_cookie' in secret 'test-emqx-node-cookie'>  Optional: false
      EMQX_API_KEY__BOOTSTRAP_FILE:                                  "/opt/emqx/data/bootstrap_api_key"
      EMQX_DASHBOARD__DEFAULT_USERNAME:                              test
      EMQX_DASHBOARD__DEFAULT_PASSWORD:                              test
      EMQX_LISTENERS__WS__DEFAULT__ENABLE:                           false
      EMQX_LISTENERS__WSS__DEFAULT__ENABLE:                          false
      EMQX_AUTHENTICATION__1__MECHANISM:                             password_based
      EMQX_AUTHENTICATION__1__BACKEND:                               built_in_database
      EMQX_AUTHENTICATION__1__PASSWORD_HASH_ALGORITHM__NAME:         bcrypt
      EMQX_AUTHENTICATION__1__PASSWORD_HASH_ALGORITHM__SALT_ROUNDS:  12
      EMQX_AUTHENTICATION__2__MECHANISM:                             jwt
      EMQX_AUTHENTICATION__2__USE_JWKS:                              false
      EMQX_AUTHENTICATION__2__ALGORITHM:                             hmac-based
      EMQX_AUTHENTICATION__2__SECRET:                                test
      EMQX_TELEMETRY__ENABLE:                                        false
      EMQX_AUTHENTICATION__2__VERIFY_CLAIMS:                         {edge_node_id: "${username}"}
      EMQX_CLUSTER__DISCOVERY_STRATEGY:                              dns
      EMQX_CLUSTER__DNS__RECORD_TYPE:                                srv
      EMQX_SYSMON__VM__LONG_SCHEDULE:                                disabled
      EMQX_LISTENERS__TCP__DEFAULT__ENABLE:                          false
      EMQX_LISTENERS__SSL__DEFAULT__ENABLE:                          false
      EMQX_LISTENERS__TCP__MQTT__BIND:                               "0.0.0.0:1883"
    Mounts:
      /opt/emqx/data from test-emqx-core-data (rw)
      /opt/emqx/data/bootstrap_api_key from bootstrap-api-key (ro,path="bootstrap_api_key")
      /opt/emqx/etc/emqx.conf from bootstrap-config (ro,path="emqx.conf")
      /opt/emqx/log from test-emqx-core-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nbg86 (ro)
Readiness Gates:
  Type                      Status
  apps.emqx.io/on-serving   True 
Conditions:
  Type                      Status
  apps.emqx.io/on-serving   True 
  Initialized               True 
  Ready                     True 
  ContainersReady           True 
  PodScheduled              True 
Volumes:
  test-emqx-core-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  test-emqx-core-data-test-emqx-core-57bcb74d8d-0
    ReadOnly:   false
  bootstrap-api-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-emqx-bootstrap-api-key
    Optional:    false
  bootstrap-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      test-emqx-configs
    Optional:  false
  test-emqx-core-log:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-nbg86:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

POD1:

kubectl describe pod test-emqx-core-57bcb74d8d-1
Name:         test-emqx-core-57bcb74d8d-1
Namespace:    default
Priority:     0
Node:         minikube/192.168.49.2
Start Time:   Tue, 09 Jan 2024 12:44:01 +0100
Labels:       apps.emqx.io/db-role=core
              apps.emqx.io/instance=test-emqx
              apps.emqx.io/managed-by=emqx-operator
              apps.emqx.io/pod-template-hash=57bcb74d8d
              controller-revision-hash=test-emqx-core-57bcb74d8d-86f944c4f
              statefulset.kubernetes.io/pod-name=test-emqx-core-57bcb74d8d-1
Annotations:  <none>
Status:       Running
IP:           10.244.0.81
IPs:
  IP:           10.244.0.81
Controlled By:  StatefulSet/test-emqx-core-57bcb74d8d
Containers:
  emqx:
    Container ID:   docker://eb10e152c397795ee95b4b4e8292b9ae3ee190561e6a5e2b166904790d125f34
    Image:          emqx/emqx:5.3.2
    Image ID:       docker-pullable://emqx/emqx@sha256:858305e7b0b33b28abbc29bb0063b193e9b127224ffc284a36584f910cf699d0
    Port:           18083/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 09 Jan 2024 12:44:02 +0100
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:dashboard/status delay=60s timeout=1s period=30s #success=1 #failure=3
    Readiness:      http-get http://:dashboard/status delay=10s timeout=1s period=5s #success=1 #failure=12
    Environment:
      EMQX_DASHBOARD__LISTENERS__HTTP__BIND:                         18083
      POD_NAME:                                                      test-emqx-core-57bcb74d8d-1 (v1:metadata.name)
      EMQX_CLUSTER__DISCOVERY_STRATEGY:                              dns
      EMQX_CLUSTER__DNS__RECORD_TYPE:                                srv
      EMQX_CLUSTER__DNS__NAME:                                       test-emqx-headless.default.svc.cluster.local
      EMQX_HOST:                                                     $(POD_NAME).$(EMQX_CLUSTER__DNS__NAME)
      EMQX_NODE__DATA_DIR:                                           data
      EMQX_NODE__ROLE:                                               core
      EMQX_NODE__COOKIE:                                             <set to the key 'node_cookie' in secret 'test-emqx-node-cookie'>  Optional: false
      EMQX_API_KEY__BOOTSTRAP_FILE:                                  "/opt/emqx/data/bootstrap_api_key"
      EMQX_DASHBOARD__DEFAULT_USERNAME:                              test
      EMQX_DASHBOARD__DEFAULT_PASSWORD:                              test
      EMQX_LISTENERS__WS__DEFAULT__ENABLE:                           false
      EMQX_LISTENERS__WSS__DEFAULT__ENABLE:                          false
      EMQX_AUTHENTICATION__1__MECHANISM:                             password_based
      EMQX_AUTHENTICATION__1__BACKEND:                               built_in_database
      EMQX_AUTHENTICATION__1__PASSWORD_HASH_ALGORITHM__NAME:         bcrypt
      EMQX_AUTHENTICATION__1__PASSWORD_HASH_ALGORITHM__SALT_ROUNDS:  12
      EMQX_AUTHENTICATION__2__MECHANISM:                             jwt
      EMQX_AUTHENTICATION__2__USE_JWKS:                              false
      EMQX_AUTHENTICATION__2__ALGORITHM:                             hmac-based
      EMQX_AUTHENTICATION__2__SECRET:                                test
      EMQX_TELEMETRY__ENABLE:                                        false
      EMQX_AUTHENTICATION__2__VERIFY_CLAIMS:                         {edge_node_id: "${username}"}
      EMQX_CLUSTER__DISCOVERY_STRATEGY:                              dns
      EMQX_CLUSTER__DNS__RECORD_TYPE:                                srv
      EMQX_SYSMON__VM__LONG_SCHEDULE:                                disabled
      EMQX_LISTENERS__TCP__DEFAULT__ENABLE:                          false
      EMQX_LISTENERS__SSL__DEFAULT__ENABLE:                          false
      EMQX_LISTENERS__TCP__MQTT__BIND:                               "0.0.0.0:1883"
    Mounts:
      /opt/emqx/data from test-emqx-core-data (rw)
      /opt/emqx/data/bootstrap_api_key from bootstrap-api-key (ro,path="bootstrap_api_key")
      /opt/emqx/etc/emqx.conf from bootstrap-config (ro,path="emqx.conf")
      /opt/emqx/log from test-emqx-core-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hlgvr (ro)
Readiness Gates:
  Type                      Status
  apps.emqx.io/on-serving   False 
Conditions:
  Type                      Status
  apps.emqx.io/on-serving   False 
  Initialized               True 
  Ready                     False 
  ContainersReady           True 
  PodScheduled              True 
Volumes:
  test-emqx-core-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  test-emqx-core-data-test-emqx-core-57bcb74d8d-1
    ReadOnly:   false
  bootstrap-api-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-emqx-bootstrap-api-key
    Optional:    false
  bootstrap-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      test-emqx-configs
    Optional:  false
  test-emqx-core-log:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-hlgvr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Rory-Z commented 10 months ago

OK, please share the EMQX operator log and the EMQX custom resource status. You can also use Markdown to format your content.

anzerozman commented 10 months ago

Hi,

Please find the logs attached. It seems there was a restart during the night, so I can reproduce it once more and send logs.

Thank you, Anže

emqx-1.log emqx-0.log emqx-operator-controller-manager.log

Rory-Z commented 10 months ago

I found this message in the EMQX operator log: etcdserver: request timed out. Maybe that is the reason why the EMQX operator is not working normally.

anzerozman commented 10 months ago

Hi,

I reproduced it. Please find the logs attached. The error you mention is probably not related (my PC went into sleep mode before that log entry...).

kubectl get statefulsets                                                   
NAME                        READY   AGE
test-emqx-core-57bcb74d8d   1/2     140m

operator-manager.log emqx0.log emqx1.log

Rory-Z commented 10 months ago

Could you please enable debug logging for the EMQX operator, retry, and share the operator's debug log? You can set development = true in the Helm chart values to enable debug logging.

And when the issue happens, please share the EMQX custom resource; you can run kubectl get emqx $name -o json

anzerozman commented 10 months ago

Hi,

Thanks for the response. Please find the logs attached (I do not see any new log entries in the operator after the pod was recreated).

kubectl get emqx $name -o json         

{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "apps.emqx.io/v2beta1",
            "kind": "EMQX",
            "metadata": {
                "annotations": {
                    "apps.emqx.io/last-emqx-configuration": ""
                },
                "creationTimestamp": "2024-01-10T14:29:29Z",
                "generation": 2,
                "name": "test-emqx",
                "namespace": "default",
                "resourceVersion": "905163",
                "uid": "90bfdbd3-3450-48d9-b845-193e37e8ccfd"
            },
            "spec": {
                "clusterDomain": "cluster.local",
                "config": {
                    "mode": "Merge"
                },
                "coreTemplate": {
                    "metadata": {},
                    "spec": {
                        "containerSecurityContext": {
                            "runAsGroup": 1000,
                            "runAsNonRoot": true,
                            "runAsUser": 1000
                        },
                        "env": [
                            {
                                "name": "EMQX_DASHBOARD__DEFAULT_USERNAME",
                                "value": "test"
                            },
                            {
                                "name": "EMQX_DASHBOARD__DEFAULT_PASSWORD",
                                "value": "test"
                            },
                            {
                                "name": "EMQX_LISTENERS__WS__DEFAULT__ENABLE",
                                "value": "false"
                            },
                            {
                                "name": "EMQX_LISTENERS__WSS__DEFAULT__ENABLE",
                                "value": "false"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__1__MECHANISM",
                                "value": "password_based"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__1__BACKEND",
                                "value": "built_in_database"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__1__PASSWORD_HASH_ALGORITHM__NAME",
                                "value": "bcrypt"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__1__PASSWORD_HASH_ALGORITHM__SALT_ROUNDS",
                                "value": "12"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__2__MECHANISM",
                                "value": "jwt"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__2__USE_JWKS",
                                "value": "false"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__2__ALGORITHM",
                                "value": "hmac-based"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__2__SECRET",
                                "value": "test"
                            },
                            {
                                "name": "EMQX_TELEMETRY__ENABLE",
                                "value": "false"
                            },
                            {
                                "name": "EMQX_AUTHENTICATION__2__VERIFY_CLAIMS",
                                "value": "{edge_node_id: \"${username}\"}"
                            },
                            {
                                "name": "EMQX_CLUSTER__DISCOVERY_STRATEGY",
                                "value": "dns"
                            },
                            {
                                "name": "EMQX_CLUSTER__DNS__RECORD_TYPE",
                                "value": "srv"
                            },
                            {
                                "name": "EMQX_SYSMON__VM__LONG_SCHEDULE",
                                "value": "disabled"
                            },
                            {
                                "name": "EMQX_LISTENERS__TCP__DEFAULT__ENABLE",
                                "value": "false"
                            },
                            {
                                "name": "EMQX_LISTENERS__SSL__DEFAULT__ENABLE",
                                "value": "false"
                            },
                            {
                                "name": "EMQX_LISTENERS__TCP__MQTT__BIND",
                                "value": "\"0.0.0.0:1883\""
                            }
                        ],
                        "livenessProbe": {
                            "failureThreshold": 3,
                            "httpGet": {
                                "path": "/status",
                                "port": "dashboard"
                            },
                            "initialDelaySeconds": 60,
                            "periodSeconds": 30
                        },
                        "podSecurityContext": {
                            "fsGroup": 1000,
                            "fsGroupChangePolicy": "Always",
                            "runAsGroup": 1000,
                            "runAsUser": 1000,
                            "supplementalGroups": [
                                1000
                            ]
                        },
                        "readinessProbe": {
                            "failureThreshold": 12,
                            "httpGet": {
                                "path": "/status",
                                "port": "dashboard"
                            },
                            "initialDelaySeconds": 10,
                            "periodSeconds": 5
                        },
                        "replicas": 2,
                        "resources": {},
                        "volumeClaimTemplates": {
                            "accessModes": [
                                "ReadWriteOnce"
                            ],
                            "resources": {
                                "requests": {
                                    "storage": "20Mi"
                                }
                            }
                        }
                    }
                },
                "image": "emqx/emqx:5.3.2",
                "listenersServiceTemplate": {
                    "enabled": true,
                    "metadata": {
                        "annotations": {
                            "service.beta.kubernetes.io/azure-load-balancer-resource-group": "anze-test"
                        }
                    },
                    "spec": {
                        "loadBalancerIP": "127.0.0.1",
                        "ports": [
                            {
                                "name": "tcp-mqtt",
                                "port": 1883,
                                "protocol": "TCP",
                                "targetPort": 1883
                            }
                        ],
                        "type": "LoadBalancer"
                    }
                },
                "replicantTemplate": {
                    "metadata": {},
                    "spec": {
                        "containerSecurityContext": {
                            "runAsGroup": 1000,
                            "runAsNonRoot": true,
                            "runAsUser": 1000
                        },
                        "livenessProbe": {
                            "failureThreshold": 3,
                            "httpGet": {
                                "path": "/status",
                                "port": "dashboard"
                            },
                            "initialDelaySeconds": 60,
                            "periodSeconds": 30
                        },
                        "podSecurityContext": {
                            "fsGroup": 1000,
                            "fsGroupChangePolicy": "Always",
                            "runAsGroup": 1000,
                            "runAsUser": 1000,
                            "supplementalGroups": [
                                1000
                            ]
                        },
                        "readinessProbe": {
                            "failureThreshold": 12,
                            "httpGet": {
                                "path": "/status",
                                "port": "dashboard"
                            },
                            "initialDelaySeconds": 10,
                            "periodSeconds": 5
                        },
                        "replicas": 0,
                        "resources": {}
                    }
                },
                "revisionHistoryLimit": 3,
                "updateStrategy": {
                    "evacuationStrategy": {
                        "connEvictRate": 1000,
                        "sessEvictRate": 1000,
                        "waitTakeover": 10
                    },
                    "initialDelaySeconds": 10,
                    "type": "Recreate"
                }
            },
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2024-01-10T14:30:10Z",
                        "message": "Create new replicaSet",
                        "reason": "CreateNewReplicaSet",
                        "status": "True",
                        "type": "ReplicantNodesProgressing"
                    },
                    {
                        "lastTransitionTime": "2024-01-10T14:30:10Z",
                        "message": "Core nodes is ready",
                        "reason": "CoreNodesReady",
                        "status": "True",
                        "type": "CoreNodesReady"
                    },
                    {
                        "lastTransitionTime": "2024-01-10T14:29:31Z",
                        "message": "Create new statefulSet",
                        "reason": "CreateNewStatefulSet",
                        "status": "True",
                        "type": "CoreNodesProgressing"
                    }
                ],
                "coreNodes": [
                    {
                        "controllerUID": "c9dc9bf5-f46b-4cee-b563-a2a1f2b48c93",
                        "edition": "Opensource",
                        "node": "emqx@test-emqx-core-57bcb74d8d-1.test-emqx-headless.default.svc.cluster.local",
                        "node_status": "running",
                        "otp_release": "25.3.2-2/13.2.2",
                        "podUID": "b4c79037-c5e7-4d0b-9110-6ad48d57e7ac",
                        "role": "core",
                        "uptime": 19904,
                        "version": "5.3.2"
                    },
                    {
                        "controllerUID": "c9dc9bf5-f46b-4cee-b563-a2a1f2b48c93",
                        "edition": "Opensource",
                        "node": "emqx@test-emqx-core-57bcb74d8d-0.test-emqx-headless.default.svc.cluster.local",
                        "node_status": "running",
                        "otp_release": "25.3.2-2/13.2.2",
                        "podUID": "d3e3fb8e-1dc0-4cd8-a3b8-194cb0aeb1aa",
                        "role": "core",
                        "uptime": 19906,
                        "version": "5.3.2"
                    }
                ],
                "coreNodesStatus": {
                    "currentReplicas": 2,
                    "currentRevision": "57bcb74d8d",
                    "readyReplicas": 2,
                    "replicas": 2,
                    "updateReplicas": 2,
                    "updateRevision": "57bcb74d8d"
                },
                "replicantNodesStatus": {
                    "currentRevision": "5d7b4558d5",
                    "updateRevision": "5d7b4558d5"
                }
            }
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}
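
One thing that can be checked mechanically is whether the CR status agrees with the StatefulSet. The Python sketch below compares status.coreNodesStatus.readyReplicas from the dump above with a READY column value such as the 1/2 reported by kubectl get statefulsets earlier in this thread. It is only a sketch: the helper names are hypothetical, and it assumes both values were captured at roughly the same time, which the thread does not confirm.

```python
from typing import Tuple


def parse_ready_column(value: str) -> Tuple[int, int]:
    """Split a READY column such as "1/2" into (ready, desired)."""
    ready, desired = value.split("/")
    return int(ready), int(desired)


def status_disagrees(core_nodes_status: dict, sts_ready_column: str) -> bool:
    """True when the CR reports a different ready count than the StatefulSet."""
    sts_ready, _ = parse_ready_column(sts_ready_column)
    return core_nodes_status["readyReplicas"] != sts_ready


# Values copied from the CR status above and the earlier StatefulSet listing.
core_nodes_status = {"currentReplicas": 2, "readyReplicas": 2, "replicas": 2}

print(status_disagrees(core_nodes_status, "1/2"))  # True
```

A True result only says the two views differ; it does not by itself say which side is stale.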

operator-controller-manager.log emqx1-recreated.log emqx0.log

Maybe this also helps:

kubectl get events -o custom-columns=TS:.firstTimestamp,Count:.count,From:.source.component,Type:.type,Reason:.reason,Message:.message --sort-by='.firstTimestamp' 
TS                     Count   From                                                                                           Type      Reason                    Message
2024-01-10T14:29:31Z   1       statefulset-controller                                                                         Normal    SuccessfulCreate          create Claim test-emqx-core-data-test-emqx-core-57bcb74d8d-0 Pod test-emqx-core-57bcb74d8d-0 in StatefulSet test-emqx-core-57bcb74d8d success
2024-01-10T14:29:31Z   1       persistentvolume-controller                                                                    Normal    ExternalProvisioning      waiting for a volume to be created, either by external provisioner "k8s.io/minikube-hostpath" or manually created by system administrator
2024-01-10T14:29:31Z   1       k8s.io/minikube-hostpath_minikube_4072228a-9bbc-4220-a8cc-258ba512cb4d                         Normal    ProvisioningSucceeded     Successfully provisioned volume pvc-40fe2dda-287e-457d-82b1-99431d404065
2024-01-10T14:29:31Z   1       k8s.io/minikube-hostpath_minikube_4072228a-9bbc-4220-a8cc-258ba512cb4d                         Normal    Provisioning              External provisioner is provisioning volume for claim "default/test-emqx-core-data-test-emqx-core-57bcb74d8d-1"
2024-01-10T14:29:31Z   1       k8s.io/minikube-hostpath_minikube_4072228a-9bbc-4220-a8cc-258ba512cb4d                         Normal    ProvisioningSucceeded     Successfully provisioned volume pvc-dee67a1f-6536-4efa-9034-70bb1803fd04
2024-01-10T14:29:31Z   1       default-scheduler                                                                              Warning   FailedScheduling          0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
2024-01-10T14:29:31Z   1       k8s.io/minikube-hostpath_minikube_4072228a-9bbc-4220-a8cc-258ba512cb4d                         Normal    Provisioning              External provisioner is provisioning volume for claim "default/test-emqx-core-data-test-emqx-core-57bcb74d8d-0"
2024-01-10T14:29:31Z   2       persistentvolume-controller                                                                    Normal    ExternalProvisioning      waiting for a volume to be created, either by external provisioner "k8s.io/minikube-hostpath" or manually created by system administrator
2024-01-10T14:29:31Z   1       default-scheduler                                                                              Warning   FailedScheduling          0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
2024-01-10T14:29:31Z   1       statefulset-controller                                                                         Normal    SuccessfulCreate          create Claim test-emqx-core-data-test-emqx-core-57bcb74d8d-1 Pod test-emqx-core-57bcb74d8d-1 in StatefulSet test-emqx-core-57bcb74d8d success
2024-01-10T14:29:31Z   1       statefulset-controller                                                                         Normal    SuccessfulCreate          create Pod test-emqx-core-57bcb74d8d-0 in StatefulSet test-emqx-core-57bcb74d8d successful
2024-01-10T14:29:31Z   2       statefulset-controller                                                                         Normal    SuccessfulCreate          create Pod test-emqx-core-57bcb74d8d-1 in StatefulSet test-emqx-core-57bcb74d8d successful
2024-01-10T14:29:32Z   1       default-scheduler                                                                              Normal    Scheduled                 Successfully assigned default/test-emqx-core-57bcb74d8d-0 to minikube
2024-01-10T14:29:32Z   1       default-scheduler                                                                              Normal    Scheduled                 Successfully assigned default/test-emqx-core-57bcb74d8d-1 to minikube
2024-01-10T14:29:34Z   1       kubelet                                                                                        Normal    Started                   Started container emqx
2024-01-10T14:29:34Z   1       kubelet                                                                                        Normal    Pulled                    Container image "emqx/emqx:5.3.2" already present on machine
2024-01-10T14:29:34Z   1       kubelet                                                                                        Normal    Created                   Created container emqx
2024-01-10T14:29:34Z   1       kubelet                                                                                        Normal    Started                   Started container emqx
2024-01-10T14:29:34Z   1       kubelet                                                                                        Normal    Pulled                    Container image "emqx/emqx:5.3.2" already present on machine
2024-01-10T14:29:34Z   1       kubelet                                                                                        Normal    Created                   Created container emqx
2024-01-10T14:29:35Z   25      emqx-controller                                                                                Warning   FailedToGetNodeStatuses   failed to get node statues by API: failed to get API http://10.244.0.92:18083/api/v5/nodes: failed to request API: Get "http://10.244.0.92:18083/api/v5/nodes": dial tcp 10.244.0.92:18083: connect: connection refused
2024-01-10T14:29:44Z   4       kubelet                                                                                        Warning   Unhealthy                 Readiness probe failed: Get "http://10.244.0.92:18083/status": dial tcp 10.244.0.92:18083: connect: connection refused
2024-01-10T14:29:48Z   4       kubelet                                                                                        Warning   Unhealthy                 Readiness probe failed: Get "http://10.244.0.91:18083/status": dial tcp 10.244.0.91:18083: connect: connection refused
2024-01-10T14:30:04Z   1       kubelet                                                                                        Warning   Unhealthy                 Readiness probe failed: Get "http://10.244.0.91:18083/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-01-10T14:31:07Z   1       kubelet                                                                                        Normal    Killing                   Stopping container emqx
2024-01-10T14:31:07Z   1       endpoint-controller                                                                            Warning   FailedToUpdateEndpoint    Failed to update endpoint default/test-emqx-dashboard: Operation cannot be fulfilled on endpoints "test-emqx-dashboard": the object has been modified; please apply your changes to the latest version and try again
2024-01-10T14:31:07Z   1       endpoint-controller                                                                            Warning   FailedToUpdateEndpoint    Failed to update endpoint default/test-emqx-listeners: Operation cannot be fulfilled on endpoints "test-emqx-listeners": the object has been modified; please apply your changes to the latest version and try again
2024-01-10T14:31:09Z   1       default-scheduler                                                                              Normal    Scheduled                 Successfully assigned default/test-emqx-core-57bcb74d8d-1 to minikube
2024-01-10T14:31:10Z   1       kubelet                                                                                        Normal    Created                   Created container emqx
2024-01-10T14:31:10Z   1       kubelet                                                                                        Normal    Pulled                    Container image "emqx/emqx:5.3.2" already present on machine
2024-01-10T14:31:10Z   1       kubelet                                                                                        Normal    Started                   Started container emqx

Rory-Z commented 10 months ago

OK. Looking at the conditions and at replicantTemplate.replicas = 0 in your EMQX CR, I think this issue is related to https://github.com/emqx/emqx-operator/issues/1002. Could you please deploy EMQX operator 2.2.12 and retry?

anzerozman commented 10 months ago

Hi,

thank you very much @Rory-Z, it works without problems in v2.2.12! I have only one more question: I noticed that with operator versions > 2.2.5, besides the enabled port, all other default ports are also visible/available on the listeners (load balancer) service, although they are disabled and not specified?

kubectl describe service test-emqx-listeners 
Name:                     test-emqx-listeners
Namespace:                default
Labels:                   apps.emqx.io/instance=test-emqx
                          apps.emqx.io/managed-by=emqx-operator
Annotations:              apps.emqx.io/last-applied:
                            UEsDBBQACAAIAAAAAAAAAAAAAAAAAAAAAAAIAAAAb3JpZ2luYWykVduO3DYM/Rc+W64vs41Hb70CBXqZttNFkUwfZJleCCtLjiRvshn43wv6MvY4k26RvMkkfXhIHkpnEK26R+eVNc...
                          service.beta.kubernetes.io/azure-load-balancer-resource-group: anze-test
Selector:                 apps.emqx.io/db-role=core,apps.emqx.io/instance=test-emqx,apps.emqx.io/managed-by=emqx-operator
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.108.182.245
IPs:                      10.108.182.245
IP:                       127.0.0.1
LoadBalancer Ingress:     127.0.0.1
Port:                     tcp-mqtt  1883/TCP
TargetPort:               1883/TCP
NodePort:                 tcp-mqtt  32623/TCP
Endpoints:                10.244.0.102:1883,10.244.0.103:1883
Port:                     ssl-default  8883/TCP
TargetPort:               8883/TCP
NodePort:                 ssl-default  30891/TCP
Endpoints:                10.244.0.102:8883,10.244.0.103:8883
Port:                     ws-default  8083/TCP
TargetPort:               8083/TCP
NodePort:                 ws-default  31260/TCP
Endpoints:                10.244.0.102:8083,10.244.0.103:8083
Port:                     wss-default  8084/TCP
TargetPort:               8084/TCP
NodePort:                 wss-default  31657/TCP
Endpoints:                10.244.0.102:8084,10.244.0.103:8084
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Thx, Anže.

Rory-Z commented 10 months ago

besides the enabled port all other default ports are visible/available to listener (load balancer) service, although they are disabled and not specified

Yes, because they are enabled by default in EMQX. If you want to disable one, you can set, for example, listeners.tcp.default.enable = false in .spec.config.data.

anzerozman commented 10 months ago

Thx. I have them disabled in coreTemplate.spec.env (you can see it in one of my previous posts). Should I move this setting to .spec.config.data then?

Rory-Z commented 10 months ago

Thx. I have them disabled in coreTemplate.spec.env (you can see it in one of my previous posts). Should I move this setting to .spec.config.data then?

Yes, .spec.config.data is better.

anzerozman commented 10 months ago

OK, thx. It was confusing because the EMQX log also says:

Listener ssl:default is NOT started due to: disabled.
Listener tcp:default is NOT started due to: disabled.
Listener tcp:mqtt on 0.0.0.0:1883 started.
Listener ws:default is NOT started due to: disabled.
Listener wss:default is NOT started due to: disabled.
Listener http:dashboard on :18083 started.

Rory-Z commented 10 months ago

The EMQX operator loads the EMQX config from .spec.config.data and watches it. When the operator finds that some listeners are disabled there, it removes those ports from the service. However, the operator cannot read EMQX config that is set through pod env.

I recommend using .spec.config.data instead: if you want to update the EMQX config, just change .spec.config.data. If you use env in the pod, the only way to update the EMQX config is to restart the pod.
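
For illustration, a minimal sketch of what moving the listener settings into .spec.config.data could look like. The CR name and image are taken from the output earlier in this thread, and the listener names come from the EMQX log above. The rest of the CR (coreTemplate, the custom tcp:mqtt listener, service settings) is not shown and is assumed to stay as it is in your deployment.

```yaml
# Sketch only: adapt to your own EMQX CR; fields not shown here are assumed unchanged.
apiVersion: apps.emqx.io/v2beta1
kind: EMQX
metadata:
  name: test-emqx
spec:
  image: emqx/emqx:5.3.2
  config:
    # The operator reads this config, so listeners disabled here
    # should also be left out of the test-emqx-listeners service.
    data: |
      listeners.tcp.default.enable = false
      listeners.ssl.default.enable = false
      listeners.ws.default.enable = false
      listeners.wss.default.enable = false
```

After applying it, `kubectl describe service test-emqx-listeners` should list only the ports of the listeners that are still enabled.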