Closed onedr0p closed 7 months ago
It's also worth noting I only see these svc, nothing related to the core pods.
❯ k get svc -n database
NAME                                               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
emqx-operator-controller-manager-metrics-service   ClusterIP   10.43.9.231    <none>        8080/TCP   134m
emqx-operator-webhook-service                      ClusterIP   10.43.45.107   <none>        443/TCP    134m
kubectl -n database get emqx emqx5 -o json
{
"apiVersion": "apps.emqx.io/v2beta1",
"kind": "EMQX",
"metadata": {
"annotations": {
"apps.emqx.io/last-emqx-configuration": "log.console.level = debug\n",
"kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apps.emqx.io/v2beta1\",\"kind\":\"EMQX\",\"metadata\":{\"annotations\":{},\"name\":\"emqx5\",\"namespace\":\"database\"},\"spec\":{\"config\":{\"data\":\"log.console.level = debug\\n\"},\"coreTemplate\":{\"spec\":{\"replicas\":3}},\"image\":\"public.ecr.aws/emqx/emqx:5.6.0\"}}\n"
},
"creationTimestamp": "2024-04-09T18:11:01Z",
"generation": 2,
"name": "emqx5",
"namespace": "database",
"resourceVersion": "52022722",
"uid": "62eec439-9224-4eeb-a106-434ec94b16b2"
},
"spec": {
"clusterDomain": "cluster.local",
"config": {
"data": "log.console.level = debug\n",
"mode": "Merge"
},
"coreTemplate": {
"metadata": {},
"spec": {
"containerSecurityContext": {
"runAsGroup": 1000,
"runAsNonRoot": true,
"runAsUser": 1000
},
"livenessProbe": {
"failureThreshold": 3,
"httpGet": {
"path": "/status",
"port": "dashboard"
},
"initialDelaySeconds": 60,
"periodSeconds": 30
},
"podSecurityContext": {
"fsGroup": 1000,
"fsGroupChangePolicy": "Always",
"runAsGroup": 1000,
"runAsUser": 1000,
"supplementalGroups": [
1000
]
},
"readinessProbe": {
"failureThreshold": 12,
"httpGet": {
"path": "/status",
"port": "dashboard"
},
"initialDelaySeconds": 10,
"periodSeconds": 5
},
"replicas": 3,
"resources": {},
"volumeClaimTemplates": {
"resources": {}
}
}
},
"image": "public.ecr.aws/emqx/emqx:5.6.0",
"revisionHistoryLimit": 3,
"updateStrategy": {
"evacuationStrategy": {
"connEvictRate": 1000,
"sessEvictRate": 1000,
"waitTakeover": 10
},
"initialDelaySeconds": 10,
"type": "Recreate"
}
},
"status": {
"conditions": [
{
"lastTransitionTime": "2024-04-09T18:11:02Z",
"message": "Create new statefulSet",
"reason": "CreateNewStatefulSet",
"status": "True",
"type": "CoreNodesProgressing"
}
],
"coreNodes": [
{
"controllerUID": "c3c2f68a-8a30-4542-97d7-f356024863fe",
"edition": "Opensource",
"node": "emqx@emqx5-core-6796d44f-0.emqx5-headless.database.svc.cluster.local",
"node_status": "running",
"otp_release": "25.3.2-2/13.2.2",
"podUID": "5376b8c9-fc07-48c3-b9a7-2591f6342027",
"role": "core",
"uptime": 45232,
"version": "5.6.0"
}
],
"coreNodesStatus": {
"currentReplicas": 1,
"currentRevision": "6796d44f",
"readyReplicas": 1,
"replicas": 3,
"updateReplicas": 1,
"updateRevision": "6796d44f"
},
"replicantNodesStatus": {}
}
}
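The status block in this dump already shows the symptom: three core replicas are requested, only one is ready, and the only condition reported is CoreNodesProgressing. As a rough sketch, the relevant fields can be pulled out of the `kubectl ... -o json` output with a few lines of Python. The function name `summarize_core_status` is hypothetical, and the field names are simply the ones visible in the dump, not a documented schema.

```python
def summarize_core_status(resource: dict) -> dict:
    """Compare requested vs. ready core replicas in an EMQX resource dict.

    Field names are taken from the JSON dump in this thread; treat them
    as assumptions rather than a documented contract.
    """
    wanted = resource["spec"]["coreTemplate"]["spec"]["replicas"]
    status = resource.get("status", {})
    ready = status.get("coreNodesStatus", {}).get("readyReplicas", 0)
    # Keep only the condition types that are currently reported as True.
    active = [
        cond["type"]
        for cond in status.get("conditions", [])
        if cond.get("status") == "True"
    ]
    return {
        "wanted": wanted,
        "ready": ready,
        "all_ready": ready >= wanted,
        "conditions": active,
    }
```

To use it, load the command output with `json.loads` and pass the resulting dict in. For the resource shown here it would report 3 wanted, 1 ready, and CoreNodesProgressing as the only active condition.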
Hi @onedr0p, I checked the EMQX pod log and found this: 2024-04-09T18:01:50.996718+00:00 [debug] Ekka(AutoCluster): join result: ignore. It means the EMQX application cannot find any nodes through the DNS server.
Could you please check the DNS server? You can create an Ubuntu pod in the same namespace as the EMQX pod and run this command: nslookup -type=srv emqx5-headless.database.svc.cluster.local
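If a Python interpreter is easier to get into the namespace than nslookup, a similar check can be sketched with the standard library. This is not the same query: `socket.getaddrinfo` does an ordinary address lookup rather than an SRV lookup, so it only shows whether the headless service name resolves to any pod IPs. The helper names are hypothetical, and the `<name>-headless` pattern is inferred from the single hostname in this thread.

```python
import socket


def headless_service_host(name: str, namespace: str,
                          cluster_domain: str = "cluster.local") -> str:
    # Naming pattern inferred from the hostname quoted in this thread;
    # it is an assumption, not a documented operator guarantee.
    return f"{name}-headless.{namespace}.svc.{cluster_domain}"


def resolve_addresses(host: str, resolver=socket.getaddrinfo) -> list:
    """Return the sorted unique IPs that `host` resolves to, or [] on failure."""
    try:
        infos = resolver(host, None)
    except socket.gaierror:
        # Name does not resolve, e.g. because the service does not exist.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return sorted({info[4][0] for info in infos})
```

Run from a pod inside the cluster, `resolve_addresses(headless_service_host("emqx5", "database"))` returning an empty list would be consistent with the service being missing or having no endpoints. The `resolver` parameter exists only so the function can be exercised without a cluster.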
Hi @Rory-Z, I'm able to replicate this issue as well by using the same configuration and versions as @onedr0p.
I'm guessing that the DNS issue is caused by the missing services. Just as @onedr0p mentioned above, the only services I see are for the operator and the webhook. There is no emqx5-headless svc.
Follow-up: I reverted to version 2.2.14 of the operator, and with that version the exact same cluster configuration works. The services are created and the cluster quickly becomes Ready.
After that I can bump the version back to 2.2.19, and since the services have already been created by the old version of the operator, everything still works. But deploying a new cluster fails again.
@ahinko @onedr0p Thanks for the feedback. I got the same result, and I think this is a bug in 2.2.19. Let me fix it.
@onedr0p @ahinko EMQX operator 2.2.20 has been released, please try it.
@Rory-Z 2.2.20 fixes the issue. Thank you for the quick fix.
Thanks for the update @Rory-Z looks to be working here!
Describe the bug
I have deployed the EMQX operator and tried to create an EMQX cluster. If I set the replicas to 3, the EMQX cluster stays in a CoreNodesProgressing state.

To Reproduce
EMQX resource with Flux (might not matter)
Running
helm values
cluster definition
operator logs
a replica logs

Expected behavior
For the cluster to be in a Running state.

Anything else we need to know?:
If I set the core replicas to 1, everything is happy.

Environment details: