Closed: winsmith closed this issue 8 months ago.
Update: I just switched from discovery via druid-kubernetes-extensions to discovery via Zookeeper, and it works. I think I'll take the hit of running a few Zookeeper pods if that means this issue stops cropping up.
If anyone has any comments or tips on how to resolve the original issue, I'd love that; otherwise I'll close the issue in a few days.
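For anyone hitting the same thing, the change boiled down to roughly the following in common.runtime.properties. This is a sketch from memory, and the Zookeeper hostname is just a placeholder for whatever your ZK service resolves to:

# Before: discovery via druid-kubernetes-extensions (no Zookeeper)
druid.extensions.loadList=[..., "druid-kubernetes-extensions"]   # other extensions elided
druid.discovery.type=k8s
druid.zk.service.enabled=false
druid.serverview.type=http
druid.coordinator.loadqueuepeon.type=http
druid.indexer.runner.type=httpRemote

# After: remove the extension and druid.discovery.type, re-enable Zookeeper,
# and point Druid at the ZK service (hostname below is a placeholder)
druid.zk.service.enabled=true
druid.zk.service.host=zookeeper.druid.svc.cluster.local
druid.zk.paths.base=/druid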
I'm having the same problem. After each Peon run completes, this is the last log line the MiddleManager prints:
completed with status [SUCCESS]
And after that, any new task stays in PENDING status as if there were no workers available.
This is the last full log line from the MiddleManager for a while:
druid-druid-cluster-middlemanagers 2024-02-22T12:47:13,806 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task [index_kafka_telemetry_bfa96517080dc71_dllebiic] completed with status [SUCCESS]
Afterwards I can see log lines like these in the coordinator:
druid-druid-cluster-coordinators 2024-02-22T12:50:27,589 ERROR [org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcherpeon] org.apache.druid.discovery.BaseNodeRoleWatcher - Noticed disappearance of unknown druid node [http://10.244.6.147:8101] of role [peon].
druid-druid-cluster-coordinators 2024-02-22T12:50:28,072 ERROR [HttpServerInventoryView-2] org.apache.druid.server.coordination.ChangeRequestHttpSyncer - Sync failed for server[http://10.244.6.147:8100/] while [Handling response with code[503], description[Service Unavailable]]. Failed [10] times in the last [219] seconds.. Try restarting the Druid process on server[http://10.244.6.147:8100/].: {exceptionType=org.apache.druid.java.util.common.ISE, exceptionMessage=Received sync response [503], class=org.apache.druid.server.coordination.ChangeRequestHttpSyncer} (org.apache.druid.java.util.common.ISE: Received sync response [503])
The MiddleManager pod status is healthy, so the pod won't restart on its own. If I manually restart the container or kill the pod (and a new one is created), after a few seconds the task moves from PENDING to RUNNING.
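In case it helps, this is roughly how I bounce the workers today. The StatefulSet name comes from my deployment (it matches the pod prefix in the logs above), and the namespace is a placeholder you'll need to adapt:

# Delete the MiddleManager pod so its StatefulSet (or Deployment) recreates it
kubectl delete pod druid-druid-cluster-middlemanagers-0 -n druid

# Or restart the whole MiddleManager StatefulSet
kubectl rollout restart statefulset druid-druid-cluster-middlemanagers -n druid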
I'm using version 29.0.0 of the containers on AKS. I'm also using druid-kubernetes-extensions, and I'm thinking of switching to Zookeeper to test whether the problem is there.
Hi. I'm currently evaluating running Druid on a Kubernetes cluster instead of bare metal. Besides druid-operator, I use druid-kubernetes-extensions instead of Zookeeper.
I currently have 1 coordinator and router, and 2 brokers, 2 historicals and 2 middlemanagers set up. I'm pretty sure I'm going to scale this up later – right now my objective is to load my data into this cluster and see where it needs more performance. I ingest data using Kafka.
The Problem
Every time I start an ingestion task, right when the task finishes, the MiddleManager that ran the task vanishes from the "Services" list. It's just gone, and is no longer available for the overlord to run tasks on. The fact that the task has completed never reaches the overlord; instead, the overlord eventually kills the task as failed (which, curiously enough, does reach the MiddleManager). However, the logs on the MiddleManager show no errors and indicate that the task completed.
How do I start debugging this? I didn't find any error messages to google, and the fact that there are no errors in the middlemanagers makes it hard to find a specific clue.
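The only starting points I can think of so far are the standard Overlord APIs and the pod logs. In the sketch below the host, port, and pod name are placeholders (the Overlord listens on 8090 when standalone, or on 8081 if it runs inside the Coordinator):

# Which workers does the Overlord currently know about?
curl http://OVERLORD_HOST:8081/druid/indexer/v1/workers

# What state are the tasks in from the Overlord's point of view?
curl http://OVERLORD_HOST:8081/druid/indexer/v1/tasks

# Tail the MiddleManager around the moment a task finishes
kubectl logs -f MIDDLEMANAGER_POD_NAME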
MiddleManager Configuration
MiddleManager Logs