apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Metric taskSlot/idle/count doesn't exclude disabled workers #16771

Open jakubmatyszewski opened 1 month ago

jakubmatyszewski commented 1 month ago

Affected Version

29.0.1

Description

I've noticed that the metric taskSlot/idle/count, which is part of TaskSlotCountStatsMonitor, does not exclude slots from disabled workers. I'm running the overlord in httpRemote mode, and it seems like this metric should account for disabled workers: it calls getWorkersEligibleToRunTasks(), which checks whether the worker isEnabled(). But when I connect to the middlemanager instance, disable the worker, and query the metrics, I get:

$ curl -X POST localhost:8091/druid/worker/v1/disable
{"druid-middlemanager-default-0.druid-middlemanager.druid-test.svc.cluster.local:8091":"disabled"}

$ curl -s localhost:8000/metrics | grep slot | grep .0
druid_middlemanager_worker_taskslot_total_count{category="__default_worker_category_",druid_service="druid/middleManager",} 3.0
druid_middlemanager_worker_taskslot_used_count{category="__default_worker_category_",druid_service="druid/middleManager",} 0.0
druid_middlemanager_worker_taskslot_idle_count{category="__default_worker_category_",druid_service="druid/middleManager",} 3.0


So in the end I'm kinda confused why the metric doesn't show the proper value. I guess I'm missing something while reading the code?

FrankChen021 commented 1 month ago

I think disabled workers should be excluded from the idle metrics. Maybe we need one more metric, such as taskslot_disabled_count.
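
To make the suggestion above concrete, here is a minimal sketch of how the two counts could be computed per worker category. It is not Druid's actual code: the `Worker` class, `idleSlots`, and `disabledSlots` are hypothetical names made up for illustration, and reporting a disabled worker's full capacity under the extra metric is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these types are Druid's real classes.
final class TaskSlotSketch {

    /** Minimal stand-in for a worker as seen by the overlord (assumed fields). */
    static final class Worker {
        final String category;
        final int capacity;
        final int usedSlots;
        final boolean enabled;

        Worker(String category, int capacity, int usedSlots, boolean enabled) {
            this.category = category;
            this.capacity = capacity;
            this.usedSlots = usedSlots;
            this.enabled = enabled;
        }

        long freeSlots() {
            return Math.max(0, capacity - usedSlots);
        }
    }

    /** Idle slots per category; a disabled worker contributes 0. */
    static Map<String, Long> idleSlots(List<Worker> workers) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Worker w : workers) {
            long idle = w.enabled ? w.freeSlots() : 0L;
            result.merge(w.category, idle, Long::sum);
        }
        return result;
    }

    /** Slots per category that sit on disabled workers (assumption: full capacity). */
    static Map<String, Long> disabledSlots(List<Worker> workers) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Worker w : workers) {
            long disabled = w.enabled ? 0L : (long) w.capacity;
            result.merge(w.category, disabled, Long::sum);
        }
        return result;
    }
}
```

With a single disabled worker of capacity 3, as in the report above, this sketch would yield an idle count of 0 and a disabled count of 3 for that category, instead of an idle count of 3.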