Open karthikgurram87 opened 9 months ago
Setup
- Source: a single Kafka topic
- Partitions: 400
- Metrics Emitter: StatsDEmitter
- Issue: Overlord dropping metrics
Description
We have a Druid setup that consumes from a Kafka topic with approximately 400 partitions. We recently upgraded to Druid 25 and enabled the `ingest/kafka/partitionLag` metric. Since then we have noticed that `ingest/kafka/partitionLag`, `ingest/kafka/maxLag`, and `ingest/notices/queueSize` are frequently dropped at the Overlord. We have not seen this issue with any other metrics.
We are using the `StatsDEmitter` to send the metrics, and they eventually end up in Datadog. We have ruled out the other places where the metrics could be dropped. We rely on `dogstatsd.client.packets_dropped` to see whether packets are being dropped, but the telemetry metrics available in the `StatsDProcessor` do not carry any tags that would let us associate the dropped packets with a specific node. Attaching a screenshot.

`ingest/kafka/maxLag` is crucial to us because we rely on it extensively for alerting, and we use `ingest/kafka/partitionLag` to identify the partitions that lag the most. The metrics are not dropped if we disable the partitionLag metric. The high number of partitions causes some of the metrics to be dropped in the `StatsDSender` because its outbound queue frequently becomes full, as shown in the screenshot above.
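For context, this is roughly where the drops happen in the underlying java-dogstatsd-client: every metric is enqueued on a bounded outbound queue, and anything that does not fit is counted by the client telemetry. A minimal sketch (not Druid's actual wiring; the prefix, host, queue size, and node tag below are illustrative assumptions) of how per-node constant tags plus client telemetry could make the dropped-packet counter attributable to a specific node:

```java
import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;

public class StatsDTelemetryExample
{
  public static void main(String[] args)
  {
    // Hypothetical standalone client, not the one StatsDEmitter builds internally.
    NonBlockingStatsDClient client = new NonBlockingStatsDClientBuilder()
        .prefix("druid")
        .hostname("localhost")                   // DogStatsD agent host (assumption)
        .port(8125)
        .queueSize(4096)                         // bounded outbound queue; drops happen when it fills up
        .constantTags("druid_node:overlord-1")   // per-node tag so telemetry can be attributed
        .enableTelemetry(true)                   // emit client telemetry such as dropped-packet counters
        .build();

    // Each gauge call is enqueued; with ~400 partitions emitted at once,
    // hundreds of packets land on the queue in a single reporting cycle.
    // (Slashes in Druid metric names are typically converted to dots for StatsD.)
    for (int partition = 0; partition < 400; partition++) {
      client.gauge("ingest.kafka.partitionLag", 123L, "partition:" + partition);
    }

    client.stop();
  }
}
```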
Proposal
- Propagate tags from the `StatsDEmitter` to the telemetry metrics in `StatsDSender`, so that dropped packets can be attributed to a specific node.
- Emit the partition lag metrics from a `ScheduledExecutorService` with a configurable emissionPeriod in `SeekableStreamSupervisor`, putting a random delay between each emit so that the total delay stays below emissionPeriod (see the sketch after this list).
- We could also emit only the top n lagging partitions, but that does not produce a continuous time-series graph, so it is not preferred.
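A rough sketch of the second option. The class, the `emitPartitionLag` entry point, and the `emitGauge` hook are hypothetical names, not part of the actual `SeekableStreamSupervisor` code; the point is only to show the per-partition gauges being spread across the emission period instead of being flushed in one burst:

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class StaggeredLagEmitter
{
  private final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
  private final long emissionPeriodMillis;

  public StaggeredLagEmitter(long emissionPeriodMillis)
  {
    this.emissionPeriodMillis = emissionPeriodMillis;
  }

  /**
   * Schedules one gauge per partition with a random delay in [0, emissionPeriod),
   * so ~400 packets are spread over the period instead of hitting the StatsD
   * client's outbound queue all at once.
   */
  public void emitPartitionLag(Map<Integer, Long> partitionLag)
  {
    partitionLag.forEach((partition, lag) -> {
      long delay = ThreadLocalRandom.current().nextLong(emissionPeriodMillis);
      exec.schedule(
          () -> emitGauge("ingest/kafka/partitionLag", partition, lag),
          delay,
          TimeUnit.MILLISECONDS
      );
    });
  }

  private void emitGauge(String metric, int partition, long lag)
  {
    // Placeholder: in Druid this would go through the ServiceEmitter / StatsDEmitter.
    System.out.printf("%s partition=%d lag=%d%n", metric, partition, lag);
  }
}
```

Because each delay is drawn independently from [0, emissionPeriod), every value is still emitted within the period, but the instantaneous load on the StatsDSender queue is spread out rather than arriving as a single burst of ~400 packets.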