apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

High kafka partition count causing lag metrics to be dropped #15655

Open karthikgurram87 opened 9 months ago

karthikgurram87 commented 9 months ago

Setup

  - Source: a single Kafka topic
  - Partitions: 400
  - Metrics Emitter: StatsDEmitter
  - Issue: Overlord dropping metrics

Description

We have a Druid setup that consumes from a Kafka topic with approximately 400 partitions. We recently upgraded to Druid 25 and enabled the ingest/kafka/partitionLag metric. Since then, we have noticed that lag metrics are being dropped.

We are using StatsDEmitter to send the metrics, which eventually end up in Datadog. We have ruled out all the other places where the metrics could be dropped. We rely on dogstatsd.client.packets_dropped to detect whether packets are being dropped, but the telemetry metrics available in StatsDProcessor do not carry any tags that would associate the dropped packets with a specific node. A screenshot is attached below.

[Screenshot, 2024-01-10: Datadog graph of dogstatsd.client.packets_dropped showing dropped packets]

ingest/kafka/maxLag is crucial to us as we rely on it extensively for alerting. We use ingest/kafka/partitionLag to identify the partitions that lag the most.

The metrics are not dropped if we disable the partitionLag metric. The high number of partitions causes the outbound queue in StatsDSender to fill up frequently, as indicated in the screenshot above, and metrics are dropped once it is full.
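For context, here is a minimal sketch of why the burst happens, assuming the java-dogstatsd-client that StatsDEmitter builds on. The queueSize and enableTelemetry builder options are real client options; the host/port values and lagFor() are placeholders, and this is not Druid's actual emitter wiring:

```java
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

public class PartitionLagBurstSketch {
    public static void main(String[] args) {
        StatsDClient client = new NonBlockingStatsDClientBuilder()
                .prefix("druid")
                .hostname("localhost")   // hypothetical agent address
                .port(8125)
                .queueSize(4096)         // bounded outbound queue; the non-blocking
                                         // client drops packets once it is full
                .enableTelemetry(true)   // reports drops via the client telemetry
                .build();

        // Every reporting cycle enqueues ~400 per-partition gauges at once; with
        // other metrics in flight, these bursts are what fill the queue.
        for (int partition = 0; partition < 400; partition++) {
            client.gauge("ingest.kafka.partitionLag", lagFor(partition),
                    "partition:" + partition);
        }
        client.stop();
    }

    // Placeholder for the real consumer-lag lookup.
    private static long lagFor(int partition) {
        return 0L;
    }
}
```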

Proposal

  1. Pass on the tags available in StatsDEmitter to telemetry metrics in StatsDSender.
  2. Emit partitionLag from a new ScheduledExecutorService in SeekableStreamSupervisor, with a configurable emissionPeriod. Add a random delay between successive emits so that the total delay stays below emissionPeriod (see the sketch below).

We could also emit only the top n lagging partitions, but that would not produce a continuous time-series graph, so it is not preferred.
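A minimal sketch of proposal 2, with hypothetical getPartitionLags() and emitGauge() methods standing in for the real supervisor and emitter plumbing: a dedicated ScheduledExecutorService emits the per-partition gauges spread across emissionPeriod with random delays, capping each delay at emissionPeriod / partitionCount so the total stays below the period and the StatsD queue never sees one large burst:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class JitteredPartitionLagEmitter {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long emissionPeriodMillis;

    public JitteredPartitionLagEmitter(long emissionPeriodMillis) {
        this.emissionPeriodMillis = emissionPeriodMillis;
    }

    public void start() {
        // Dedicated thread: sleeping between gauges only delays this task,
        // not the supervisor's other work.
        scheduler.scheduleAtFixedRate(this::emitAll, 0, emissionPeriodMillis,
                TimeUnit.MILLISECONDS);
    }

    private void emitAll() {
        Map<Integer, Long> lags = getPartitionLags();
        // Each random delay is at most emissionPeriod / partitionCount,
        // so the sum of delays stays below emissionPeriod.
        long maxDelayMillis = emissionPeriodMillis / Math.max(1, lags.size());
        for (Map.Entry<Integer, Long> entry : lags.entrySet()) {
            emitGauge("ingest/kafka/partitionLag", entry.getValue(),
                    "partition:" + entry.getKey());
            try {
                Thread.sleep(ThreadLocalRandom.current().nextLong(maxDelayMillis + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Hypothetical stand-in for the supervisor's per-partition lag snapshot.
    private Map<Integer, Long> getPartitionLags() {
        Map<Integer, Long> lags = new LinkedHashMap<>();
        for (int p = 0; p < 400; p++) {
            lags.put(p, 0L);
        }
        return lags;
    }

    // Hypothetical stand-in for the StatsDEmitter call.
    private void emitGauge(String metric, long value, String tag) {
        System.out.printf("%s %d %s%n", metric, value, tag);
    }

    public static void main(String[] args) {
        new JitteredPartitionLagEmitter(60_000L).start();
    }
}
```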

github-actions[bot] commented 2 weeks ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.