apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

High kafka partition count causing lag metrics to be dropped #15655

Open karthikgurram87 opened 9 months ago

karthikgurram87 commented 9 months ago

Setup

  - Source: a single Kafka topic
  - Partitions: 400
  - Metrics Emitter: StatsDEmitter
  - Issue: Overlord dropping metrics

Description

We have a Druid setup that consumes from a Kafka topic with approximately 400 partitions. We recently upgraded to Druid 25 and enabled the ingest/kafka/partitionLag metric. Since then, we have noticed that lag metrics are being dropped.

We are using StatsDEmitter to send the metrics, which eventually end up in Datadog. We have ruled out all the other places where the metrics could be dropped. We rely on dogstatsd.client.packets_dropped to detect whether packets are being dropped, but the telemetry metrics available in StatsDProcessor do not carry any tags that would associate the dropped packets with a specific node. A screenshot is attached below.

[Screenshot, 2024-01-10: Datadog graph of dogstatsd.client.packets_dropped showing dropped packets]

ingest/kafka/maxLag is crucial to us as we rely on it extensively for alerting. We use ingest/kafka/partitionLag to identify the partitions that lag the most.

The metrics are not dropped if we disable the partitionLag metric. The high number of partitions causes the outbound queue in StatsDSender to fill up frequently, as indicated in the screenshot above, and metrics are dropped once it is full.
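For context, here is a minimal sketch of why the burst happens, assuming the java-dogstatsd-client that StatsDEmitter builds on. The queueSize and enableTelemetry builder options are real client options; the host/port values and lagFor() are placeholders, and this is not Druid's actual emitter wiring:

```java
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

public class PartitionLagBurstSketch {
    public static void main(String[] args) {
        StatsDClient client = new NonBlockingStatsDClientBuilder()
                .prefix("druid")
                .hostname("localhost")   // hypothetical agent address
                .port(8125)
                .queueSize(4096)         // bounded outbound queue; the non-blocking
                                         // client drops packets once it is full
                .enableTelemetry(true)   // reports drops via the client telemetry
                .build();

        // Every reporting cycle enqueues ~400 per-partition gauges at once; with
        // other metrics in flight, these bursts are what fill the queue.
        for (int partition = 0; partition < 400; partition++) {
            client.gauge("ingest.kafka.partitionLag", lagFor(partition),
                    "partition:" + partition);
        }
        client.stop();
    }

    // Placeholder for the real consumer-lag lookup.
    private static long lagFor(int partition) {
        return 0L;
    }
}
```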

Proposal

  1. Pass on the tags available in StatsDEmitter to telemetry metrics in StatsDSender.
  2. Emit partitionLag from a new ScheduledExecutorService in SeekableStreamSupervisor, with a configurable emissionPeriod. Add a random delay between successive emits so that the total delay stays below emissionPeriod (see the sketch below).

We could also emit only the top n lagging partitions, but that would not produce a continuous time-series graph, so it is not preferred.
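A minimal sketch of proposal 2, with hypothetical getPartitionLags() and emitGauge() methods standing in for the real supervisor and emitter plumbing: a dedicated ScheduledExecutorService emits the per-partition gauges spread across emissionPeriod with random delays, capping each delay at emissionPeriod / partitionCount so the total stays below the period and the StatsD queue never sees one large burst:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class JitteredPartitionLagEmitter {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long emissionPeriodMillis;

    public JitteredPartitionLagEmitter(long emissionPeriodMillis) {
        this.emissionPeriodMillis = emissionPeriodMillis;
    }

    public void start() {
        // Dedicated thread: sleeping between gauges only delays this task,
        // not the supervisor's other work.
        scheduler.scheduleAtFixedRate(this::emitAll, 0, emissionPeriodMillis,
                TimeUnit.MILLISECONDS);
    }

    private void emitAll() {
        Map<Integer, Long> lags = getPartitionLags();
        // Each random delay is at most emissionPeriod / partitionCount,
        // so the sum of delays stays below emissionPeriod.
        long maxDelayMillis = emissionPeriodMillis / Math.max(1, lags.size());
        for (Map.Entry<Integer, Long> entry : lags.entrySet()) {
            emitGauge("ingest/kafka/partitionLag", entry.getValue(),
                    "partition:" + entry.getKey());
            try {
                Thread.sleep(ThreadLocalRandom.current().nextLong(maxDelayMillis + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Hypothetical stand-in for the supervisor's per-partition lag snapshot.
    private Map<Integer, Long> getPartitionLags() {
        Map<Integer, Long> lags = new LinkedHashMap<>();
        for (int p = 0; p < 400; p++) {
            lags.put(p, 0L);
        }
        return lags;
    }

    // Hypothetical stand-in for the StatsDEmitter call.
    private void emitGauge(String metric, long value, String tag) {
        System.out.printf("%s %d %s%n", metric, value, tag);
    }

    public static void main(String[] args) {
        new JitteredPartitionLagEmitter(60_000L).start();
    }
}
```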

github-actions[bot] commented 2 weeks ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.