
Druid Version 0.22.1 - Long Coordinator handoff times #13692

Open acherla opened 1 year ago

acherla commented 1 year ago

Hi Druid team, we manage a cluster with over 3K peons (116 middle managers) and 200+ historical nodes (4-5 million segments), with two masters each configured as a combined coordinator/overlord. We run around 400+ streaming ingest jobs into Druid, and we have noticed that many of our peon tasks get stuck at "Coordinator handoff scheduled - Still waiting for handoff for X segments". This handoff can take anywhere from a few minutes to 30+ minutes depending on the configured completionTimeout period (the job fails, but it's still marked as success in the logs).
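For reference, completionTimeout is the supervisor ioConfig setting mentioned above; a trimmed sketch of a Kafka supervisor spec showing where it lives (the topic and period values here are placeholders, and a real spec also needs a dataSchema and tuningConfig):

{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "topic": "example-topic",
      "taskDuration": "PT1H",
      "completionTimeout": "PT30M"
    }
  }
}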

To optimize our Druid cluster we modified the following parameters, which did help coordinator/overlord performance, but we are still seeing handoff times of 20-30 minutes at times.

  1. Decreased percentOfSegmentsToConsiderPerMove from 70 percent down to 10 percent (the coordinator dynamic config sketch after this list shows where this setting lives).
  2. Modified our tasks to create larger and fewer segments to bring down the overall segment count.
  3. Increased the number of pendingSegment threads on the coordinator/overlord from 1 to 10 to handle handing off multiple segments.
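To reproduce change 1: percentOfSegmentsToConsiderPerMove is part of the coordinator dynamic config (editable from the web console or via the /druid/coordinator/v1/config API); a minimal sketch, where every value other than the 10 percent is an illustrative assumption rather than our exact setting:

{
  "percentOfSegmentsToConsiderPerMove": 10,
  "maxSegmentsToMove": 100,
  "maxSegmentsInNodeLoadingQueue": 500
}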

Some of the optimizations done in 25.0.0 should help with this via the batch segment allocation parameter, but I'm wondering whether, prior to 25.0.0, there are strategies that can be employed to improve segment handoff time on the coordinator.

Second question: even if a task does not complete within the completionTimeout window, I am assuming the segment isn't just dropped, since internally the task is marked as successful in the logs. The segment will likely not be served for queries until the coordinator completes the handoff to a historical (meaning the peon will not serve requests for that segment either, since the task is terminated once the completionTimeout window has elapsed). Is this a correct assumption?

abhishekagarwal87 commented 1 year ago

Wow, that's a pretty impressive scale.

I would definitely recommend upgrading to 25.0. However, that alone won't be sufficient. There are some config changes required to make the best use of the optimizations added in 25.0.

Switching to round-robin segment assignment will definitely help with long handoff times (https://github.com/apache/druid/pull/13367). That strategy is not available in the 24.0 release. You should also turn on batch segment allocation.
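If it helps, round-robin assignment is a coordinator dynamic config flag in 25.0; a minimal sketch of turning it on (flag name taken from that PR, everything else left untouched), to be combined with druid.indexer.tasklock.batchSegmentAllocation=true in the Overlord runtime properties:

{
  "useRoundRobinSegmentAssignment": true
}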

abhishekagarwal87 commented 1 year ago

Also tagging @kfaraz and @AmatyaAvadhanula if they have some more suggestions.

AmatyaAvadhanula commented 1 year ago

  1. Could you also please share how long your coordinator runs take and which duty takes the most time (the coordinator/time metric)?
  2. Which cost balancer strategy does the cluster use, and what is the coordinator dynamic config?
  3. What do the sums of the ingest/segment/count and segment/assigned/count metrics look like over 5-minute intervals?

> Second question: even if a task does not complete within the completionTimeout window, I am assuming the segment isn't just dropped, since internally the task is marked as successful in the logs. The segment will likely not be served for queries until the coordinator completes the handoff to a historical (meaning the peon will not serve requests for that segment either, since the task is terminated once the completionTimeout window has elapsed). Is this a correct assumption?

Yes, the segments aren't dropped and will eventually be served by the historicals.

happy5035 commented 1 year ago

Do you use cachingCost to accelerate balance calculation?

AmatyaAvadhanula commented 1 year ago

cachingCost can help if both the run rules duty (loading/dropping) and balancing are slow. If balancing is significantly slow while run rules are not, segment selection for balancing is likely the culprit. In that case, please try enabling batched segment sampling in the coordinator dynamic config.
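A minimal sketch of that dynamic config change, assuming the useBatchedSegmentSampler flag available in recent releases (the maxSegmentsToMove value is only illustrative):

{
  "useBatchedSegmentSampler": true,
  "maxSegmentsToMove": 100
}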

acherla commented 1 year ago

> cachingCost can help if both the run rules duty (loading/dropping) and balancing are slow. If balancing is significantly slow while run rules are not, segment selection for balancing is likely the culprit. In that case, please try enabling batched segment sampling in the coordinator dynamic config.

cachingCost was configured from the onset; ultimately, lowering percentOfSegmentsToConsiderPerMove yielded the biggest benefit, potentially at the cost of hot spots on the cluster.

Either way we will be upgrading to 25.0.0 by next week as we've seen the performance significantly improve after some of the new optimizations were pushed in.

acherla commented 1 year ago

Re-opening this just to provide updates to folks who are running large Druid clusters; this thread might help them.

So we ran a few performance tests here based on 25.0.0 and we are finding some interesting results.

  1. Segment handoff time did improve; however, enabling chatAsync led to the Overlord getting flooded with requests, resulting in thousands of task failures. We tested this under extreme load (close to 3K peons) running just Kafka indexing jobs. It seems likely that the volume of chat requests was flooding the Overlord, and the Overlord would completely stall out (CPU usage would spike to 80-90% or higher) to the point where it was inaccessible from the management console.

We are currently exploring three options:

  1. Federating the coordinator/overlord onto separate machines. Right now we are running the active/backup pair as a combined coordinator/overlord; going forward we plan to split the overlord and coordinator onto separate machines.
  2. Setting druid.indexer.server.maxChatRequests to 1 to throttle the chattiness of peons toward the overlord (does this potentially have negative consequences, e.g. longer handoff times?). See the sketch after this list.
  3. Setting chatAsync to false, since we noticed a huge number of task failures from the overlord throwing Jetty connection-refused exceptions when it was enabled, potentially due to the volume of chat requests flooding the overlord from the number of peons we are running.
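For reference, a sketch of where options 2 and 3 live, with values we are only experimenting with: the chat request cap is a peon property that we set in the middleManager runtime.properties so it reaches the peons (our assumption about placement), and chatAsync is toggled per supervisor by adding "chatAsync": false to its tuningConfig.

# middleManager/peon runtime.properties (experimental value, not a recommendation)
druid.indexer.server.maxChatRequests=1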

@AmatyaAvadhanula, any other tips you might be able to recommend?

Coordinator Configuration for Reference:

druid.service=druid/coordinator
druid.plaintextPort=8081

compactionTaskSlotRatio=.50

druid.global.http.numConnections=200
druid.global.http.numMaxThreads=200
druid.serverview.type=http
druid.coordinator.loadqueuepeon.http.batchSize=11
druid.coordinator.loadqueuepeon.type=http
druid.coordinator.kill.on=true
druid.coordinator.kill.durationToRetain=P25D
druid.coordinator.kill.maxSegments=10000
druid.coordinator.kill.period=PT2100S
druid.coordinator.kill.pendingSegments.on=true
druid.coordinator.period.indexingPeriod=PT1800S
druid.coordinator.balancer.strategy=cachingCost
druid.coordinator.balancer.cachingCost.awaitInitialization=false
# druid.manager.segments.pollDuration=PT30S
# druid.manager.rules.pollDuration=PT30S
druid.indexer.storage.recentlyFinishedThreshold=PT24H
druid.indexer.runner.maxPercentageBlacklistWorkers=3
druid.supervisor.storeStackTrace=true
druid.supervisor.idleConfig.enabled=false
druid.coordinator.kill.rule.on=false

# Run the overlord service in the coordinator process
druid.coordinator.asOverlord.enabled=true
druid.coordinator.asOverlord.overlordService=druid/overlord
druid.indexer.tasklock.batchSegmentAllocation=true
# druid.indexer.runner.taskAssignmentTimeout=PT15M
druid.indexer.runner.taskCleanupTimeout=PT15M

druid.indexer.queue.startDelay=PT5S

druid.indexer.runner.type=httpRemote
druid.indexer.runner.maxZnodeBytes=10000000

druid.indexer.storage.type=metadata
druid.indexer.logs.type=s3

druid.indexer.logs.kill.enabled=true
druid.indexer.logs.kill.durationToRetain=86400000
druid.indexer.logs.kill.delay=3600000

druid.coordinator.kill.audit.on=true
druid.coordinator.kill.audit.period=PT1H
druid.coordinator.kill.audit.durationToRetain=P1D

druid.indexer.runner.pendingTasksRunnerNumThreads=100
druid.indexer.runner.maxRetriesBeforeBlacklist=99999

druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor","org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.JvmThreadsMonitor","org.apache.druid.server.metrics.EventReceiverFirehoseMonitor","org.apache.druid.server.metrics.TaskCountStatsMonitor"]

Middlemanager/Peon Configuration

druid.service=druid/middleManager
druid.plaintextPort=8091

# Number of tasks per middleManager
druid.worker.capacity={{(ansible_processor_vcpus * 0.40)|round|int}}

# Task launch parameters
druid.indexer.runner.javaOptsArray=["-server", "-javaagent:{{druid_install_dir}}/apache-druid-{{druid_version}}/utils/opentelemetry-javaagent.jar", "-Dotel.resource.attributes=service.name=gulv.org.apache.druid.{{env}}.peon","-Xms1g", "-Xmx16g", "-XX:+UseG1GC", "-XX:MaxDirectMemorySize=32g", "-Daws.region=nswest", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-XX:+ExitOnOutOfMemoryError", "-XX:ParallelGCThreads=2", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
druid.indexer.runner.maxZnodeBytes=10024KiB
druid.indexer.runner.startPort=40001

# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.numMergeBuffers=8
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1000MiB
druid.indexer.fork.property.druid.processing.numThreads=5
druid.indexer.fork.property.druid.emitter=noop

druid.emitter=prometheus
druid.processing.numThreads=5
druid.processing.numMergeBuffers=8

druid.indexer.logs.kill.enabled=true
druid.indexer.logs.kill.durationToRetain=43200000
druid.indexer.logs.kill.delay=3600000

druid.indexer.task.baseTaskDir=/druid/middlemanager/druid/task

# Hadoop indexing
#druid.indexer.task.hadoopWorkingPath=/druid/middlemanager/druid/hadoop-tmp
druid.query.groupBy.maxOnDiskStorage=1000000000000

druid.realtime.cache.useCache=false
druid.realtime.cache.populateCache=false
druid.realtime.cache.maxEntrySize=4000000

druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor","org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.JvmThreadsMonitor","org.apache.druid.server.metrics.EventReceiverFirehoseMonitor"]

abhishekagarwal87 commented 1 year ago

The chat requests go from the Overlord to the workers, so it's surprising that they are causing Overlord CPU to become high. When you see a spike in Overlord CPU usage, can you check through thread dumps where the Overlord threads are spending their time? Maybe it is something else.

We have also seen segment allocation cause significant overhead at that scale. This overhead can be reduced by enabling batch segment allocation. druid.indexer.tasklock.batchSegmentAllocation needs to be set to true on Overlord.
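A minimal sketch of the Overlord runtime properties for this, assuming the companion batchAllocationWaitTime setting is available in your version (the 500 ms value is only the default we assume):

# Overlord runtime.properties
druid.indexer.tasklock.batchSegmentAllocation=true
# How long allocation requests are batched before processing (milliseconds); assumed default shown
druid.indexer.tasklock.batchAllocationWaitTime=500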

acherla commented 1 year ago

> The chat requests go from the Overlord to the workers, so it's surprising that they are causing Overlord CPU to become high. When you see a spike in Overlord CPU usage, can you check through thread dumps where the Overlord threads are spending their time? Maybe it is something else.
>
> We have also seen segment allocation cause significant overhead at that scale. This overhead can be reduced by enabling batch segment allocation. druid.indexer.tasklock.batchSegmentAllocation needs to be set to true on Overlord.

Thanks, we've enabled batchSegmentAllocation on the overlord since the 25.0 upgrade and increased the batch segment allocation interval to 10 seconds. However, we still notice massive spikes in Kafka lag and ingest time, caused by long segment handoff times when the overlord fails to respond as an influx of tasks completes at the same time; in this situation, well over 1K+ Kafka supervisor tasks.

Attached is a copy of the flame graph we took from the overlord during the period when it stalls out.

druid-flame-117758.zip

Edit: To provide more context to the above, the instability is actually caused by having about 100 supervisors (2K peons) that each have a 4-hour task duration (due to long intermediatePersistPeriods). This results in around 35K to 40K segments getting handed off in a very short period of time every 4 hours.
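For anyone tuning a similar setup, both of those knobs sit in the supervisor spec; a trimmed sketch with values that are only illustrative of our rough settings:

{
  "ioConfig": {
    "taskDuration": "PT4H"
  },
  "tuningConfig": {
    "intermediatePersistPeriod": "PT30M"
  }
}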

abhishekagarwal87 commented 1 year ago

Those handoffs shouldn't cause the overlord to stall. Looking at the flame graphs, the supervisor threads seem to be busy on DB calls. I would recommend increasing the number of connections to the DB and seeing if that resolves the issues you are seeing.

IgorBerman commented 1 year ago

@abhishekagarwal87 Hi, I'm a bit late to this thread, but could you please give a reference for how to increase the number of DB connections for metadata storage? (MySQL in my case)

acherla commented 10 months ago

> @abhishekagarwal87 Hi, I'm a bit late to this thread, but could you please give a reference for how to increase the number of DB connections for metadata storage? (MySQL in my case)

Missed this, but here is an example. The documentation on the Druid website is misleading and doesn't really provide accurate instructions for configuring the connection pool using DBCP; frankly, Druid should probably change to HikariCP for connection pooling:

druid.metadata.storage.connector.dbcp.initialSize=2
druid.metadata.storage.connector.dbcp.maxTotal=20
druid.metadata.storage.connector.dbcp.maxIdle=20
druid.metadata.storage.connector.dbcp.minIdle=2
druid.metadata.storage.connector.dbcp.url={{druid_metadata_storage_connector_connectURI}}
druid.metadata.storage.connector.dbcp.username={{druid_metadata_storage_connector_user}}
druid.metadata.storage.connector.dbcp.password={{druid_metadata_storage_connector_password}}
druid.metadata.storage.connector.dbcp.driverClassName=com.mysql.jdbc.Driver

Handoff times are still pretty long in Druid even in 27.0.0 based on our internal tests, and it's not due to DB performance. The longest queries we have are against the druid_segments table from the coordinator, taking 3-5 minutes, but that's largely because we have 4M+ segments in our cluster (it's massive). The problem is just how much metadata Druid stores in the metadata DB when querying the full druid_segments table where a segment's active (used) flag is true, and when running real-time tasks in tandem (which generate a ton of entries in the pendingSegments table). Trimming down the metadata in druid_segments/druid_pendingSegments would go a long way toward improving handoff times. The coordinator also seems to slow down when performing large handoffs.

github-actions[bot] commented 3 days ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.