acherla opened this issue 1 year ago
Wow, that's a pretty impressive scale.
I would definitely recommend upgrading to 25.0. However, that alone won't be sufficient; there are some config changes required to make the best use of the optimizations added in 25.0.
Switching to round-robin segment assignment will definitely help with the long handoff times (https://github.com/apache/druid/pull/13367). That strategy is not available in the 24.0 release. You should also turn on chatAsync (https://github.com/apache/druid/pull/13491/files).
I have seen these settings work well in large-scale clusters, and they will likely be turned on by default in future releases.
Also tagging @kfaraz and @AmatyaAvadhanula in case they have some more suggestions.
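For anyone following along, a minimal sketch of where these two flags live, assuming the names introduced by the linked PRs (verify against your release). Round-robin assignment is a coordinator dynamic config flag, POSTed to /druid/coordinator/v1/config:
{
  "useRoundRobinSegmentAssignment": true
}
while chatAsync sits in the streaming supervisor's tuningConfig, e.g. for a Kafka supervisor:
{
  "type": "kafka",
  "chatAsync": true
}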
What do the coordinator/time metric, ingest/segment/count, and segment/assigned/count look like over intervals of 5 minutes?
2nd question: even if a task does not complete within the completionTimeout window, I am assuming the segment isn't just dropped, since internally in the logs the task is marked as successful, but the segment will likely not be served for queries until the handoff is completed internally by the coordinator -> historical (meaning the peon will not serve requests for that segment either, as the task has been terminated once the completionTimeout window has elapsed). Is this a correct assumption?
Yes, the segments aren't dropped and will eventually be served by the historicals
Do you use cachingCost to accelerate balance calculation?
cachingCost can help if both run rules (loading/dropping) and balancing are slow. If balancing is significantly slow while run rules are not, segment selection for balancing is likely the culprit. Please try using batched segment sampling in the coordinator dynamic config in that case.
cachingCost was configured from the outset; ultimately, lowering percentOfSegmentsToConsiderPerMove yielded the best benefit, potentially at the cost of hot spots on the cluster.
Either way, we will be upgrading to 25.0.0 by next week, as we've seen performance improve significantly after some of the new optimizations were pushed in.
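For readers tuning the same knobs: both batched segment sampling and percentOfSegmentsToConsiderPerMove live in the coordinator dynamic config. A minimal sketch, assuming the flag names from the 24.x/25.0 docs, with illustrative values rather than a recommendation:
{
  "useBatchedSegmentSampler": true,
  "percentOfSegmentsToConsiderPerMove": 25
}
Lowering percentOfSegmentsToConsiderPerMove shrinks the candidate set the balancer scores each cycle, which is why it speeds up balancing but can concentrate moves onto fewer servers (the hot-spot trade-off mentioned above).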
Re-opening this just to provide updates to folks who are running large Druid clusters; this thread might help them.
So we ran a few performance tests here based on 25.0.0 and we are finding some interesting results.
We are currently exploring three options:
@AmatyaAvadhanula, any other tips you might be able to recommend?
Coordinator Configuration for Reference:
druid.service=druid/coordinator
druid.plaintextPort=8081
compactionTaskSlotRatio=.50
druid.global.http.numConnections=200
druid.global.http.numMaxThreads=200
druid.serverview.type=http
druid.coordinator.loadqueuepeon.http.batchSize=11
druid.coordinator.loadqueuepeon.type=http
druid.coordinator.kill.on=true
druid.coordinator.kill.durationToRetain=P25D
druid.coordinator.kill.maxSegments=10000
druid.coordinator.kill.period=PT2100S
druid.coordinator.kill.pendingSegments.on=true
druid.coordinator.period.indexingPeriod=PT1800S
druid.coordinator.balancer.strategy=cachingCost
druid.coordinator.balancer.cachingCost.awaitInitialization=false
# druid.manager.segments.pollDuration=PT30S
# druid.manager.rules.pollDuration=PT30S
druid.indexer.storage.recentlyFinishedThreshold=PT24H
druid.indexer.runner.maxPercentageBlacklistWorkers=3
druid.supervisor.storeStackTrace=true
druid.supervisor.idleConfig.enabled=false
druid.coordinator.kill.rule.on=false
# Run the overlord service in the coordinator process
druid.coordinator.asOverlord.enabled=true
druid.coordinator.asOverlord.overlordService=druid/overlord
druid.indexer.tasklock.batchSegmentAllocation=true
# druid.indexer.runner.taskAssignmentTimeout=PT15M
druid.indexer.runner.taskCleanupTimeout=PT15M
druid.indexer.queue.startDelay=PT5S
druid.indexer.runner.type=httpRemote
druid.indexer.runner.maxZnodeBytes=10000000
druid.indexer.storage.type=metadata
druid.indexer.logs.type=s3
druid.indexer.logs.kill.enabled=true
druid.indexer.logs.kill.durationToRetain=86400000
druid.indexer.logs.kill.delay=3600000
druid.coordinator.kill.audit.on=true
druid.coordinator.kill.audit.period=PT1H
druid.coordinator.kill.audit.durationToRetain=P1D
druid.indexer.runner.pendingTasksRunnerNumThreads=100
druid.indexer.runner.maxRetriesBeforeBlacklist=99999
druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor","org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.JvmThreadsMonitor","org.apache.druid.server.metrics.EventReceiverFirehoseMonitor","org.apache.druid.server.metrics.TaskCountStatsMonitor"]
Middlemanager/Peon Configuration
druid.service=druid/middleManager
druid.plaintextPort=8091
# Number of tasks per middleManager
druid.worker.capacity={{(ansible_processor_vcpus * 0.40)|round|int}}
# Task launch parameters
druid.indexer.runner.javaOptsArray=["-server", "-javaagent:{{druid_install_dir}}/apache-druid-{{druid_version}}/utils/opentelemetry-javaagent.jar", "-Dotel.resource.attributes=service.name=gulv.org.apache.druid.{{env}}.peon","-Xms1g", "-Xmx16g", "-XX:+UseG1GC", "-XX:MaxDirectMemorySize=32g", "-Daws.region=nswest", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-XX:+ExitOnOutOfMemoryError", "-XX:ParallelGCThreads=2", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
druid.indexer.runner.maxZnodeBytes=10024KiB
druid.indexer.runner.startPort=40001
# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.numMergeBuffers=8
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1000MiB
druid.indexer.fork.property.druid.processing.numThreads=5
druid.indexer.fork.property.druid.emitter=noop
druid.emitter=prometheus
druid.processing.numThreads=5
druid.processing.numMergeBuffers=8
druid.indexer.logs.kill.enabled=true
druid.indexer.logs.kill.durationToRetain=43200000
druid.indexer.logs.kill.delay=3600000
druid.indexer.task.baseTaskDir=/druid/middlemanager/druid/task
# Hadoop indexing
#druid.indexer.task.hadoopWorkingPath=/druid/middlemanager/druid/hadoop-tmp
druid.query.groupBy.maxOnDiskStorage=1000000000000
druid.realtime.cache.useCache=false
druid.realtime.cache.populateCache=false
druid.realtime.cache.maxEntrySize=4000000
druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor","org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.JvmThreadsMonitor","org.apache.druid.server.metrics.EventReceiverFirehoseMonitor"]
The chat requests go from the Overlord to the workers, so it is surprising that they would cause Overlord CPU to become high. When you see a spike in Overlord CPU usage, can you check through thread dumps where the Overlord threads are spending their time? Maybe it is something else.
We have also seen segment allocation cause significant overhead at that scale. This overhead can be reduced by enabling batch segment allocation: druid.indexer.tasklock.batchSegmentAllocation needs to be set to true on the Overlord.
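For reference, the corresponding Overlord runtime.properties line, shown with an assumed companion property for the batching interval mentioned in the reply below (verify the exact name against your release):
druid.indexer.tasklock.batchSegmentAllocation=true
# Assumed property controlling how long allocation requests are batched before processing, in milliseconds
druid.indexer.tasklock.batchAllocationWaitTime=500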
Thanks, we've enabled batchSegmentAllocation on the Overlord since the 25.0 upgrade and increased the batch segment allocation interval to 10 seconds. However, we still notice massive spikes in Kafka lag and ingest time caused by segment handoff, with the Overlord failing to respond when an influx of tasks completes at the same time, in this situation well over 1K Kafka supervisor tasks.
Attached is a copy of the flamegraph we took from the Overlord during the period when it stalls out.
Edit: To provide more context to the above, the instability is actually caused by having about 100 supervisors (2K peons) that each have a 4-hour task duration (due to long intermediatePersistPeriods). This results in around 35K to 40K segments getting handed off in a very short period of time every 4 hours.
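For context, the knobs that shape this 4-hour handoff wave live in the supervisor spec; an illustrative fragment with example values only (not the poster's actual settings):
{
  "type": "kafka",
  "ioConfig": {
    "taskDuration": "PT4H",
    "completionTimeout": "PT30M"
  },
  "tuningConfig": {
    "type": "kafka",
    "intermediatePersistPeriod": "PT10M"
  }
}
Because every supervisor rolls its tasks over at taskDuration boundaries, 100 supervisors sharing the same duration end up publishing their segments in the same narrow window.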
Those handoffs shouldn't cause the Overlord to stall. Looking at the flame graphs, the supervisor threads seem to be busy on DB calls. I would recommend increasing the number of connections to the DB and seeing if that resolves the issues you are seeing.
@abhishekagarwal87 Hi, I'm a bit late to this thread, but can you please give a reference for how to increase the number of DB connections for metadata storage? (MySQL in my case)
Missed this, but here is an example. The documentation on the Druid website is misleading and doesn't really provide accurate instructions for configuring the connection pool using DBCP; frankly, Druid should probably switch to HikariCP for connection pooling:
druid.metadata.storage.connector.dbcp.initialSize=2
druid.metadata.storage.connector.dbcp.maxTotal=20
druid.metadata.storage.connector.dbcp.maxIdle=20
druid.metadata.storage.connector.dbcp.minIdle=2
druid.metadata.storage.connector.dbcp.url={{druid_metadata_storage_connector_connectURI}}
druid.metadata.storage.connector.dbcp.username={{druid_metadata_storage_connector_user}}
druid.metadata.storage.connector.dbcp.password={{druid_metadata_storage_connector_password}}
druid.metadata.storage.connector.dbcp.driverClassName=com.mysql.jdbc.Driver
Handoff times are still pretty long in Druid, even in 27.0.0, based on our internal tests, and it's not due to DB performance. The longest queries we have are against the druid_segments table from the coordinator, taking 3-5 minutes, but that's largely because we have 4M+ segments in our cluster (it's massive). The problem is just how much metadata Druid stores in the metadata DB when querying the full druid_segments table for segments with used = true, and when running real-time tasks in tandem (which generate a ton of entries in the pendingSegments table). Trimming down the metadata in druid_segments/druid_pendingSegments would go a long way toward improving handoff times. The coordinator also seems to slow down when performing large handoffs.
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
Hi Druid team, we manage a cluster that has over 3K peons (116 MiddleManagers) and 200+ historical nodes (4-5 million segments), with two masters each configured as a combined coordinator/overlord. We run close to 400+ streaming ingest jobs into Druid, and we noticed that many of our peon tasks are getting stuck at "Coordinator handoff scheduled - Still waiting for handoff for X segments". This handoff can take anywhere from a few minutes to 30+ minutes depending on the configured completionTimeout period (the job fails, but it is still marked as success in the logs).
In order to optimize our Druid cluster, we began modifying the following parameters, which did help the performance of the coordinator/overlord, but we are still seeing handoff times of 20-30 minutes at times.
Some of the optimizations done in 25.0.0 should assist with this via the batch segment allocation parameter, but I'm wondering whether, prior to 25.0.0, there are strategies that can be employed to improve the segment handoff time on the coordinator.
2nd question: even if a task does not complete within the completionTimeout window, I am assuming the segment isn't just dropped, since internally in the logs the task is marked as successful, but the segment will likely not be served for queries until the handoff is completed internally by the coordinator -> historical (meaning the peon will not serve requests for that segment either, as the task has been terminated once the completionTimeout window has elapsed). Is this a correct assumption?