Open hardikbajaj opened 1 month ago
I took some thread dumps while this was happening. From the stack traces, it looks like some kind of deadlock occurs: intermediateTempExecutor
is stuck in WAITING, and we wait up to 365 days for it to shut down.
This is the main task's stack trace for the affected TASK_ID. These were the threads that contain TASK_ID in their name.
Druid indexer tasks sometimes get stuck in PUBLISHING state because executors are not shut down properly.
Affected Version
Druid 25.0.0
Description
We are running a Kafka supervisor ingestion task with task replication set to two.
After PendingCompletionTimeout minutes, this task is forcefully killed. Because the Overlord sees that the task group's completion timeout has passed and the task has not succeeded, it KILLS ACTIVELY READING TASKS.
Why did Task A2 get stuck in PUBLISHING state?
I did some debugging; the following is the probable cause of the task getting stuck.
2 Jul 2024 @ 03:05:56.224 UTC Shutting down immediately... indexer-pod main thread
2 Jul 2024 @ 03:05:56.258 UTC Dropped segment[S0]. indexer-pod [task_id]-appenderator-persist
Preconditions.checkState(
    persistExecutor == null || persistExecutor.awaitTermination(365, TimeUnit.DAYS),
    "persistExecutor not terminated"
);
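To make the effect of that 365-day wait concrete, here is a minimal sketch of a bounded shutdown that falls back to interrupting the workers. It is not Druid's code or a proposed patch: the class `BoundedShutdown`, the method `shutdownWithTimeout`, and the 200 ms timeout are all hypothetical names and values chosen for illustration. It follows the two-phase shutdown pattern from the `ExecutorService` documentation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Druid.
public class BoundedShutdown {

    /**
     * Requests an orderly shutdown and waits up to the given timeout.
     * If tasks are still running, calls shutdownNow() (which interrupts
     * the worker threads) and waits once more.
     * Returns true only if the executor terminated within those waits.
     */
    public static boolean shutdownWithTimeout(ExecutorService exec, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (exec == null) {
            return true; // nothing to shut down, same as the null check in the original snippet
        }
        exec.shutdown();
        if (exec.awaitTermination(timeout, unit)) {
            return true;
        }
        exec.shutdownNow(); // interrupt workers that are still blocked
        return exec.awaitTermination(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService stuck = Executors.newSingleThreadExecutor();
        CountDownLatch never = new CountDownLatch(1);
        // Simulates a worker parked in WAITING: the latch is never counted down.
        stuck.submit(() -> {
            never.await();
            return null;
        });
        boolean terminated = shutdownWithTimeout(stuck, 200, TimeUnit.MILLISECONDS);
        System.out.println("terminated = " + terminated);
    }
}
```

With `awaitTermination(365, TimeUnit.DAYS)` and no fallback, the caller in this simulation would simply block along with the worker. Note that the interrupt fallback only helps when the stuck task responds to interruption; a thread blocked on an intrinsic lock would not be released this way.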
2 Jul 2024 @ 03:40:01.670 UTC Exception caught during execution indexer-pod threading-task-runner-executor-0
java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:156)
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskReports(S3TaskLogs.java:141)
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:223)
    at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:152)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:148)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163)
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153)
    at org.apache.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:101)
    at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:147)
    ... 7 more
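One plausible reading of "interrupted after [0] tries" is that the runner thread already had its interrupt flag set when it tried to push the task reports, so the retry loop gave up before making a single attempt. The sketch below illustrates that behaviour. It is an assumption about the mechanism, not Druid's `RetryUtils`: the class `InterruptAwareRetry` and its `retry` method are hypothetical and only mimic the message seen in the trace.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not Druid's RetryUtils.
public class InterruptAwareRetry {

    /**
     * Calls op until it succeeds or maxTries attempts have failed.
     * Before every attempt it checks the current thread's interrupt flag
     * and aborts if the flag is set, reporting how many tries were made.
     */
    public static <T> T retry(Callable<T> op, int maxTries) throws Exception {
        for (int tries = 0; ; tries++) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException(
                        "Current thread is interrupted after [" + tries + "] tries");
            }
            try {
                return op.call();
            } catch (Exception e) {
                if (tries + 1 >= maxTries) {
                    throw e; // out of attempts, surface the last failure
                }
            }
        }
    }

    public static void main(String[] args) {
        // Simulate a thread that was interrupted before the upload started.
        Thread.currentThread().interrupt();
        try {
            retry(() -> "uploaded", 3);
        } catch (Exception e) {
            System.out.println(e.getMessage());
        } finally {
            Thread.interrupted(); // clear the flag again
        }
    }
}
```

If this reading is right, the failed report upload is a consequence of the task being killed rather than a separate S3 problem.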