drcrallen closed this issue 8 years ago.
The weird thing is that the task launched instantly once the shutdown process started, which makes me wonder whether it had been blocked by the forking task runner's queue being full.
When the middle manager restarted, one task was reporting a weird location, and the number of tasks was greater than the configured capacity of the node:
2016-06-22T19:45:49,304 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - Restored 10 tasks.
2016-06-22T19:45:49,321 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_54_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8100}].
2016-06-22T19:45:49,322 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_53_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8101}].
2016-06-22T19:45:49,322 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_56_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8102}].
2016-06-22T19:45:49,322 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_55_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8103}].
2016-06-22T19:45:49,323 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_50_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8104}].
2016-06-22T19:45:49,323 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_52_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8105}].
2016-06-22T19:45:49,323 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_51_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8106}].
2016-06-22T19:45:49,323 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED2_2016-06-22T16:00:00.000Z_0_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8107}].
2016-06-22T19:45:49,323 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_46_1] location changed to [TaskLocation{host='REDACTED_HOST', port=8108}].
2016-06-22T19:45:49,323 INFO [main] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0] location changed to [TaskLocation{host='null', port=-1}].
2016-06-22T19:45:49,323 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - Registered listener [WorkerTaskMonitor]
The task index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0 has a log retrievable through the overlord, but is marked as failed in the console:
End of log from weird task:
2016-06-22T19:25:46,341 INFO [REDACTED-incremental-persist] io.druid.segment.ReferenceCountingSegment - Closing REDACTED_2016-06-22T18:00:00.000Z_2016-06-22T19:00:00.000Z_2016-06-22T18:15:02.029Z_48
2016-06-22T19:25:46,342 INFO [REDACTED-incremental-persist] io.druid.segment.ReferenceCountingSegment - Closing REDACTED_2016-06-22T18:00:00.000Z_2016-06-22T19:00:00.000Z_2016-06-22T18:15:02.029Z_48, numReferences: 0
2016-06-22T19:25:46,343 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Gracefully stopping.
2016-06-22T19:25:46,343 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Job done!
2016-06-22T19:25:46,343 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0] status changed to [SUCCESS].
2016-06-22T19:25:46,344 INFO [Thread-102] io.druid.indexing.overlord.ThreadPoolTaskRunner - Graceful shutdown of task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0] finished in 519ms.
2016-06-22T19:25:46,344 INFO [Thread-102] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0] status changed to [SUCCESS].
2016-06-22T19:25:46,344 INFO [Thread-102] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@3a8cea24].
2016-06-22T19:25:46,347 INFO [Thread-102] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.announcement.Announcer.stop()] on object[io.druid.curator.announcement.Announcer@5af5d76f].
2016-06-22T19:25:46,347 INFO [Thread-102] io.druid.curator.announcement.Announcer - unannouncing [/druid/compressed/segments/REDACTED_HOST:8102/REDACTED_HOST:8102_realtime__default_tier_2016-06-22T18:15:03.699Z_0b9848bf5398424589e0ca5a9b495a6a0]
2016-06-22T19:25:46,348 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0",
"status" : "SUCCESS",
"duration" : 6038588
}
2016-06-22T19:25:46,355 INFO [main] io.druid.cli.CliPeon - Thread [Thread[Thread-102,5,main]] is non daemon.
2016-06-22T19:25:46,356 INFO [Thread-102] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@67770b37].
2016-06-22T19:25:46,356 INFO [Thread-102] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.http.client.NettyHttpClient.stop()] on object[com.metamx.http.client.NettyHttpClient@574059d5].
2016-06-22T19:25:46,376 INFO [Thread-102] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.metrics.MonitorScheduler.stop()] on object[com.metamx.metrics.MonitorScheduler@36776c32].
2016-06-22T19:25:46,382 INFO [Thread-102] io.druid.curator.CuratorModule - Stopping Curator
2016-06-22T19:25:46,383 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2016-06-22T19:25:46,387 INFO [Thread-102] org.apache.zookeeper.ZooKeeper - Session: 0xe5559cc045d3c6f closed
2016-06-22T19:25:46,387 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0xe5559cc045d3c6f
2016-06-22T19:25:46,388 INFO [Thread-102] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[com.metamx.emitter.service.ServiceEmitter@413bef78].
2016-06-22T19:25:46,391 INFO [Thread-102] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@3f63a513].
The only way to get the middle manager responsive again was to delete the task directory for index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0 and restart the middle manager. Then the middle manager reported:
2016-06-22T19:48:13,250 WARN [main] io.druid.indexing.overlord.ForkingTaskRunner - Failed to restore task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0]. Trying to restore other tasks.
java.io.FileNotFoundException: /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0/task.json (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_91]
at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_91]
at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_91]
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:708) ~[druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2115) ~[druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at io.druid.indexing.overlord.ForkingTaskRunner.restore(ForkingTaskRunner.java:154) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at io.druid.indexing.worker.WorkerTaskMonitor.restoreRestorableTasks(WorkerTaskMonitor.java:169) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at io.druid.indexing.worker.WorkerTaskMonitor.start(WorkerTaskMonitor.java:108) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]
at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:350) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at com.metamx.common.lifecycle.Lifecycle.start(Lifecycle.java:259) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:155) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:91) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at io.druid.cli.ServerRunnable.run(ServerRunnable.java:40) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
at io.druid.cli.Main.main(Main.java:105) [druid-selfcontained-0.9.1-rc3-mmx0.jar:0.9.1-rc3-mmx0]
2016-06-22T19:48:13,254 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - Restored 9 tasks.
And continued as expected.
It is unknown at this time whether this is a regression.
I filed https://github.com/druid-io/druid/pull/3172 to make it easier to tell what is holding the worker threads.
I forgot to mention in the original comment that during the shutdown, the truant task launched, as shown in the middle manager logs:
2016-06-22T19:25:45,719 INFO [Thread-68] com.metamx.common.lifecycle.Lifecycle - Running shutdown hook
2016-06-22T19:25:45,723 INFO [Thread-68] org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@4aa11206{HTTP/1.1}{0.0.0.0:8080}
2016-06-22T19:25:45,725 INFO [Thread-68] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@6d2a2560{/,null,UNAVAILABLE}
2016-06-22T19:25:45,730 INFO [Thread-68] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.worker.WorkerTaskMonitor.stop() throws java.lang.InterruptedException] on object[io.druid.indexing.worker.WorkerTaskMonitor@1dba4e06].
2016-06-22T19:25:45,731 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Unregistered listener [WorkerTaskMonitor]
2016-06-22T19:25:45,731 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - WorkerTaskMonitor interrupted, exiting.
2016-06-22T19:25:45,731 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_54_0].
2016-06-22T19:25:45,732 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_53_0].
2016-06-22T19:25:45,732 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_55_0].
2016-06-22T19:25:45,732 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_50_0].
2016-06-22T19:25:45,733 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_52_0].
2016-06-22T19:25:45,733 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_51_0].
2016-06-22T19:25:45,733 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED2_2016-06-22T16:00:00.000Z_0_0].
2016-06-22T19:25:45,734 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_46_1].
2016-06-22T19:25:45,734 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0].
2016-06-22T19:25:45,735 INFO [Thread-68] io.druid.indexing.overlord.ForkingTaskRunner - Waiting up to 300,000ms for shutdown.
2016-06-22T19:25:46,389 INFO [forking-task-runner-8] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED_2016-06-22T18:00:00.000Z_54_0
2016-06-22T19:25:46,390 INFO [forking-task-runner-8] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_54_0/log to: prod/logs/v1/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_54_0/log
2016-06-22T19:25:46,415 INFO [forking-task-runner-5] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED_2016-06-22T18:00:00.000Z_55_0
2016-06-22T19:25:46,416 INFO [forking-task-runner-5] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_55_0/log to: prod/logs/v1/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_55_0/log
2016-06-22T19:25:46,417 INFO [forking-task-runner-2] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED_2016-06-22T18:00:00.000Z_50_0
2016-06-22T19:25:46,418 INFO [forking-task-runner-2] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_50_0/log to: prod/logs/v1/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_50_0/log
2016-06-22T19:25:46,423 INFO [forking-task-runner-0] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED_2016-06-22T18:00:00.000Z_51_0
2016-06-22T19:25:46,424 INFO [forking-task-runner-0] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_51_0/log to: prod/logs/v1/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_51_0/log
2016-06-22T19:25:46,437 INFO [forking-task-runner-7] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED_2016-06-22T18:00:00.000Z_53_0
2016-06-22T19:25:46,438 INFO [forking-task-runner-7] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_53_0/log to: prod/logs/v1/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_53_0/log
2016-06-22T19:25:46,441 INFO [forking-task-runner-1] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED_2016-06-22T18:00:00.000Z_46_1
2016-06-22T19:25:46,442 INFO [forking-task-runner-1] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_46_1/log to: prod/logs/v1/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_46_1/log
2016-06-22T19:25:46,445 INFO [forking-task-runner-6] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED_2016-06-22T18:00:00.000Z_52_0
2016-06-22T19:25:46,445 INFO [forking-task-runner-6] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_52_0/log to: prod/logs/v1/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_52_0/log
2016-06-22T19:25:46,559 INFO [forking-task-runner-2] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_50_0] status changed to [FAILED].
2016-06-22T19:25:46,561 INFO [forking-task-runner-2] io.druid.indexing.overlord.ForkingTaskRunner - Running command: java -cp conf/:lib/druid-selfcontained-0.9.1-rc3-mmx0.jar: REDACTED io.druid.cli.Main internal peon /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_56_0/task.json /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_56_0/972865ff-d5bb-4f03-ba7f-57bff6e0d2ac/status.json --nodeType realtime
2016-06-22T19:25:46,562 INFO [forking-task-runner-2] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_56_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8103}].
2016-06-22T19:25:46,562 INFO [forking-task-runner-2] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_56_0] status changed to [RUNNING].
2016-06-22T19:25:46,562 INFO [forking-task-runner-2] io.druid.indexing.overlord.ForkingTaskRunner - Logging task index_realtime_REDACTED_2016-06-22T18:00:00.000Z_56_0 output to: /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_56_0/log
2016-06-22T19:25:46,595 INFO [forking-task-runner-5] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_55_0] status changed to [FAILED].
2016-06-22T19:25:46,611 INFO [forking-task-runner-7] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_53_0] status changed to [FAILED].
2016-06-22T19:25:46,619 INFO [forking-task-runner-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_51_0] status changed to [FAILED].
2016-06-22T19:25:46,625 INFO [forking-task-runner-1] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_46_1] status changed to [FAILED].
2016-06-22T19:25:46,638 INFO [forking-task-runner-8] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_54_0] status changed to [FAILED].
2016-06-22T19:25:46,670 INFO [forking-task-runner-3] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0
2016-06-22T19:25:46,670 INFO [forking-task-runner-3] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0/log to: prod/logs/v1/index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0/log
2016-06-22T19:25:46,710 INFO [forking-task-runner-6] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_52_0] status changed to [FAILED].
2016-06-22T19:25:46,784 INFO [forking-task-runner-3] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_2016-06-22T18:00:00.000Z_48_0] status changed to [FAILED].
2016-06-22T19:25:48,837 INFO [forking-task-runner-4] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_REDACTED2_2016-06-22T16:00:00.000Z_0_0
2016-06-22T19:25:48,838 INFO [forking-task-runner-4] io.druid.storage.s3.S3TaskLogs - Pushing task log /mnt/persistent/task/index_realtime_REDACTED2_2016-06-22T16:00:00.000Z_0_0/log to: prod/logs/v1/index_realtime_REDACTED2_2016-06-22T16:00:00.000Z_0_0/log
Task launches on the middle manager in question, per the overlord logs:
2016-06-22T15:55:02,164
2016-06-22T17:45:03,146
2016-06-22T17:45:03,176
2016-06-22T17:45:03,244
2016-06-22T17:45:03,277
2016-06-22T17:45:03,310
2016-06-22T17:45:03,342
2016-06-22T17:45:03,362
2016-06-22T17:45:03,404
2016-06-22T17:45:03,416 <--- _56_0
There weren't any overlord leadership changes or middle manager restarts in that time. And that's one more task than the capacity of that node.
At 2016-06-22T17:45:00,077 the MM reported one task assigned. Then at 2016-06-22T17:50:00,080 it reported 10.
Might be related to https://github.com/druid-io/druid/pull/2521
Filtering overlord logs to just the worker of interest yields the following:
2016-06-22T17:45:03,146 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_46_1]
2016-06-22T17:45:03,150 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_46_1 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,160 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_46_1] on [TaskLocation{host='null', port=-1}]
2016-06-22T17:45:03,168 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_46_1] on [TaskLocation{host='REDACTED_HOST', port=8100}]
2016-06-22T17:45:03,169 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_46_1] location changed to [TaskLocation{host='REDACTED_HOST', port=8100}].
2016-06-22T17:45:03,176 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_48_0]
2016-06-22T17:45:03,180 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_48_0 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,189 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_48_0] on [TaskLocation{host='null', port=-1}]
2016-06-22T17:45:03,193 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_48_0] on [TaskLocation{host='REDACTED_HOST', port=8102}]
2016-06-22T17:45:03,193 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_48_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8102}].
2016-06-22T17:45:03,244 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_50_0]
2016-06-22T17:45:03,249 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_50_0 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,256 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_50_0] on [TaskLocation{host='null', port=-1}]
2016-06-22T17:45:03,262 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_50_0] on [TaskLocation{host='REDACTED_HOST', port=8103}]
2016-06-22T17:45:03,262 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_50_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8103}].
2016-06-22T17:45:03,277 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_52_0]
2016-06-22T17:45:03,282 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_52_0 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,289 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_52_0] on [TaskLocation{host='null', port=-1}]
2016-06-22T17:45:03,296 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_52_0] on [TaskLocation{host='REDACTED_HOST', port=8104}]
2016-06-22T17:45:03,296 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_52_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8104}].
2016-06-22T17:45:03,310 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_53_0]
2016-06-22T17:45:03,315 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_53_0 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,321 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_53_0] on [TaskLocation{host='null', port=-1}]
2016-06-22T17:45:03,331 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_53_0] on [TaskLocation{host='REDACTED_HOST', port=8105}]
2016-06-22T17:45:03,331 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_53_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8105}].
2016-06-22T17:45:03,342 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_54_0]
2016-06-22T17:45:03,346 INFO [rtr-pending-tasks-runner-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_54_0 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,353 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_54_0] on [TaskLocation{host='null', port=-1}]
2016-06-22T17:45:03,359 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_54_0] on [TaskLocation{host='REDACTED_HOST', port=8106}]
2016-06-22T17:45:03,359 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_54_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8106}].
2016-06-22T17:45:03,362 INFO [rtr-pending-tasks-runner-2] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_51_0]
2016-06-22T17:45:03,366 INFO [rtr-pending-tasks-runner-2] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_51_0 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,373 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_51_0] on [TaskLocation{host='null', port=-1}]
2016-06-22T17:45:03,398 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_51_0] on [TaskLocation{host='REDACTED_HOST', port=8107}]
2016-06-22T17:45:03,398 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_51_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8107}].
2016-06-22T17:45:03,404 INFO [rtr-pending-tasks-runner-2] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_55_0]
2016-06-22T17:45:03,408 INFO [rtr-pending-tasks-runner-2] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_55_0 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,416 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_55_0] on [TaskLocation{host='null', port=-1}]
2016-06-22T17:45:03,416 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_56_0]
2016-06-22T17:45:03,421 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_56_0 switched from pending to running (on [REDACTED_HOST])
2016-06-22T17:45:03,426 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_55_0] on [TaskLocation{host='REDACTED_HOST', port=8108}]
2016-06-22T17:45:03,426 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_55_0] location changed to [TaskLocation{host='REDACTED_HOST', port=8108}].
2016-06-22T17:45:03,428 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[REDACTED_HOST] wrote RUNNING status for task [index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_56_0] on [TaskLocation{host='null', port=-1}]
It is worth noting that the node that went over capacity somehow violated the normal logging flow. The entry
Coordinator asking Worker[REDACTED_HOST] to add task[index_realtime_REDACTED_DATASOURCE_2016-06-22T18:00:00.000Z_56_0]
happened a few ms BEFORE the log line showing the new location of 55_0. I'm not sure whether this is expected, but it looks like assignment to the worker is supposed to be blocked during that window. Digging more.
Also note that the RTR pending-task launches came from different threads.
@gianm note that tranquility does NOT like the fact that a task is "RUNNING" but has a null host and a port of -1.
The worker select strategy is responsible for selecting a worker that has an available slot to run the task. WorkerSelectStrategy works on an immutable view of the cluster passed to it by the RTR, so with multiple assignment threads it is possible for the select strategy to choose an incorrect worker.
One possible way to fix this would be to add a capacity check in the RTR's announceTask(), protected by the statusLock, with a retry on failure; see the sketch below.
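A minimal, self-contained sketch of that proposal, not the actual RemoteTaskRunner code; the names here (statusLock, runningTasks, tryAnnounce) are hypothetical stand-ins for RTR state. The point is that the capacity check and the assignment happen under the same lock, so they are atomic with respect to other assignment threads:

```java
// Hypothetical sketch of a capacity check protected by the status lock.
// Not Druid code; all names are stand-ins.
class GuardedAssigner
{
  private final Object statusLock = new Object();
  private final int capacity;
  private int runningTasks = 0;

  GuardedAssigner(int capacity)
  {
    this.capacity = capacity;
  }

  /** Returns false to mean "selection was stale; retry with another worker". */
  boolean tryAnnounce(String taskId)
  {
    synchronized (statusLock) {
      if (runningTasks >= capacity) {
        return false;
      }
      runningTasks++;
      // ... announce task[taskId] on the worker's ZK path and wait for the
      // RUNNING acknowledgement here ...
      return true;
    }
  }
}
```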
I think I found it:
The RTR has a guard against multiple threads trying to launch on the same worker, via workersWithUnacknowledgedTask.putIfAbsent(immutableZkWorker.get().getWorker().getHost(), task.getId()). But this is enforced AFTER the check for space on the workers through strategy.findWorkerForTask.
As such, it is possible for N tasks (where N is the number of launching threads) to all think that worker W has capacity, and for ALL of them to pass the check if exactly the wrong kind of race occurs. This means a worker can be over-subscribed by up to N-1 tasks.
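To make the interleaving concrete, here is a toy, self-contained reproduction of the race; this is an illustration only, not Druid code. Each thread stands in for one pending-tasks runner, the load snapshot stands in for the immutable cluster view used by strategy.findWorkerForTask, and the map stands in for the workersWithUnacknowledgedTask guard, which serializes announcements but never re-checks capacity:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class OverCapacityRaceDemo
{
  static final int CAPACITY = 1;  // one free slot left on worker "W"
  static final int N_THREADS = 3; // pendingTasksRunnerNumThreads analogue
  static final AtomicInteger running = new AtomicInteger(0);
  static final ConcurrentMap<String, String> unacknowledged = new ConcurrentHashMap<>();

  public static void main(String[] args) throws InterruptedException
  {
    CountDownLatch allSnapshotted = new CountDownLatch(N_THREADS);
    Thread[] threads = new Thread[N_THREADS];
    for (int i = 0; i < N_THREADS; i++) {
      final String taskId = "task_" + i;
      threads[i] = new Thread(() -> {
        int snapshot = running.get(); // capacity check on an immutable view
        allSnapshotted.countDown();
        try {
          allSnapshotted.await();     // widen the race window for the demo
        }
        catch (InterruptedException e) {
          return;
        }
        if (snapshot < CAPACITY) {    // every thread passes: the snapshot is stale
          // The guard only prevents concurrent unacknowledged launches on the
          // same worker; it does not re-check capacity.
          while (unacknowledged.putIfAbsent("W", taskId) != null) {
            Thread.yield();           // wait for the in-flight announcement
          }
          running.incrementAndGet();  // announce; worker may now be over capacity
          unacknowledged.remove("W"); // acknowledged -> guard released
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    // Prints 3 tasks against a capacity of 1: over-subscribed by N-1.
    System.out.println("tasks on W: " + running.get() + " / capacity " + CAPACITY);
  }
}
```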
I'm setting druid.indexer.runner.pendingTasksRunnerNumThreads=1 for the overlord to ensure this does not occur.
@drcrallen shall we move this to 0.9.2 given that #3184 works around it?
oops, move to 0.9.2, I mean.
ah, thanks
General consensus seems to be to leave a complete fix for 0.9.2; re-assigning milestone.
Given that this is the second regression introduced by https://github.com/druid-io/druid/pull/2521, would it make sense to have another review of that PR before we declare it stable?
@drcrallen thanks for detailed investigation.
The race described here should be fixable by #3205; however, the likelihood of this race surfacing in practice seems very low, and I'm wondering if there is something else that we are potentially missing.
We have been running #2521 cherry-picked onto our internal Druid 0.8.2 version, and will continue to monitor whether this race happens there.
@himanshug it surfaced in practice while we were qualifying 0.9.1
@drcrallen I agree that you noticed a worker running over capacity, but I am just trying to think of other things that could cause the issue besides the race described.
@gianm can you confirm that when the announceTask(..) method exits, zkWorkers is guaranteed to have an updated view of that worker running the new task? I think it is, but just confirming.
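For context, here is a minimal sketch of the acknowledgement pattern the question refers to; this reflects my understanding of the behavior (the announcing thread waits on the status lock until the Curator listener records the worker's RUNNING status, subject to a timeout), not the actual RemoteTaskRunner code, and all names are stand-ins:

```java
// Hypothetical stand-in for the announce/acknowledge handshake.
class AckWaiter
{
  private final Object statusLock = new Object();
  private boolean acked = false; // guarded by statusLock

  /** Curator PathChildrenCache listener analogue: worker wrote RUNNING. */
  void onWorkerWroteRunningStatus()
  {
    synchronized (statusLock) {
      acked = true;
      statusLock.notifyAll();
    }
  }

  /** announceTask analogue: returns only once the update is visible, or times out. */
  boolean awaitAck(long timeoutMillis) throws InterruptedException
  {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    synchronized (statusLock) {
      while (!acked) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return false; // taskAssignmentTimeout analogue
        }
        statusLock.wait(remaining);
      }
      return true;
    }
  }
}
```

If announceTask exits only after such an acknowledgement, then by that point zkWorkers would indeed reflect the new task.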
fixed by #3205
The middle manager logs before sending a SIGINT look like this, with the shutdown handler appearing as the last entry:
Before firing the shutdown handler, a thread dump was taken; here are the threads (very boring repeated sections snipped for brevity):
((more in comments))