OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

Configurable Scanner timeout to avoid OOM errors in TSDB #289

Open · jarajapu opened this issue 10 years ago

jarajapu commented 10 years ago

While running a load test on the TSDB that involved a moderate volume of calls to the /api/query endpoint (1-2 calls/sec) for a couple of hours, we noticed that the JVM running TSDB threw an OOM and became extremely slow.

We took a heap dump and, on analyzing the profile, noticed several large data sets in memory (500K - 1.5M rows) tied to queries that asyncHBase was taking as long as 600-700 seconds to process.

Here is a stack trace:

19:09:20.370 WARN [HashedWheelTimer.warn] - An exception was thrown by TimerTask.
java.lang.OutOfMemoryError: Java heap space
    at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:121) ~[netty-3.6.2.Final.jar:na]
    at org.hbase.async.HBaseRpc.newBuffer(HBaseRpc.java:685) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.MultiAction.serializeOld(MultiAction.java:254) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.MultiAction.serialize(MultiAction.java:210) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.encode(RegionClient.java:1144) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.sendRpc(RegionClient.java:892) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.periodicFlush(RegionClient.java:257) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.access$000(RegionClient.java:89) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient$1.run(RegionClient.java:195) ~[asynchbase-1.5.0.jar:d543609]
    at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:529) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.util.HashedWheelTimer$Worker.notifyExpiredTimeouts(HashedWheelTimer.java:440) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:379) ~[netty-3.6.2.Final.jar:na]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
19:09:25.745 WARN [HashedWheelTimer.warn] - An exception was thrown by TimerTask.
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:121) ~[netty-3.6.2.Final.jar:na]
    at org.hbase.async.HBaseRpc.newBuffer(HBaseRpc.java:685) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.MultiAction.serializeOld(MultiAction.java:254) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.MultiAction.serialize(MultiAction.java:210) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.encode(RegionClient.java:1144) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.sendRpc(RegionClient.java:892) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.periodicFlush(RegionClient.java:257) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.access$000(RegionClient.java:89) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient$1.run(RegionClient.java:195) ~[asynchbase-1.5.0.jar:d543609]
    at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:529) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.util.HashedWheelTimer$Worker.notifyExpiredTimeouts(HashedWheelTimer.java:440) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:379) ~[netty-3.6.2.Final.jar:na]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
19:09:29.556 WARN [AbstractNioSelector.warn] - Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:80) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[netty-3.6.2.Final.jar:na]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
19:09:30.822 WARN [HashedWheelTimer.warn] - An exception was thrown by TimerTask.
java.lang.OutOfMemoryError: Java heap space
    at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:121) ~[netty-3.6.2.Final.jar:na]
    at org.hbase.async.HBaseRpc.newBuffer(HBaseRpc.java:685) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.MultiAction.serializeOld(MultiAction.java:254) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.MultiAction.serialize(MultiAction.java:210) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.encode(RegionClient.java:1144) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.sendRpc(RegionClient.java:892) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.periodicFlush(RegionClient.java:257) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.access$000(RegionClient.java:89) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient$1.run(RegionClient.java:195) ~[asynchbase-1.5.0.jar:d543609]
    at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:529) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.util.HashedWheelTimer$Worker.notifyExpiredTimeouts(HashedWheelTimer.java:440) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:379) ~[netty-3.6.2.Final.jar:na]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
19:09:30.823 ERROR [RegionClient.decode] - Invalid rpcid: 2634753 found in ReplayingDecoderBuffer(ridx=13971, widx=407148)=(failed to extract content of buffer of type org.jboss.netty.handler.codec.replay.ReplayingDecoderBuffer)
19:09:31.394 ERROR [RegionClient.exceptionCaught] - Unexpected exception from downstream on [id: 0x57b68366, /10.20.73.94:58046 => /10.20.66.10:60020]
org.hbase.async.NonRecoverableException: Invalid rpcid: 2634753 found in ReplayingDecoderBuffer(ridx=13971, widx=407148)=(failed to extract content of buffer of type org.jboss.netty.handler.codec.replay.ReplayingDecoderBuffer)
    at org.hbase.async.RegionClient.decode(RegionClient.java:1290) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.RegionClient.decode(RegionClient.java:89) ~[asynchbase-1.5.0.jar:d543609]
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:485) ~[netty-3.6.2.Final.jar:na]
    at org.hbase.async.RegionClient.handleUpstream(RegionClient.java:1080) ~[asynchbase-1.5.0.jar:d543609]
    at org.hbase.async.HBaseClient$RegionClientPipeline.sendUpstream(HBaseClient.java:2652) ~[asynchbase-1.5.0.jar:d543609]
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88) ~[netty-3.6.2.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[netty-3.6.2.Final.jar:na]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
19:09:31.400 INFO [HBaseClient.removeClientFromCache] - Lost connection with the -ROOT- region

jmangs commented 10 years ago

A little more information on this:

Basically, we had a Ruby script running against our DB looking for metrics to clean up, issuing 8 concurrent requests to OpenTSDB. We managed to hit a few metrics that have millions of data points; for those, OpenTSDB keeps retrieving data from HBase even after the query has already been running for 500-1000 seconds.

The VM that this node was on ran out of memory, so random parts started throwing OOM errors without actually crashing the Deferred queries; we had 8 queries, each occupying about 400-500 MB of JVM heap. In our case, we'd rather have these queries fail so we don't tie up all of our workers on queries that will never finish in a reasonable time.

What I did was add a configurable timeout property that errors out ScannerCB() after a certain amount of time. I'll be sending a PR with the feature; I think it's a nice one to have. It defaults to waiting indefinitely when the timeout property is set to -1 (more details in the actual PR once it's sent).
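For readers skimming the thread, here is a minimal, self-contained sketch of the pattern being described, not the actual PR code: the class, method, and exception names below are illustrative only, and the real change hooks into OpenTSDB's scanner callbacks rather than a standalone class. The idea is simply that a self-rescheduling scan callback checks elapsed time against a configurable budget and fails the query, with -1 preserving the old "wait forever" behaviour.

import java.util.Arrays;
import java.util.List;

public class DeadlineScanSketch {

  /** Thrown when a scan exceeds its configured time budget (name is hypothetical). */
  static class ScanTimeoutException extends RuntimeException {
    ScanTimeoutException(final String msg) {
      super(msg);
    }
  }

  private final long timeoutMillis;               // -1 keeps the old behaviour: wait forever
  private final long startNanos = System.nanoTime();

  DeadlineScanSketch(final long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Called once per batch of rows, mirroring a scanner callback loop. */
  List<String> onNextRows(final List<String> rows) {
    if (timeoutMillis > 0) {
      final long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
      if (elapsedMillis > timeoutMillis) {
        // Fail the query instead of buffering rows until the heap fills up.
        throw new ScanTimeoutException("Scanner exceeded " + timeoutMillis
            + " ms (elapsed: " + elapsedMillis + " ms)");
      }
    }
    return rows;  // the real callback would accumulate/process the rows here
  }

  public static void main(final String[] args) throws InterruptedException {
    final DeadlineScanSketch scan = new DeadlineScanSketch(50);  // 50 ms budget
    final List<String> batch = Arrays.asList("row1", "row2");
    System.out.println("first batch ok: " + scan.onNextRows(batch));
    Thread.sleep(100);  // simulate a slow region server / huge metric
    scan.onNextRows(batch);  // throws ScanTimeoutException
  }
}

Failing fast like this frees the worker and lets the error propagate back through the Deferred chain to the HTTP caller, instead of letting the query accumulate rows until the JVM OOMs.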

manolama commented 10 years ago

Great fix for this. I'll pull #290 into 2.1.

dqminh commented 7 years ago

@manolama I was investigating why the timeout doesn't seem to have any effect on our system.

We use salting, and it looks like the timeout is only applied to the non-salted scanner (i.e. the one in TsdbQuery), not the one in SaltScanner. Should this be fixed?
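For illustration only (this is not OpenTSDB's SaltScanner code; the thread pool, bucket count, and sleep times below are assumptions), extending the same idea to the salted case would mean giving all the per-salt-bucket scans one shared deadline and cancelling the remaining buckets once it expires, rather than only enforcing the budget in the single TsdbQuery scanner:

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SaltedScanDeadlineSketch {

  public static void main(final String[] args) throws Exception {
    final int saltBuckets = 4;        // illustrative bucket count
    final long timeoutMillis = 200;   // shared budget for the whole salted query
    final ExecutorService pool = Executors.newFixedThreadPool(saltBuckets);

    // One "scan" per salt bucket; each must respect the same overall deadline.
    final List<Future<String>> scans = IntStream.range(0, saltBuckets)
        .mapToObj(bucket -> pool.submit((Callable<String>) () -> {
          Thread.sleep(150L * (bucket + 1));  // simulate uneven bucket sizes
          return "bucket-" + bucket + " done";
        }))
        .collect(Collectors.toList());

    final long deadline = System.currentTimeMillis() + timeoutMillis;
    try {
      for (final Future<String> scan : scans) {
        final long remaining = deadline - System.currentTimeMillis();
        // Fail the whole salted query once the shared budget is spent,
        // instead of letting one slow bucket run forever.
        System.out.println(scan.get(Math.max(remaining, 0), TimeUnit.MILLISECONDS));
      }
    } catch (final TimeoutException e) {
      scans.forEach(f -> f.cancel(true));
      System.out.println("salted query timed out after " + timeoutMillis + " ms");
    } finally {
      pool.shutdownNow();
    }
  }
}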