OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

OpenTSDB import fails with RegionTooBusyException #878

Open HariSekhon opened 7 years ago

HariSekhon commented 7 years ago

When bulk loading to OpenTSDB on HBase I consistently get a RegionTooBusyException. I've checked #757, but we've already resolved that bug: I upgraded our clusters to HDP 2.5, which contains the HBase patch, and we've verified that the .tmp data volume doesn't increase.

I think the correct fix would be for the tsdb import bulk loader to catch this exception and retry with exponential backoff, similar to the solution accepted for #867.

It currently looks like it retries 4 times in immediate succession (the timestamps are 1 millisecond apart), which doesn't give HBase enough time to clear its backlog. A sketch of the kind of backoff I mean follows the stack trace below.

ERROR [AsyncHBase I/O Worker #2] CompactionQueue: Failed to delete a row to re-compact
org.hbase.async.RemoteException: org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, regionName=tsdbDebugHari,,<region>, server=<fqdn>,16020,1475755417456, memstoreSize=2019989080, blockingMemStoreSize=536870912
        at org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:3750)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2924)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2875)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:715)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:677)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2054)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32303)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
        at java.lang.Thread.run(Thread.java:745)

        at org.hbase.async.RegionClient.makeException(RegionClient.java:1738) [asynchbase-1.7.1.jar:na]
        at org.hbase.async.RegionClient.decodeExceptionPair(RegionClient.java:1772) [asynchbase-1.7.1.jar:na]
        at org.hbase.async.MultiAction.deserialize(MultiAction.java:615) ~[asynchbase-1.7.1.jar:na]
        at org.hbase.async.RegionClient.decode(RegionClient.java:1480) [asynchbase-1.7.1.jar:na]
        at org.hbase.async.RegionClient.decode(RegionClient.java:88) [asynchbase-1.7.1.jar:na]
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:485) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.9.4.Final.jar:na]
        at org.hbase.async.RegionClient.handleUpstream(RegionClient.java:1206) [asynchbase-1.7.1.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.SimpleChannelHandler.messageReceived(SimpleChannelHandler.java:142) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.handler.timeout.IdleStateAwareChannelHandler.handleUpstream(IdleStateAwareChannelHandler.java:36) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.handler.timeout.IdleStateHandler.messageReceived(IdleStateHandler.java:294) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.9.4.Final.jar:na]
        at org.hbase.async.HBaseClient$RegionClientPipeline.sendUpstream(HBaseClient.java:3108) [asynchbase-1.7.1.jar:na]
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.9.4.Final.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_95]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_95]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]
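
For illustration, here's a minimal sketch of the kind of backoff loop I have in mind. This is not OpenTSDB or AsyncHBase code: importBatch() and isRegionTooBusy() are hypothetical stand-ins for the loader's write path and exception check, and the delay constants are guesses.

```java
// Hypothetical sketch (not OpenTSDB code): retry a failed batch with
// exponential backoff instead of the current ~4 immediate retries.
import java.util.concurrent.ThreadLocalRandom;

public class BackoffRetrySketch {
  static final int MAX_ATTEMPTS = 8;        // assumed retry budget
  static final long BASE_DELAY_MS = 500;    // assumed initial delay
  static final long MAX_DELAY_MS = 60_000;  // cap so delays stay bounded

  /** Runs the batch, sleeping BASE * 2^attempt (plus jitter) between failures. */
  static void runWithBackoff(Runnable importBatch) throws InterruptedException {
    for (int attempt = 0; ; attempt++) {
      try {
        importBatch.run();
        return;  // batch written successfully
      } catch (RuntimeException e) {
        // In the real loader this would test for
        // org.apache.hadoop.hbase.RegionTooBusyException wrapped in an
        // org.hbase.async.RemoteException; simplified here.
        if (attempt + 1 >= MAX_ATTEMPTS || !isRegionTooBusy(e)) {
          throw e;  // give up: rethrow so the import still fails loudly
        }
        long delay = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << attempt);
        // Add jitter so many parallel writers don't retry in lock-step.
        delay += ThreadLocalRandom.current().nextLong(delay / 2 + 1);
        Thread.sleep(delay);
      }
    }
  }

  /** Walks the cause chain looking for a RegionTooBusyException by name. */
  static boolean isRegionTooBusy(Throwable e) {
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (t.getClass().getName().endsWith("RegionTooBusyException")) {
        return true;
      }
    }
    return false;
  }
}
```

Capping the delay and adding jitter keeps many parallel importers from hammering the same overloaded region in lock-step while its memstore flushes.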
manolama commented 7 years ago

Yeah, we definitely need that back-off code in there. One of our engineers wrote something for AsyncHBase that we may be able to co-opt for this purpose.

manolama commented 7 years ago

We'll get this in 2.3.1 most likely. It'll need some testing.

degremont commented 6 years ago

👍

I'm trying to bulk import a lot of data as part of an OpenTSDB migration, and I'm hitting this error constantly. Given the volume, I'm not surprised HBase can't keep up and throws this exception while compacting and so on. But each time it does, tsdb import doesn't handle it and just exits with an error, failing the whole import. This is a pain to manage.

Please handle this exception! I don't even need exponential backoff; just a simple wait-and-retry loop would do the trick (see the sketch below). It would save the whole import and make this big OpenTSDB migration much easier to manage.
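
For comparison with the backoff sketch above, the fixed-interval version is even simpler. Again purely illustrative: importBatch() is a hypothetical stand-in for the loader's write path, and the wait and attempt limits are assumptions, not tested values.

```java
// Hypothetical sketch (not OpenTSDB code): the plain wait-and-retry
// loop described above. importBatch() is an illustrative stand-in.
public class FixedRetrySketch {
  static final long WAIT_MS = 10_000;  // assumed: enough time for a memstore flush
  static final int MAX_ATTEMPTS = 30;  // assumed: ~5 minutes before giving up

  static void runWithFixedRetry(Runnable importBatch) throws InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        importBatch.run();
        return;  // batch written successfully
      } catch (RuntimeException e) {
        if (attempt >= MAX_ATTEMPTS) {
          throw e;  // still failing after all retries: surface the error
        }
        Thread.sleep(WAIT_MS);  // fixed wait, no backoff
      }
    }
  }
}
```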

manolama commented 5 years ago

Still need it, just gotta work it out.