OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

tsd fsck warning message #895

Open lordang opened 7 years ago

lordang commented 7 years ago

It seems I have a TSD name/UID mapping error in the uid table. Our cluster has a very large number of tagv values because we use the client IP as a tagv. When I ran the uid fsck command, the Java process used all of the RAM (we have 128 GB), ran GC continuously, and consumed all of the CPU. I then got the following warning message.

2016-11-23 10:19:16,882 WARN [New I/O worker #6] Scanner: RegionInfo(table="tsdb-uid", region_name="tsdb-uid,\x1Bx\xB4\xB7,1475154085093.65338521f3a7a06523eec77f11e2ca23.", stop_key="168220431") pretends to not know Scanner(table="tsdb-uid", start_key="!q\x10\xB2", stop_key="", columns=org.hbase.async.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 1484925, already closed?
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1966)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30438)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2016)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:110)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:90)
    at java.lang.Thread.run(Thread.java:745)

Caused by RPC: GetNextRowsRequest(scanner_id=0x000000000006A87D, max_num_rows=1024, region=null, attempt=0), populate_blockcache=true, max_num_rows=1024, max_num_kvs=4096, region=null, filter=null, scanner_id=0x000000000006A87D). I will retry to open a scanner but this is typically because you've been holding the scanner open and idle for too long (possibly due to a long GC pause on your side or in the RegionServer)
2016-11-23 10:19:16,887 ERROR [main] UidManager: Duplicate reverse tagv mapping: 284081517 -> 284081517 and 284081517 -> 217110B2. kv=KeyValue(key="!q\x10\xB2", family="name", qualifier="tagv", value="284081517", timestamp=1460540461278)

Can I ignore this message, let fsck keep running, and wait for it to finish? Or must I add more RAM and try again?

manolama commented 7 years ago

Hello @lordang, the scanner exception you're seeing is normal for a JVM undergoing massive GC, as the underlying connection to HBase will be killed after a timeout period.

But fsck shouldn't eat up 128G of RAM, so it sounds like there's a bug in there. If you could restart it and take a heap dump of the JVM at around 4G or so, I'd love to see it. Then we can fix it up. Thanks!
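
For anyone following along, one possible way to capture a dump at roughly that size is sketched below. It is not part of OpenTSDB. It assumes a Linux host (it reads `/proc/<pid>/status`) and that the JDK's `jmap` tool is on the PATH. It also uses the process's resident memory as a rough stand-in for heap size, which is only an approximation. The function names `rss_bytes`, `build_jmap_command`, and `dump_when_large` are hypothetical helpers made up for this illustration.

```python
import subprocess
import time


def rss_bytes(pid):
    """Return the resident set size of a process in bytes (Linux /proc only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                # The line looks like "VmRSS:   123456 kB".
                return int(line.split()[1]) * 1024
    raise RuntimeError(f"VmRSS not found for pid {pid}")


def build_jmap_command(pid, dump_path):
    """Build the jmap invocation that writes a binary (hprof) heap dump."""
    return ["jmap", f"-dump:format=b,file={dump_path}", str(pid)]


def dump_when_large(pid, threshold_bytes, dump_path, poll_seconds=5.0):
    """Poll the process until its RSS reaches the threshold, then run jmap."""
    while rss_bytes(pid) < threshold_bytes:
        time.sleep(poll_seconds)
    # check=True raises CalledProcessError if jmap exits with a non-zero status.
    subprocess.run(build_jmap_command(pid, dump_path), check=True)
```

With the fsck JVM running as, say, process 12345 (a placeholder pid), calling `dump_when_large(12345, 4 * 1024**3, "fsck_dump.hprof")` would wait until the process occupies about 4 GiB of memory and then write the dump. Because resident memory includes more than the Java heap, the resulting file may come out somewhat smaller than the threshold.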

lordang commented 7 years ago

I took a heap dump, but at about 4 GB it's too big to attach on GitHub. How can I share it?

manolama commented 7 years ago

If you can put it on Dropbox or Google Drive, that would be great.

lordang commented 7 years ago

Here's my heap dump. https://www.dropbox.com/s/halddyh80kyuxb0/fsck_dump.hprof?dl=0