bobrik opened this issue 8 years ago
Heya, that's really interesting. Could you compare the HBase CPU utilization over time when using appends vs. puts? Also see if you can track some HBase stats such as the queue size and the queue/process call times. What could be happening is that since appends take longer to process on the HBase side, they're batched and sent to HBase (as shown by both imports completing in about 13 seconds), but HBase is taking a LOT longer to tell AsyncHBase that the appends have completed processing. I would expect at most a 2x decrease in throughput, not that much. (Oh, and check for GC in the region server.)
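One way to watch those stats, for example, is the region server's JMX servlet. A minimal sketch; the hostname, the default 1.x info port of 16030, and the exact metric names are assumptions that vary by deployment and HBase version:

# Sketch: poll the IPC queue and call-time metrics while an import runs.
# "regionserver" and port 16030 are placeholders for your deployment.
while sleep 5; do
  curl -s 'http://regionserver:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=IPC' \
    | grep -iE 'numCallsInGeneralQueue|QueueCallTime_mean|ProcessCallTime_mean'
done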
I've tried reproducing the issue locally with HBase 1.1.3RC0 in standalone mode with default settings.
The test happened in the following order: tsdb import with appends enabled, then capture the /jmx endpoint of the region server; tsdb import with appends disabled, then capture the /jmx endpoint of the region server again.

With appends: gist with logs and jmx
2016-01-04 13:40:14,549 INFO [main] TextImporter: reading from file:/Users/bobrik/metrics.log
2016-01-04 13:45:04,898 INFO [main] TextImporter: ... 1000000 data points in 290350ms (3444.1 points/s)
2016-01-04 13:45:04,901 INFO [main] TextImporter: Processed /Users/bobrik/metrics.log in 290357 ms, 1000000 data points (3444.0 points/s)
2016-01-04 13:45:04,904 INFO [main] TextImporter: Total: imported 1000000 data points in 290.360s (3444.0 points/s)
Without appends: gist with logs and jmx
2016-01-04 13:56:54,479 INFO [main] TextImporter: reading from file:/Users/bobrik/metrics.log
2016-01-04 13:57:30,222 INFO [main] TextImporter: ... 1000000 data points in 35748ms (27973.6 points/s)
2016-01-04 13:57:30,223 INFO [main] TextImporter: Processed /Users/bobrik/metrics.log in 35751 ms, 1000000 data points (27971.2 points/s)
2016-01-04 13:57:30,226 INFO [main] TextImporter: Total: imported 1000000 data points in 35.754s (27968.7 points/s)
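(For anyone following along, the repro loop amounts to roughly the sketch below. The config path is an assumption, and the JMX port may differ in standalone mode; the truncate step mirrors what was described for the earlier runs.)

# Sketch of one repro iteration; adjust paths and ports to your install.
echo "truncate 'tsdb'" | hbase shell
tsdb import --config=/etc/opentsdb/opentsdb.conf /Users/bobrik/metrics.log
curl -s http://localhost:16030/jmx > jmx-with-appends.json
# then flip tsd.storage.enable_appends in opentsdb.conf and repeat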
This behavior is a bit different from what I described initially, since OpenTSDB exits at the expected point in time, but the performance gap is still enormous.
I've made a production-looking cluster with one RS and pointed a portion of traffic at it. Appends were enabled at 07:00 UTC. Sadly, OpenTSDB does not report appends in its stats:
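(Those stats come from the TSD's HTTP stats endpoint; a quick check, with a placeholder host, that nothing append-related is exported:)

curl -s http://tsd-host:4242/api/stats | grep -i append   # no append counters as of 2.2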
Metrics from HBase are below:
Here two of the region servers are actually masters.
And finally, container resource usage for the region server:
The rise in WAL appends is disturbing. Hopefully this paints a clearer picture.
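(A hedged sketch for watching those WAL numbers directly; the bean and metric names are from HBase 1.x and may differ per version:)

curl -s 'http://regionserver:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=WAL' \
  | grep -iE 'appendCount|appendTime|syncTime|slowAppendCount'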
Hmm, at first I was thinking the WAL bypass may have been ignored for the import + append path, but it's definitely set (https://github.com/OpenTSDB/opentsdb/blob/master/src/core/IncomingDataPoints.java#L327). However, I wonder if the WAL bypass is actually honored for appends; HBase may also need to sync the WAL before completing an append. Then I thought it could be cache related, but that's looking pretty good as far as hits/misses go. The memstore size doesn't seem to be deviating too much either, but if you can allocate more space to the memstore, that should help the appends a bit, as it may need to hit disk less often. Next, try looking at your disk IO stats; I'd be interested to see if it's slamming the disks or even swapping at that point.
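(For the disk side, something as simple as the following during an import run would show whether the disks are saturated or the box is swapping; these are standard sysstat/procps tools:)

iostat -dxm 5   # per-device utilization and MB/s
vmstat 5        # watch the si/so columns for swapping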
@bobrik Can you gist your hbase-site.xml and hbase-env.sh files?
hbase-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.useMulti</name>
    <value>true</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>${zookeeper.znode.parent}</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>${hbase.rootdir}</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>${hbase.zookeeper.quorum}</value>
  </property>
</configuration>
hbase-env.sh:

# default
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"

# prometheus-jmx-export agent
if [ -n "${HBASE_JMX_EXPORTER_PORT_ENV}" ]; then
  if [ -n "${HBASE_JMX_EXPORTER_ROLE}" ]; then
    HBASE_JMX_EXPORTER_JAR=/usr/share/jmx_exporter/agent.jar
    HBASE_JMX_EXPORTER_PORT="${!HBASE_JMX_EXPORTER_PORT_ENV}"
    HBASE_JMX_EXPORTER_CONF=/usr/local/hbase/conf/jmx-exporter.yaml
    sed "s/%cluster%/${HBASE_IDENT_STRING}/g" -i "${HBASE_JMX_EXPORTER_CONF}"
    sed "s/%role%/${HBASE_JMX_EXPORTER_ROLE}/g" -i "${HBASE_JMX_EXPORTER_CONF}"
    export HBASE_OPTS="${HBASE_OPTS} -javaagent:${HBASE_JMX_EXPORTER_JAR}=${HBASE_JMX_EXPORTER_PORT}:${HBASE_JMX_EXPORTER_CONF}"
  fi
fi

# set things to be used in hbase-site.xml,
# just setting arbitrary properties here won't work
export HBASE_OPTS="${HBASE_OPTS} -Dzookeeper.znode.parent=/hbase/${HBASE_IDENT_STRING}"
export HBASE_OPTS="${HBASE_OPTS} -Dhbase.rootdir=${HBASE_ROOTDIR}"
export HBASE_OPTS="${HBASE_OPTS} -Dhbase.zookeeper.quorum=${HBASE_ZOOKEEPER_QUORUM}"

# external JMX RMI
if [ -n "${HBASE_JMX_PORT_ENV}" ]; then
  HBASE_JMX_PORT="${!HBASE_JMX_PORT_ENV}"
  HBASE_JMX_OPTS="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
  HBASE_JMX_OPTS="${HBASE_JMX_OPTS} -Dcom.sun.management.jmxremote.port=${HBASE_JMX_PORT}"
  export HBASE_OPTS="${HBASE_OPTS} ${HBASE_JMX_OPTS}"
fi
I launch region servers with the following settings:
-Dhbase.regionserver.global.memstore.size=0.5
-Dhfile.block.cache.size=0.2
-Dhbase.hstore.blockingStoreFiles=128
-Dhbase.hregion.memstore.flush.size=2147483648 (2GB)
-Dhbase.regionserver.global.memstore.size.lower.limit=0.8
-Dhbase.hregion.max.filesize=107374182400 (100GB)
-Dhbase.bucketcache.ioengine=file:/hbase/cache/offheap
-Dhbase.bucketcache.size=40960 (40GB)
-Dhbase.regionserver.handler.count=128
This is HBase 1.2.0 and I'm still seeing the same behavior.
OpenTSDB:
HBase:
Machine (sda is an SSD holding the offheap cache, everything else is Hadoop):
Cache seems suspicious.
I don't think your writes are keeping up when switching to appends; plotting GC types and times, or using VisualVM, will help. I would drop the off-heap cache, switch to G1GC, and try a large heap with some of the tuning below first. CMS isn't going to lend itself well to that.
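(If plotting GC from logs is easier than attaching VisualVM, a sketch of the usual JDK 8 GC-logging flags; the log path is an assumption:)

export HBASE_OPTS="${HBASE_OPTS} -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc.log"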
some relevant things from my hbase-env.sh:
export HBASE_OPTS="-server -XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=100 -XX:ParallelGCThreads=8"
More HBase G1GC tuning info here http://www.slideshare.net/HBaseCon/dev-session-7-49202969
some relevant things from my hbase-site.xml:
<!-- Allow HBase clients to talk to more than one region in parallel (HBASE-2939) -->
<property> <name>hbase.client.ipc.pool.type</name> <value>RoundRobinPool</value> </property>
<property> <name>hbase.client.ipc.pool.size</name> <value>32</value> </property>
<!-- Disable Nagle's algorithm -->
<property> <name>hbase.ipc.client.tcpnodelay</name> <value>true</value> </property>
<!-- Give each region its own WAL, so writes can occur in parallel (HBASE-5699) -->
<property> <name>hbase.wal.provider</name> <value>multiwal</value> </property>
<!-- Increased from the default of 2 for more parallelism with multiwal (HBASE-5699) -->
<property> <name>hbase.wal.regiongrouping.numgroups</name> <value>4</value> </property>
<!-- The default handler limit seems too low and results in "Call queue is full" errors in the UI; we have a big heap -->
<property> <name>hbase.regionserver.handler.count</name> <value>64</value> </property>
<!-- Read perf: split reads from writes in the call queue -->
<property> <name>hbase.ipc.server.callqueue.read.ratio</name> <value>0.4</value> </property>
<!-- We could probably keep reads and scans on the same queues, but it triggers a startup bug (HBASE-14252) -->
<property> <name>hbase.ipc.server.callqueue.scan.ratio</name> <value>0.5</value> </property>
<!-- Increase the ratio of call queues to handler threads for less contention (default 0.1) -->
<property> <name>hbase.ipc.server.callqueue.handler.factor</name> <value>0.5</value> </property>
<!-- Scan multiple storefiles in parallel -->
<property> <name>hbase.storescanner.parallel.seek.enable</name> <value>true</value> </property>
I will aggregate some of this in a blog post sometime. I'm stuck on older hardware, so you may be able to turn some of the values up for G1GC parallelism, etc.
I can easily do puts at 1M/s (I haven't tried doing more), but appends blow up at 100K/s. Is it really worth it? I've tried G1 before and the impact was mostly negative: 2-5x more CPU usage (and not just in GC!).
Pretty much my experience: https://twitter.com/rcmuir/status/720965802988564480
Not to discount the possibility of bugs, but TSDB isn't forcing you onto appends; if your usage works without them, then keep doing that.
The appends tradeoff I see is that you get much more predictable performance if you can keep up. With appends, your cluster isn't going to shit the bed if someone does a ton of reads and the block cache will always stay nicely populated. The negative is significantly higher resource usage on HBase.
G1 isn't about throughput; STW collection is best for that. G1 will let you use a massive heap while not blowing pause times out to ridiculous levels that would cause nodes to drop out. For your use case, I'd like to see a 100G+ heap, with plenty of room for memstore. I'm thinking you should be able to get a lot closer to put throughput as long as there is CPU and RAM to eat.
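(A sketch of what that could look like in hbase-env.sh; the sizes are illustrative, not something tested in this thread, and the memstore fraction echoes the flags already posted above:)

export HBASE_HEAPSIZE=102400   # in MB, i.e. ~100G; needs the RAM to back it
export HBASE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled"
export HBASE_OPTS="${HBASE_OPTS} -Dhbase.regionserver.global.memstore.size=0.5"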
HBase 1.2.0 has a bad bug in procedure dispatch that causes a ton of CPU spin and garbage creation; make sure to use 1.2.1 if you are running vanilla.
Hi, what are the monitoring tools for OpenTSDB?
I took 1 million datapoints (sampled from real data) and tried to import them into OpenTSDB 2.2.0RC3.
With regular puts and disabled compactions:
With appends enabled:
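(For context, the relevant opentsdb.conf knobs for the two runs; both are standard 2.2 options:)

tsd.storage.enable_compaction = false   # the puts run, with TSD compactions off
tsd.storage.enable_appends = true       # the appends run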
This brings appends from 80000 points/s down to 1000000 / (183 + 12.5) ≈ 5100 points/s. Both times before importing I truncated the tsdb table, but left tsdb-uid alone to avoid UID allocations (wink). I'm running HBase 1.1.2, and during the import even count in the hbase shell failed to start:

Did I miss something in the configuration?