OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

Appends are very slow and accumulate invisibly #661

Open bobrik opened 8 years ago

bobrik commented 8 years ago

I took 1 million data points (sampled from real data) and tried to import them into OpenTSDB 2.2.0RC3.
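
(For reference, the imports below use the bundled text importer; the exact invocation isn't in the logs, but it was presumably something along these lines:)

# hypothetical invocation; the file path is taken from the log lines below
tsdb import --config=/etc/opentsdb/opentsdb.conf /home/ivan/metrics.log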

With regular puts and disabled compactions:

2015-12-14 17:50:53,034 INFO  [main] Config: Successfully loaded configuration file: /etc/opentsdb/opentsdb.conf

...

2015-12-14 17:50:53,285 INFO  [main] TextImporter: reading from file:/home/ivan/metrics.log
2015-12-14 17:51:06,277 INFO  [main] TextImporter: ... 1000000 data points in 12989ms (76988.2 points/s)
2015-12-14 17:51:06,279 INFO  [main] TextImporter: Processed /home/ivan/metrics.log in 12994 ms, 1000000 data points (76958.6 points/s)
2015-12-14 17:51:06,279 INFO  [main] TextImporter: Total: imported 1000000 data points in 12.995s (76954.5 points/s)

... 13s to import, cool

2015-12-14 17:51:06,415 INFO  [AsyncHBase I/O Worker #2] TSDB: Completed shutting down the TSDB

... exited right after, so ~13s total.

With appends enabled:

2015-12-14 17:51:48,533 INFO  [main] Config: Successfully loaded configuration file: /etc/opentsdb/opentsdb.conf

...

2015-12-14 17:51:48,788 INFO  [main] TextImporter: reading from file:/home/ivan/metrics.log
2015-12-14 17:52:01,225 INFO  [main] TextImporter: ... 1000000 data points in 12433ms (80431.1 points/s)
2015-12-14 17:52:01,226 INFO  [main] TextImporter: Processed /home/ivan/metrics.log in 12437 ms, 1000000 data points (80405.2 points/s)
2015-12-14 17:52:01,226 INFO  [main] TextImporter: Total: imported 1000000 data points in 12.438s (80397.0 points/s)

... 12.5s to import, nice.

2015-12-14 17:55:05,075 INFO  [AsyncHBase I/O Worker #2] TSDB: Completed shutting down the TSDB

... exited 183s after the import was reported as finished!

This brings appends down from 80,000 points/s to 1,000,000 / (183 + 12.5) ≈ 5,100 points/s. Before each import I truncated the tsdb table, but left tsdb-uid alone to avoid UID allocations (wink).

I'm running HBase 1.1.2, and during the import even a count in the hbase shell failed:

hbase(main):007:0> count 'tsdb'

ERROR: Call id=163245, waitTime=60001, operationTimeout=60000 expired.

Did I miss something in the configuration?
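
(For context, the two runs presumably differ only in the storage flags in opentsdb.conf; a minimal sketch using the 2.2 option names, not my literal config file:)

# run 1: plain puts, TSD-side compactions disabled
tsd.storage.enable_compaction = false
tsd.storage.enable_appends = false

# run 2: appends enabled (row compactions don't apply in append mode)
tsd.storage.enable_appends = true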

manolama commented 8 years ago

Heya, that's really interesting. Could you compare HBase CPU utilization over time when using appends vs. puts? Also see if you can track some HBase stats like the queue size and queue/process call time. What could be happening is that since appends take longer to process on the HBase side, they're batched and sent to HBase (as shown by both imports completing in about 13 seconds), but HBase is taking a LOT longer to tell AsyncHBase that the appends have finished processing. I would expect at most a 2x decrease in throughput, but not this much. (Oh, and check for GC pauses in the region server.)
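
(One way to grab those numbers is the region server's JMX servlet; a rough sketch, assuming the default 1.x info port of 16030 and that the IPC metric names haven't changed in your version:)

# queue size and queue/process call times from the region server IPC metrics
curl -s 'http://regionserver:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=IPC' \
  | grep -E 'queueSize|numCallsInGeneralQueue|QueueCallTime|ProcessCallTime'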

bobrik commented 8 years ago

I've tried reproducing the issue locally with HBase 1.1.3RC0 in standalone mode with default settings.

The test happened in the following order (a rough shell sketch of the procedure follows the list):

  1. Start HBase.
  2. Create tables.
  3. Do an initial import to allocate all UIDs.
  4. Truncate tsdb.
  5. Stop HBase.
  6. Start HBase and run tests with appends.
  7. Collect JMX from /jmx endpoint of region server.
  8. Truncate tsdb.
  9. Stop HBase.
  10. Start HBase and run tests without appends.
  11. Collect JMX from /jmx endpoint of region server.
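
(A rough shell sketch of the procedure above; paths and the info port are illustrative, not my exact script:)

$HBASE_HOME/bin/start-hbase.sh
env COMPRESSION=NONE HBASE_HOME=$HBASE_HOME ./src/create_table.sh   # table-creation script that ships with OpenTSDB
tsdb import /Users/bobrik/metrics.log                                # initial import, allocates all UIDs
echo "truncate 'tsdb'" | $HBASE_HOME/bin/hbase shell
$HBASE_HOME/bin/stop-hbase.sh

$HBASE_HOME/bin/start-hbase.sh
tsdb import /Users/bobrik/metrics.log                                # timed run, with appends toggled in opentsdb.conf
curl -s http://localhost:16030/jmx > jmx.json                        # region server JMX (port may differ in standalone mode)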

With appends: gist with logs and jmx

2016-01-04 13:40:14,549 INFO  [main] TextImporter: reading from file:/Users/bobrik/metrics.log
2016-01-04 13:45:04,898 INFO  [main] TextImporter: ... 1000000 data points in 290350ms (3444.1 points/s)
2016-01-04 13:45:04,901 INFO  [main] TextImporter: Processed /Users/bobrik/metrics.log in 290357 ms, 1000000 data points (3444.0 points/s)
2016-01-04 13:45:04,904 INFO  [main] TextImporter: Total: imported 1000000 data points in 290.360s (3444.0 points/s)

Without appends: gist with logs and jmx

2016-01-04 13:56:54,479 INFO  [main] TextImporter: reading from file:/Users/bobrik/metrics.log
2016-01-04 13:57:30,222 INFO  [main] TextImporter: ... 1000000 data points in 35748ms (27973.6 points/s)
2016-01-04 13:57:30,223 INFO  [main] TextImporter: Processed /Users/bobrik/metrics.log in 35751 ms, 1000000 data points (27971.2 points/s)
2016-01-04 13:57:30,226 INFO  [main] TextImporter: Total: imported 1000000 data points in 35.754s (27968.7 points/s)

This behavior is a bit different from what I described initially, since OpenTSDB exits at the expected point in time, but the performance gap is still enormous.

bobrik commented 8 years ago

I've made a production-looking cluster with one RS and pointed a portion of traffic at it. Appends were enabled at 07:00 UTC. Sadly, OpenTSDB does not report appends in its stats:

(screenshot)
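
(For reference, the TSD numbers here are scraped from its HTTP stats endpoint, assuming the default port 4242; nothing append-related shows up there:)

curl -s http://localhost:4242/api/stats | grep -i append    # comes back empty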

Metrics from HBase are below:

(screenshots)

Here, 2 of the region servers are actually masters.

(screenshots)

And finally container resource usage for regionserver:

(screenshot)

The rise in WAL appends is disturbing. Hopefully this gives a clearer picture.

manolama commented 8 years ago

Hmm, at first I was thinking the WAL bypass may have been ignored for the import + append path, but it's definitely set (https://github.com/OpenTSDB/opentsdb/blob/master/src/core/IncomingDataPoints.java#L327). However, I wonder whether the WAL bypass is honored for appends; HBase may also need to sync the WAL before completing an append. Then I thought it could be cache related, but that's looking pretty good as far as hits/misses. Also, the memstore size doesn't seem to be deviating too much, but if you can allocate more space to the memstore, that should help the appends a bit, as it may need to hit disk less often. Next, look at your disk IO stats; I'd be interested to see if it's slamming the disks or even swapping at that point.
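
(A quick way to check the disk IO and swapping angle on the region server host; plain OS tools, nothing HBase specific:)

iostat -x 5    # per-device utilization, await and queue depth, every 5s
vmstat 5       # the si/so columns should stay at 0 if there's no swapping
free -m        # memory and swap totals at a glance

For the memstore headroom, the knobs would be hbase.regionserver.global.memstore.size and hbase.hregion.memstore.flush.size in hbase-site.xml.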

kev009 commented 8 years ago

@bobrik Can you gist your hbase-site.xml and hbase-env.sh files?

bobrik commented 8 years ago

hbase-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.useMulti</name>
    <value>true</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>${zookeeper.znode.parent}</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>${hbase.rootdir}</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>${hbase.zookeeper.quorum}</value>
  </property>
</configuration>

hbase-env.sh:

# default
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"

# prometheus-jmx-export agent
if [ -n "${HBASE_JMX_EXPORTER_PORT_ENV}" ]; then
  if [ -n "${HBASE_JMX_EXPORTER_ROLE}" ]; then
    HBASE_JMX_EXPORTER_JAR=/usr/share/jmx_exporter/agent.jar
    HBASE_JMX_EXPORTER_PORT="${!HBASE_JMX_EXPORTER_PORT_ENV}"
    HBASE_JMX_EXPORTER_CONF=/usr/local/hbase/conf/jmx-exporter.yaml

    sed "s/%cluster%/${HBASE_IDENT_STRING}/g"   -i "${HBASE_JMX_EXPORTER_CONF}"
    sed "s/%role%/${HBASE_JMX_EXPORTER_ROLE}/g" -i "${HBASE_JMX_EXPORTER_CONF}"

    export HBASE_OPTS="${HBASE_OPTS} -javaagent:${HBASE_JMX_EXPORTER_JAR}=${HBASE_JMX_EXPORTER_PORT}:${HBASE_JMX_EXPORTER_CONF}"
  fi
fi

# set things to be used in hbase-site.xml,
# just setting arbitrary properties here won't work
export HBASE_OPTS="${HBASE_OPTS} -Dzookeeper.znode.parent=/hbase/${HBASE_IDENT_STRING}"
export HBASE_OPTS="${HBASE_OPTS} -Dhbase.rootdir=${HBASE_ROOTDIR}"
export HBASE_OPTS="${HBASE_OPTS} -Dhbase.zookeeper.quorum=${HBASE_ZOOKEEPER_QUORUM}"

# external JMX RMI
if [ -n "${HBASE_JMX_PORT_ENV}" ]; then
  HBASE_JMX_PORT="${!HBASE_JMX_PORT_ENV}"
  HBASE_JMX_OPTS="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
  HBASE_JMX_OPTS="${HBASE_JMX_OPTS} -Dcom.sun.management.jmxremote.port=${HBASE_JMX_PORT}"
  export HBASE_OPTS="${HBASE_OPTS} ${HBASE_JMX_OPTS}"
fi

I launch region servers with the settings above.

This is HBase 1.2.0, and I'm still seeing the same behavior.

OpenTSDB:

(screenshots)

HBase:

(screenshots)

Machine (sda is an SSD with the off-heap cache, everything else is Hadoop):

(screenshots)

Cache seems suspicious.

kev009 commented 8 years ago

I don't think your writes are keeping up when switching to appends; plotting GC types and times, or using VisualVM, will help confirm that. I would drop the off-heap cache, switch to G1GC, and try a large heap and some of the tuning below first. CMS isn't going to lend itself well to that.

some relevant things from my hbase-env.sh:

export HBASE_OPTS="-server -XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=100 -XX:ParallelGCThreads=8"

More HBase G1GC tuning info here http://www.slideshare.net/HBaseCon/dev-session-7-49202969

some relevant things from my hbase-site.xml:

  <!-- Allow HBase clients to talk to more than one region in parallel (HBASE-2939) -->
  <property> <name>hbase.client.ipc.pool.type</name> <value>RoundRobinPool</value> </property>
  <property> <name>hbase.client.ipc.pool.size</name> <value>32</value> </property>
  <!-- Disable Nagle's algorithm -->
  <property> <name>hbase.ipc.client.tcpnodelay</name> <value>true</value> </property>
  <!-- Give each region its own WAL, so writes can occur in parallel (HBASE-5699) -->
  <property> <name>hbase.wal.provider</name> <value>multiwal</value> </property>
  <!-- Increased from default of 2 for more batching with (HBASE-5699) -->
  <property> <name>hbase.wal.regiongrouping.numgroups</name> <value>4</value> </property>
  <!-- The default handler limit seems too low and results in "Call queue is full" in the UI; we have a big heap -->
  <property> <name>hbase.regionserver.handler.count</name> <value>64</value> </property>
  <!-- Read perf:  split reads from writes in the call queue -->
  <property> <name>hbase.ipc.server.callqueue.read.ratio</name> <value>0.4</value> </property>
  <!-- We could probably keep reads and scans on the same queues, but it triggers a startup bug (HBASE-14252) -->
  <property> <name>hbase.ipc.server.callqueue.scan.ratio</name> <value>0.5</value> </property>
  <!-- Increase the number of queues to threads for less contention (default 0.1) -->
  <property> <name>hbase.ipc.server.callqueue.handler.factor</name> <value>0.5</value> </property>
  <!-- Scan multiple storefiles in parallel -->
  <property> <name>hbase.storescanner.parallel.seek.enable</name> <value>true</value> </property>

I will aggregate some of this in a blog post some time. I'm stuck on older hardware so you may be able to turn some of the values up for G1GC parallelism etc.

bobrik commented 8 years ago

I can easily do puts at 1M/s (I haven't tried going higher), but appends blow up at 100K/s. Is it really worth it? I've tried G1 before and the impact was mostly negative: 2-5x more CPU usage (not just from GC!).

Pretty much my experience: https://twitter.com/rcmuir/status/720965802988564480

kev009 commented 8 years ago

Not to discount the possibility of bugs, but TSDB isn't forcing you onto appends; if your usage works without them, then keep doing that.

The appends tradeoff I see is that you get much more predictable performance if you can keep up. With appends, your cluster isn't going to shit the bed if someone does a ton of reads, and the block cache will always stay nicely populated. The negative is significantly higher resource usage on HBase.
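
(Roughly, the per-data-point difference at the HBase level is: put mode writes each point as its own column and relies on TSD compaction later, while append mode keeps appending to one cell, which HBase has to read-modify-write for every point. A simplified hbase shell illustration with made-up ASCII keys and qualifiers; OpenTSDB really uses binary row keys, the 't' family, and asynchbase:)

# put mode: one new qualifier per data point
put 'tsdb', 'row-for-this-hour', 't:q1', 'v1'
put 'tsdb', 'row-for-this-hour', 't:q2', 'v2'

# append mode: every data point grows the same cell, so the region server
# re-reads and rewrites that cell for each incoming point
append 'tsdb', 'row-for-this-hour', 't:agg', 'v1'
append 'tsdb', 'row-for-this-hour', 't:agg', 'v2'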

G1 isn't about throughput; STW collection is best for that. G1 will allow you to use a massive heap without blowing pause times up to ridiculous levels that would cause nodes to drop out. For your use case, I'd like to see a 100G+ heap, with plenty of room for the memstore. I'm thinking you should be able to get a lot closer to put throughput as long as there is CPU and RAM to eat.
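
(A sketch of what the heap part could look like in hbase-env.sh; the number is a placeholder to size against your boxes, not a recommendation:)

# large heap for G1; older releases want this in MB, newer ones accept a unit
export HBASE_HEAPSIZE=100g
# keep the G1 options from above, e.g.:
export HBASE_OPTS="-server -XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=100"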

HBase 1.2.0 has a bad bug in procedure dispatch that causes a ton of CPU spin and garbage creation; make sure to use 1.2.1 if you are running vanilla.

weihongsheng commented 6 years ago

Hi, what monitoring tools do you use for OpenTSDB?

a928755613 commented 6 years ago