long range scans with concurrency are almost 2X faster without update_row_stats

mdcallag commented 7 years ago

There are many memory system stalls from update_row stats while running sysbench read-only with concurrency. I am confused because that code is similar to how InnoDB updates srv_stats.n_rows_read and I don't see this problem for upstream InnoDB.

I get almost 2X more QPS for sysbench read-only with --oltp-range-size=10000 when update_row_stats is removed at high levels of concurrency (16 or more clients). This is also an issue for sysbench read-write that shares many of the queries.

I will try to figure out why this is only a problem for MyRocks then I will reassign. I prefer that the global counters get updated at query end rather than on every fetched row

https://github.com/facebook/mysql-5.6/blob/webscalesql-5.6.27.75/storage/rocksdb/ha_rocksdb.cc#L100

mdcallag commented 7 years ago

Test was sysbench run via all.sh at https://github.com/mdcallag/mytools/tree/master/bench/sysbench.lua

Command line was:bash all.sh 8 1000000 180 300 rocksdb 1 0

hedengcheng commented 7 years ago

Hi, Mark

I think this may be the same problem which was optimized on MySQL 5.6. On update the same counter, cache coherence and false sharing is very costly. Lock at this two links:

https://blogs.oracle.com/mysqlinnodb/entry/innodb_performance_improvements ... After using the excellent Linux tool perf we were able to narrow it down to a global statistic counter in row_sel_search_for_mysql().

http://mikaelronstrom.blogspot.com/2012/04/mysql-team-increases-scalability-by-50.html

he dengcheng

mdcallag commented 7 years ago

Thanks. I have been using perf and I like it too. We borrowed the code from innodb that has an internal array for global counters and then threads hash to different offsets in the array. So we shouldn't have this problem, but we do. I need to debug more.

On Tue, Oct 18, 2016 at 6:39 PM, hedengcheng notifications@github.com wrote:

Hi, Mark

I think this may be the same problem which was optimized on MySQL 5.6. On update the same counter, cache coherence and false sharing is very costly. Lock at this two links:

https://blogs.oracle.com/mysqlinnodb/entry/innodb_performance_improvements ... After using the excellent Linux tool perf we were able to narrow it down to a global statistic counter in row_sel_search_for_mysql().

http://mikaelronstrom.blogspot.com/2012/04/mysql- team-increases-scalability-by-50.html

he dengcheng

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/facebook/mysql-5.6/issues/343#issuecomment-254687216, or mute the thread https://github.com/notifications/unsubscribe-auth/ABkKTccKIOlhqiKM6GxO1DOmN9iv9uf3ks5q1XTFgaJpZM4KaNPb .

Mark Callaghan mdcallag@gmail.com

mdcallag commented 7 years ago

Should have posted this earlier. From reading the code we share the features in ut0counter that support less memory contention for counters. My current theory is that more threads maps to the same array offset in these counters with MyRocks than with InnoDB. Still checking that.

The problem is memory system contention from update_row_stats which is called on every row fetch. Add concurrency and queries that fetch 10,000 rows and RocksDB gets slow. QPS is almost doubled with the call to update_row_stats removed.

sysbench read-only QPS with --oltp-range-size=10000

1    16    48    concurrency/engine
242  3720  6676  InnoDB
205  2185  3057  MyRocks, Aug5 build
205  3229  5619  MyRocks, Aug5 build, update_row_stats removed

I am confused because InnoDB has a similar counter.

from ha_innobase::general_fetch(
...
        case DB_SUCCESS:
                error = 0;
                table->status = 0;
                srv_stats.n_rows_read.add((size_t) prebuilt->trx->id, 1);

And that uses ib_counter_t just like MyRocks

struct srv_stats_t {
        typedef ib_counter_t<lsn_t, 1, single_indexer_t> lsn_ctr_1_t;
        typedef ib_counter_t<ulint, 1, single_indexer_t> ulint_ctr_1_t;
        typedef ib_counter_t<lint, 1, single_indexer_t> lint_ctr_1_t;
        typedef ib_counter_t<ulint, 64> ulint_ctr_64_t;
        typedef ib_counter_t<ib_int64_t, 1, single_indexer_t> ib_int64_ctr_1_t;
...
        /** Number of rows read. */
        ulint_ctr_64_t          n_rows_read;

typedef ib_counter_t<ulint, 64> ulint_ctr_64_t;

MyRocks calls this from ::rnd_next_with_direction

  if (!rc) {
    stats.rows_read++;
    stats.rows_index_next++;
    update_row_stats(ROWS_READ);
  }

And update_row_stats


/**
  Updates row counters based on the table type and operation type.
*/
void ha_rocksdb::update_row_stats(operation_type type) {
  DBUG_ASSERT(type < ROWS_MAX);
  // Find if we are modifying system databases.
  if (table->s && m_tbl_def->m_is_mysql_system_table)
    global_stats.system_rows[type].inc();
  else
    global_stats.rows[type].inc();
}

And the counter type

enum operation_type {
  ROWS_DELETED = 0,
  ROWS_INSERTED,
  ROWS_READ,
  ROWS_UPDATED,
  ROWS_MAX
};

/* Global statistics struct used inside MyRocks */
struct st_global_stats {
  ib_counter_t<ulonglong, 64> rows[ROWS_MAX];

  // system_rows_ stats are only for system
  // tables. They are not counted in rows_* stats.
  ib_counter_t<ulonglong, 64> system_rows[ROWS_MAX];
};

/** Class for using fuzzy counters. The counter is not protected by any
mutex and the results are not guaranteed to be 100% accurate but close
enough. Creates an array of counters and separates each element by the
CACHE_LINE_SIZE bytes */
template <
        typename Type,
        int N = IB_N_SLOTS,
        template<typename, int> class Indexer = thread_id_indexer_t>
class ib_counter_t {
...
        /** If you can't use a good index id. Increment by 1. */
        void inc() { add(1); }

        /** If you can't use a good index id.
        * @param n  - is the amount to increment */
        void add(Type n) {
                size_t  i = m_policy.offset(m_policy.get_rnd_index());

                DBUG_ASSERT(i < UT_ARRAY_SIZE(m_counter));

                m_counter[i] += n;
        }

        /** Indexer into the array */
        Indexer<Type, N>m_policy;

        /** Slot 0 is unused. */
        Type            m_counter[(N + 1) * (CACHE_LINE_SIZE / sizeof(Type))];

from storage/rocksdb/ha_rocksdb.cc

static st_global_stats global_stats;
static st_export_stats export_stats;

laurynas-biveinis commented 7 years ago

Since RocksDB appears to be using InnoDB counter implementation, https://bugs.mysql.com/bug.php?id=79454 might be relevant for this particular issue

yoshinorim commented 7 years ago

CC: @santoshbanda

mdcallag commented 7 years ago

Laurynas - thanks for the update. AFAIK, we are using the thread ID indexer. Next steps for me are to look at preprocessor output and then log the thread IDs that are being used.

On Wed, Oct 19, 2016 at 9:22 AM, Yoshinori Matsunobu < notifications@github.com> wrote:

CC: @santoshbanda https://github.com/santoshbanda

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/facebook/mysql-5.6/issues/343#issuecomment-254864671, or mute the thread https://github.com/notifications/unsubscribe-auth/ABkKTajPnxJNknr0BeOIQ40Mfx4SEDOnks5q1kPAgaJpZM4KaNPb .

Mark Callaghan mdcallag@gmail.com

mdcallag commented 7 years ago

Laurynas - I will try the sched_getcpu() scheduler as explained in the bug report. Thanks!

On Wed, Oct 19, 2016 at 9:39 AM, MARK CALLAGHAN mdcallag@gmail.com wrote:

Laurynas - thanks for the update. AFAIK, we are using the thread ID indexer. Next steps for me are to look at preprocessor output and then log the thread IDs that are being used.

On Wed, Oct 19, 2016 at 9:22 AM, Yoshinori Matsunobu < notifications@github.com> wrote:

CC: @santoshbanda https://github.com/santoshbanda

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/facebook/mysql-5.6/issues/343#issuecomment-254864671, or mute the thread https://github.com/notifications/unsubscribe-auth/ABkKTajPnxJNknr0BeOIQ40Mfx4SEDOnks5q1kPAgaJpZM4KaNPb .

Mark Callaghan mdcallag@gmail.com

Mark Callaghan mdcallag@gmail.com

george-lorch commented 7 years ago

TokuDB/PerconaFT suffered from this pretty severely a while ago and they invented a partitioned counter to deal with it. It ends up being a real mess as there is a set of counters per thread, all of which have to be reaped whenever the counters are harvested. It helps for short term bursts of counter updates but doesn't help if/when a thread gets rescheduled on a different core and requires a fairly large amount of memory for the counter set.

We have a proof of concept in PerconaFT that shows the sched_getcpu() works very well but the issue is that it isn't fully supported on all desired platforms so some platform specific magic may need inventing.

mdcallag commented 7 years ago

Using server with 2 sockets, 24 cores, 48 HW threads I tried sysbench read-only with --oltp-range-size=10000, 8 tables and 1M rows/table. QPS per engine is:

6403 - InnoDB 5.6.26 - this does not use sched_getcpu
8108 - InnoDB 5.7.10 - I assume this uses sched_getcpu, did not confirm
3090 - MyRocks from August 5
6273 - MyRocks from August 5, use of rows_read counter disabled
6093 - MyRocks from August 5, rows_read counter enabled, uses sched_getcpu

So it looks like sched_getcpu is a good thing here. Next I will look at the thread IDs in use by MyRocks to understand whether there is too much sharing of array entries.

mdcallag commented 7 years ago

I think that using get_sched_indexer_t fixes the problem for st_global_stats from ha_rocksdb.h

struct st_global_stats {
  ib_counter_t<ulonglong, 64, get_sched_indexer_t> rows[ROWS_MAX];
...
  ib_counter_t<ulonglong, 64, get_sched_indexer_t> system_rows[ROWS_MAX];
};

But to enable that we need 2 things. First, HAVE_SCHED_GETCPU has to be defined and it is not for 5.6. That is defined in 5.7 by storage/innobase/innodb.cmake. Second, we have to make sure the build isn't broken on systems that don't have the sched_getcpu call, and see storage/innobase/include/ut0counter.h for in some versions of MySQL 5.7

Note that ut0counter has changed multiple times in 5.7. The version in 5.7.15 doesn't use sched_getcpu. For our use case I think we should use sched_getcpu.

CHECK_FUNCTION_EXISTS(sched_getcpu  HAVE_SCHED_GETCPU)
IF(HAVE_SCHED_GETCPU)
 ADD_DEFINITIONS(-DHAVE_SCHED_GETCPU=1)
ENDIF()

mdcallag commented 7 years ago

Laurnyas - thanks again for the bug report. It was a good read: http://bugs.mysql.com/bug.php?id=79454 http://bugs.mysql.com/bug.php?id=79455

manukranthk commented 7 years ago

Can something like folly/ThreadCachedInt.h be used here as suggested by georgelorchpercona?

mdcallag commented 7 years ago

Using thread local counters would be more invasive about the amount of code to be changed. While it might be the right thing, I prefer to make such changes upstream as forks like FB MySQL also optimize to minimize the diff size. In this case, using sched_getcpu() instead of pthread_self() is good enough.

On Wed, Oct 19, 2016 at 5:16 PM, manukranthk notifications@github.com wrote:

Can something like folly/ThreadCachedInt.h be used here as suggested by georgelorchpercona?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/facebook/mysql-5.6/issues/343#issuecomment-254976692, or mute the thread https://github.com/notifications/unsubscribe-auth/ABkKTW8z8ST2VfMg8s4h0J7Ga4SJtE4Uks5q1rLYgaJpZM4KaNPb .

Mark Callaghan mdcallag@gmail.com

jkedgar commented 7 years ago

Here's a diff to change this (https://reviews.facebook.net/D65427) @mdcallag, I could also add a new indexer if sched_getcpu is not available that would use the the thread id from pthread_self() but would right shift the value by some number of bits (6?) to get rid of all the automatic 0 bits to the right in a pointer. Does that sound like a good idea or should I just leave it as it is?

jkedgar commented 7 years ago

The code would look similar to this:

template <typename Type, int N>
struct rdb_thread_id_indexer_t : public generic_indexer_t<Type, N> {
  size_t get_rnd_index() const {
#if defined(__WIN__)
    return GetCurrentThreadId();
#elif sizeof(void *) == 8
    return (size_t) (((uintptr_t) pthread_self()) >> 6);
#else
    return (size_t) (((uintptr_t) pthread_self()) >> 5);
#endif
  }
}

And we would use rdb_thread_id_indexer_t instead of the default thread_id_indexer_t.

mdcallag commented 7 years ago

Fixed by https://github.com/facebook/mysql-5.6/commit/d83fe82c8f42023cda6a97021361d90bae39805c

facebook / mysql-5.6

long range scans with concurrency are almost 2X faster without update_row_stats #343