facebook / mysql-5.6

Facebook's branch of the Oracle MySQL database. This includes MyRocks.
http://myrocks.io

READ_INFO::read_field() is bottleneck of LOAD stmt #864

Open rockeet opened 6 years ago

rockeet commented 6 years ago

MySQL script:

CREATE TABLE ShortTable (
    strkey varchar(40) PRIMARY KEY,
    i1  int(10),
    i2  int(10),
    i3  int(10),
    i4  int(10)
);
SET UNIQUE_CHECKS=0;
LOAD DATA INFILE 'somefile.txt' INTO TABLE ShortTable;

write_buffer_size was set to 20G to disable db flush & compactions.

Then we load 4GB of data into MySQL.

In perf top, READ_INFO::read_field() is the top function, consuming 8.79% of CPU time. The loading speed is greatly slower than InnoDB.

mdcallag commented 6 years ago

When I wrote this I missed write_buffer_size=20g, but even after trying that the results are similar. Otherwise I used write_buffer_size=64m.

What versions are you using for MyRocks and InnoDB?

I can't reproduce this with my setup; details and results below.

Given that this was a CPU bottleneck in ::read_field I don't think my.cnf tuning should matter.

I tried with 1M, 2M, 4M and 8M rows and MyRocks was faster in every case.

seconds to load
MyRocks InnoDB  rows    MyRocks datadir size
 4.216   6.059  1M      < 1gb
 8.557  12.127  2M      1.6gb
17.282  24.692  4M      2.1gb
35.218  49.355  8M      5.3gb

The git hashes for MyRocks are:

git_hash        96de6c4308d1ca9884c01391355e09457089e2b7
rocksdb_git_hash        31ee4bf2400cf1807718192d352ca8bd7837ada2
rockeet commented 6 years ago

The difference is probably that you used the sysbench schema, where the value is longer than the key; in our case the key is longer than the value.

mdcallag commented 6 years ago

Maybe, but could you answer my questions about the versions of MyRocks and InnoDB used?

Even better would be a script to generate the data to load. That will save me time.

mdcallag commented 6 years ago

Would also help to know how many rows were in that 4GB of data.

mdcallag commented 6 years ago
  1. ::read_field is from upstream. AFAIK we have not touched that code. But you haven't told me what versions of MySQL are being compared here.
  2. You are also vague about the perf difference. How long does it take for MyRocks vs InnoDB?
  3. Finally, I am not even sure what is being compared. I don't even know whether MyRocks or InnoDB is CPU or IO bound -- is fsync on commit disabled or enabled for both?

While I appreciate bug reports, please provide sufficient detail when you first report a bug.

With 10M rows and something that matches your schema, 34 secs for MyRocks, 53 for InnoDB. With 100M rows, ~390 secs for MyRocks, ~570 secs for InnoDB. fsync on commit is disabled for both and MyRocks is tuned similar to the advice from the MyRocks github wiki.

Script to generate data was:

    for i in $( seq 1 X ) ; do printf "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%d\t%d\t%d\t%d\t%d\n" $i $i $i $i $i ; done
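That shell loop (with X as the row count) can also be written as a small Python sketch; the long 'a' prefix and the four tab-separated integer columns match the printf format above (the function name is mine, not from the thread):

```python
def gen_rows(n):
    # Mirror the shell generator: a key made of a run of 'a' characters plus
    # the row number, then the same number repeated in four int columns,
    # tab-separated as LOAD DATA INFILE expects by default.
    for i in range(1, n + 1):
        yield "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%d\t%d\t%d\t%d\t%d" % (i, i, i, i, i)

if __name__ == "__main__":
    for row in gen_rows(3):
        print(row)
```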

Using perf, top-N functions for MyRocks ...

    25.90%  my-oneconnectio  mysqld               [.] rocksdb::WriteBatchEntryComparator::operator()
    14.13%  my-oneconnectio  libc-2.23.so         [.] __memcmp_sse4_1
     9.99%  my-oneconnectio  mysqld               [.] READ_INFO::read_field
     6.18%  my-oneconnectio  mysqld               [.] rocksdb::WriteBatchWithIndex::Rep::UpdateExistingEntryWithCfId
     6.05%  my-oneconnectio  mysqld               [.] myrocks::Rdb_pk_comparator::Compare
     3.43%  my-oneconnectio  mysqld               [.] mysql_load
     2.02%  my-oneconnectio  mysqld               [.] rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::Insert
     1.41%  my-oneconnectio  mysqld               [.] my_strntoull10rnd_8bit
     1.31%  my-oneconnectio  mysqld               [.] pack_row
     1.21%  my-oneconnectio  mysqld               [.] memcmp@plt

And for InnoDB

     9.99%  mysqld           mysqld              [.] cmp_dtuple_rec_with_match_low
     8.44%  mysqld           mysqld              [.] rec_get_offsets_func
     5.47%  mysqld           mysqld              [.] page_cur_search_with_match
     5.38%  mysqld           mysqld              [.] READ_INFO::read_field
     4.76%  mysqld           mysqld              [.] buf_page_get_gen
     3.98%  mysqld           mysqld              [.] log_group_write_buf
     3.92%  mysqld           mysqld              [.] os_aio_linux_handle
     2.65%  mysqld           mysqld              [.] btr_cur_search_to_nth_level
     2.57%  mysqld           mysqld              [.] page_cur_insert_rec_write_log
     2.44%  mysqld           libc-2.23.so        [.] __memset_avx2
mdcallag commented 6 years ago

Repeated the test with write_buffer_size set to 64mb and 20gb; results were similar, about 390 secs. Then repeated the test with a default my.cnf for MyRocks and results were still about 390 secs.

rockeet commented 6 years ago

Thanks very much for your effort. Our test has several differences from yours:

  1. The dataset is real business data; we are trying to write a datagen tool to generate similar data.
  2. We did not disable WAL sync on commit.
  3. We set the commit-in-the-middle batch size to 1000.

There are still some other details I cannot figure out right now, since the engineer who ran this test is on vacation. I will post the results as soon as possible.

Thank you again!

mdcallag commented 6 years ago

I asked twice above about the versions of MyRocks and InnoDB that you have been using. You have yet to answer, so I am going to stop helping on this bug. Next time please be less vague when filing a bug report. I ended up asking a lot of questions. Try to provide enough detail so that someone can reproduce the problem that you see -- which includes details on the versions of code used, the my.cnf options, some way to simulate the data, etc.

wal sync on or off is fine as long as InnoDB has similar setting. Although with commit-in-middle you might have the opportunity to do more commits with MyRocks than InnoDB. I rarely use commit-in-middle but suggest setting it to something larger than 1000. I didn't use commit-in-middle for my results above.

With rocksdb_commit_in_the_middle MyRocks is even faster than above -- about 260 seconds with rocksdb_bulk_load_size=1000 and then 250 seconds with rocksdb_bulk_load_size=10000. Again, wal sync was off.
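For anyone rerunning this comparison, a hypothetical my.cnf fragment combining the settings discussed in this comment (commit-in-the-middle on with a larger batch, WAL sync off at commit); the variable names are the ones used elsewhere in this thread, but the exact values are illustrative:

```ini
[mysqld]
rocksdb_commit_in_the_middle    = ON
rocksdb_bulk_load_size          = 10000
# 0 skips the WAL sync at commit, analogous to innodb_flush_log_at_trx_commit=0
rocksdb_flush_log_at_trx_commit = 0
```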

rockeet commented 6 years ago

Sorry that our detailed test info is late.

We have re-run the test and collected the results below:

Test environment & test data

  1. MyRocks version is fb-prod201704(26adcc9).

When we tested with rocksdb_commit_in_the_middle=off, we hit an old bug caused by incorrect use of vector::reserve; we hotfixed the bug on this version, and the results are based on the fixed code.

  2. my.cnf for InnoDB
    [mysqld]
    character-set-server  = utf8
    collation-server      = utf8_bin
    performance_schema    = off
    default-storage-engine = INNODB
    innodb_buffer_pool_size = 16G
    innodb_buffer_pool_instances = 1
    innodb_flush_log_at_trx_commit=1 # varied per test (fsync on/off)
    # dir configs...
  3. my.cnf for RocksDB
    [mysqld]
    character-set-server = utf8
    collation-server     = utf8_bin
    rocksdb
    default-storage-engine=rocksdb
    skip-innodb
    default-tmp-storage-engine = MyISAM
    rocksdb_default_cf_options = write_buffer_size=20g
    rocksdb_bulk_load_size          = 1000
    rocksdb_commit_in_the_middle    = ON # varied per test
    rocksdb_flush_log_at_trx_commit = 1  # varied per test (fsync on/off)
    # dir configs...
  4. simulation data gen: data-gen.py
    #!/usr/bin/python3
    import random
    import time

    ap = 10000
    t = time.time()

    def get_str(dic):
        global t
        global ap
        s = str(int(t)) + str(ap) + '_sh' + dic[random.randint(0, 999)]
        delimiter = ', '
        g = lambda r: '0' if r < 0.3 \
                 else '1' if r < 0.6 \
                 else '2' if r < 0.95 \
                 else '120' if r < 0.98 \
                 else '130'
        for i in range(4):
            r = random.random()
            s += delimiter + g(r)
        ap += 1
        return s

    def generate(n):
        dic = {}
        for i in range(0, 1000):
            dic[i] = str(random.randint(10000, 99999))
        for i in range(1, n):
            print(get_str(dic))

    if __name__ == "__main__":
        generate(70000000)

Test result

We wrote data-gen.py to generate simulated data: 70,000,000 rows, 2.37 GB in total.
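One property of the generated keys is worth noting: each key starts with the epoch-seconds value captured at script start plus a monotonically increasing counter, so neighboring keys share a long common prefix. A quick sketch of the key shape, assuming data-gen.py above (the 5-digit suffix is illustrative):

```python
import time

t = int(time.time())  # fixed once at script start, as in data-gen.py
ap = 10000            # the per-row counter
key = str(t) + str(ap) + '_sh' + '12345'  # '12345' stands in for one random 5-digit suffix
# ~10 digits of epoch seconds + 5-digit counter + '_sh' + 5-digit suffix,
# roughly 23 bytes, most of them shared between consecutive keys -- which
# plausibly explains why memcmp and the key comparators dominate the profiles.
```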

The InnoDB result is copied into each table for comparison with RocksDB:

  1. rocksdb_commit_in_the_middle = on and rocksdb_bulk_load_size = 1000

     fsync   innodb      rocksdb
     on      381.5 sec   419.3 sec
     off     383.7 sec   185.8 sec

  2. rocksdb_commit_in_the_middle = off

     fsync   innodb      rocksdb
     on      381.5 sec   431.4 sec
     off     383.7 sec   426.6 sec

Top functions are similar regardless of the fsync setting.

  1. rocksdb
    9.63%  my-oneconnectio  mysqld               [.] READ_INFO::read_field
    8.74%  my-oneconnectio  libc-2.17.so         [.] __memcmp_sse4_1
    7.61%  my-oneconnectio  mysqld               [.] rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator<true> const&>::FindGreaterOrEqual
    6.33%  my-oneconnectio  mysqld               [.] myrocks::Rdb_pk_comparator::Compare
    4.20%  rocksdb:bg0      mysqld               [.] snappy::internal::CompressFragment
    3.34%  my-oneconnectio  mysqld               [.] my_strnxfrm_unicode
    3.02%  my-oneconnectio  mysqld               [.] rocksdb::MemTable::KeyComparator::operator()
    2.77%  my-oneconnectio  mysqld               [.] rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
    2.38%  my-oneconnectio  mysqld               [.] rocksdb::crc32c::ExtendImpl<&rocksdb::crc32c::Slow_CRC32>
    2.22%  my-oneconnectio  mysqld               [.] rocksdb::InternalKeyComparator::Compare
    2.09%  my-oneconnectio  mysqld               [.] mysql_load
    1.93%  my-oneconnectio  mysqld               [.] myrocks::Rdb_key_def::pack_with_varchar_space_pad
  2. innodb
    10.19%  mysqld   mysqld              [.] rec_get_offsets_func
    8.28%  mysqld   mysqld              [.] my_strnncollsp_mb_bin
    6.90%  mysqld   mysqld              [.] buf_page_get_gen
    6.73%  mysqld   mysqld              [.] page_cur_search_with_match
    6.22%  mysqld   mysqld              [.] cmp_dtuple_rec_with_match_low
    5.59%  mysqld   mysqld              [.] READ_INFO::read_field
    4.71%  mysqld   mysqld              [.] log_group_write_buf
    4.62%  mysqld   mysqld              [.] buf_calc_page_new_checksum
    2.32%  mysqld   mysqld              [.] innobase_mysql_cmp
    2.31%  mysqld   mysqld              [.] cmp_whole_field

Conclusion

  1. The slowness is caused by fsync; when fsync is off, RocksDB is faster than InnoDB.
  2. rocksdb_commit_in_the_middle = on improves performance.
mdcallag commented 6 years ago

Conclusions would be more useful if you told us what version you used for InnoDB. This is the 4th time I have asked for it on this bug report. 1) What provides InnoDB -- upstream (Oracle), Percona Server, MariaDB? 2) What version of InnoDB is it -- 5.6.?, 5.7.? or 8.0.?

You are using MyRocks from Dec 2017, which uses RocksDB from git hash 8486ea. That is old code, and I can't reproduce the problem on current MyRocks.

mdcallag commented 6 years ago

It would also be good to know whether this is a custom fork of MyRocks, with many changes to it or MySQL that nobody else has access to.

rockeet commented 6 years ago
  1. InnoDB is the same mysqld binary as MyRocks, run with InnoDB's my.cnf.
  2. Code in the test is fb-prod201704(26adcc9) with a one-line change to RocksDB:

    diff --git a/util/arena.cc b/util/arena.cc
    index 1382cb1..52fb411 100644
    --- a/util/arena.cc
    +++ b/util/arena.cc
    @@ -197,7 +197,6 @@ char* Arena::AllocateNewBlock(size_t block_bytes) {
       // this way the insertion into the vector below will not throw and we
       // won't leak the allocated memory in that case. if reserve() throws,
       // we won't leak either
    -  blocks_.reserve(blocks_.size() + 1);
     
       char* block = new char[block_bytes];
       size_t allocated_size;

We still have a fork of MyRocks with many changes, but that is not the one used in this test.

mdcallag commented 6 years ago

This bug report started with "The loading speed is greatly slower than InnoDB." Then after much back and forth you provide results where MyRocks gets about 90% of the throughput compared to InnoDB. Do you mean that 10% slower is "greatly slower"?

Next time be specific up front when you file a bug report because 10% slower isn't "greatly slower" to most people.

Note that this is an in-memory workload.

mdcallag commented 6 years ago

Repeated tests with fsync off, tuned my.cnf for InnoDB, default my.cnf for MyRocks using both latin1_bin (that I used for the results above) and utf8/utf8_bin (that you mention above). MyRocks is faster for both. The binlog was enabled for the first two tests, and for the results I shared above. Disabling it makes things slightly faster. When enabled, fsync was disabled for it.

I am also using upstream (unmodified) MySQL 5.6.35 for InnoDB, while you get it from the FB branch. From past tests, InnoDB 5.6 from the FB branch is faster than from upstream 5.6.

I also used performance_schema=1 for InnoDB, but perf schema is disabled for MyRocks. I don't need it with the FB branch because we have user_statistics & table_statistics. I do need perf schema with upstream MySQL to get something similar to user/table stats.

MyRocks InnoDB
407     572     latin1_bin
469     663     utf8/utf8_bin
442     615     utf8/utf8_bin, skip_log_bin
NA      605     utf8/utf8_bin, skip_log_bin, performance_schema=0