facebook / mysql-5.6

Facebook's branch of the Oracle MySQL database. This includes MyRocks.
http://myrocks.io

MyRocks much slower than InnoDB for many-row transactions #194

Open mdcallag opened 8 years ago

mdcallag commented 8 years ago

Summary:

Times per operation. These are from the 'time mysql ...' statements in the testcase. The MyRocks tests were run twice, first with the my.cnf settings listed below ("default") and then with rocksdb_skip_unique_check=1 ("skip").

        default skip
innodb  myrocks myrocks query
 2.746  15.646  0.814   insert..select 512k rows on table with PK
 0.249   0.522  0.532   select count(*) 1M rows
 3.575  33.475  35.848  update unindexed 1M rows
 2.299  33.113  9.217   index 1M rows
12.399  99.489  89.840  update indexed column 1M rows
 4.138  35.247  35.001  update unindexed column 1M rows
 8.432  81.752  7.155   insert..select 1M rows on table with PK, secondary index
 0.465   1.589  1.591   select count(*) 2M rows

Testcase

bin/mysql -uroot -ppw -e "drop database if exists mysqlslap"
bin/mysql -uroot -ppw -e "create database mysqlslap"
bin/mysql -uroot -ppw mysqlslap -e "CREATE TABLE au (a int primary key auto_increment, b int, c int);INSERT INTO au VALUES (null,1,1), (null,1,1), (null,1,1), (null,1,1);"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
time bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
time bin/mysql -uroot -ppw mysqlslap -e "select count(*) from au"
time bin/mysql -uroot -ppw mysqlslap -e "UPDATE au set b = a"
time bin/mysql -uroot -ppw mysqlslap -e "alter table au add index bx(b)"
time bin/mysql -uroot -ppw mysqlslap -e "UPDATE au set b = b + 1"
time bin/mysql -uroot -ppw mysqlslap -e "UPDATE au set c = c + 1"
time bin/mysql -uroot -ppw mysqlslap -e "INSERT INTO au select null, 1, 1 from au"
time bin/mysql -uroot -ppw mysqlslap -e "select count(*) from au"
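For reference, the row counts in the summary follow directly from the doubling inserts above; this is just an arithmetic check of the testcase, not part of it:

```python
rows = 4                       # seed rows from the initial INSERT
for _ in range(17):            # 17 untimed INSERT..SELECTs, each doubling the table
    rows *= 2
assert rows == 524288          # ~512K rows before the timed statements
rows *= 2                      # timed insert..select adds another 512K -> 1M rows
assert rows == 1048576
rows *= 2                      # second timed insert..select -> 2M rows
assert rows == 2097152
```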

RocksDB my.cnf options:

rocksdb_max_open_files=-1
rocksdb_max_background_compactions=16
rocksdb_max_total_wal_size=4G
rocksdb_max_background_flushes=4
rocksdb_block_size=8192
rocksdb_block_cache_size=40G
rocksdb_table_cache_numshardbits=6
rocksdb_write_sync=0

# rate limiter
rocksdb_bytes_per_sync=4194304
rocksdb_wal_bytes_per_sync=4194304
# rocksdb_rate_limiter_bytes_per_sec=41943040 #40MB/s if running 4x. If running 1x, you can set ~200MB or even more.

# triggering compaction if there are many sequential deletes
rocksdb_compaction_sequential_deletes=1800
rocksdb_compaction_sequential_deletes_window=2000

rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=32m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;max_write_buffer_number=4;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression:kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true

# enabling row checksums for 5% keys/values
# rocksdb_store_checksums=1
# rocksdb_verify_checksums=1
# rocksdb_checksums_pct=5

default-storage-engine=rocksdb
skip-innodb
default-tmp-storage-engine=MyISAM
rocksdb

long_query_time=1
skip_log_bin
sync_binlog=0

rocksdb_allow_concurrent_memtable_write=1
rocksdb_enable_write_thread_adaptive_yield=1
mdcallag commented 8 years ago

Inspired by https://github.com/facebook/mysql-5.6/issues/184

yoshinorim commented 8 years ago

I assume it's a known issue that many modifications within a single transaction use a lot of CPU time in WriteBatchWithIndex's SkipList, and I think it's reproducible by writing simple RocksDB client code.

yoshinorim commented 8 years ago

This is a simplified RocksDB client code to reproduce very similar Perf profiles -- https://gist.github.com/yoshinorim/619985fd114f0c8917e9 .

$ time ./simple_example insert 1000000 1000000 /tmp/simple-example
$ time ./simple_example update 1000000 1000000 /tmp/simple-example

On my machine insert finished in ~9 seconds, update took ~25 seconds, and most of the time was spent in WriteBatch-related functions.

 21.99%  simple_example  [.] rocksdb::ReadRecordFromWriteBatch
 14.14%  simple_example  [.] rocksdb::WriteBatchEntryComparator::CompareKey
 14.14%  simple_example  [.] rocksdb::WriteBatchEntryComparator::operator()
 11.52%  libc-2.20.so    [.] __memcmp_sse4_1
  8.90%  simple_example  [.] rocksdb::ReadableWriteBatch::GetEntryFromDataOffset
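The hotspot is consistent with each Put doing a skiplist lookup whose comparator must decode write-batch records from the flat buffer on every probe (ReadRecordFromWriteBatch / GetEntryFromDataOffset above). A toy Python model of that access pattern, not the RocksDB code itself, shows the decode count growing super-linearly with batch size:

```python
class ToyWriteBatchIndex:
    """Toy model of WriteBatchWithIndex: records live in one flat data
    buffer and the index stores only offsets, so every key comparison
    must first decode the key from the buffer."""

    def __init__(self):
        self.data = bytearray()   # concatenated records: [klen][key][value]
        self.offsets = []         # record offsets, kept sorted by decoded key
        self.decodes = 0          # counts key decodes (the hot path in perf)

    def _key_at(self, off):
        self.decodes += 1
        klen = self.data[off]
        return bytes(self.data[off + 1 : off + 1 + klen])

    def put(self, key, value):
        off = len(self.data)
        self.data += bytes([len(key)]) + key + value
        # binary search over offsets, decoding one key per probe
        lo, hi = 0, len(self.offsets)
        while lo < hi:
            mid = (lo + hi) // 2
            if self._key_at(self.offsets[mid]) < key:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self.offsets) and self._key_at(self.offsets[lo]) == key:
            self.offsets[lo] = off        # update existing entry in place
        else:
            self.offsets.insert(lo, off)  # new key

def decodes_for(n):
    idx = ToyWriteBatchIndex()
    for i in range(n):
        idx.put(b"%08d" % i, b"v")
    return idx.decodes
```

Doubling the batch more than doubles the total decode work (roughly n log n comparisons, each paying a decode), which matches the profile being dominated by record-decoding functions rather than the skiplist itself.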
mdcallag commented 8 years ago

Whether or not it is known, this issue makes it easier for others to understand the problem even if we close it as "won't fix". I don't think it is as widely known as you assume, and we have to document this for our users.

I am not ready to give up on this problem yet especially if we want to have broader adoption. Can someone figure out whether we can make the code paths more efficient?

There is another problem. When the huge write batch is written into RocksDB there are ~5 second stalls on commits from concurrent sessions. The stall is a function of the write batch size, so this might be a 10 second commit stall for a 2M row write batch.

This was running in the background to measure concurrent commit performance. It gets stalled by the huge write batches.


bin/mysql -uroot -ppw mysqlslap -e "CREATE TABLE a2 (a int, b int);INSERT INTO a2 VALUES (1,1)"

while :; do time bin/mysqlslap -uroot -ppw --delimiter=";" --query="insert into a2 values (1,1)" --iterations=1000 ; done
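The stall can be modeled with a toy sketch (assuming, as a simplification of RocksDB's write path, that concurrent commits serialize behind a single writer): while one thread applies a huge batch, a tiny commit's latency is dominated by the big batch's apply time.

```python
import threading
import time

write_lock = threading.Lock()   # stand-in for the serialized write path

def commit(apply_seconds, latencies):
    start = time.monotonic()
    with write_lock:                 # a small commit waits here for the big batch
        time.sleep(apply_seconds)    # stand-in for applying the write batch
    latencies.append(time.monotonic() - start)

lat = []
big = threading.Thread(target=commit, args=(0.5, lat))     # "huge write batch"
big.start()
time.sleep(0.05)
small = threading.Thread(target=commit, args=(0.001, lat)) # concurrent tiny commit
small.start()
big.join()
small.join()
# the tiny commit's latency is ~0.45s: almost entirely time spent
# waiting for the big batch, not its own 1ms of work
```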
siying commented 8 years ago

@yoshinorim is there a way to get the full call stack in the perf output? I'm curious from which code path these functions are called.

mdcallag commented 8 years ago

you mean "perf record -ag ..."?


yoshinorim commented 8 years ago

@siying Here are example stack traces I got from the above simple-example bench. RocksDB revision was the following.

commit deb08b8226a6e54038c48d62394c585a94a7f181
Author: Omar Sandoval <osandov@fb.com>
Date:   Wed Mar 2 10:34:14 2016 -0800

Put:

#0  0x000000000055fd24 in rocksdb::ReadRecordFromWriteBatch(rocksdb::Slice*, char*, unsigned int*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) (n=8, this=0x7ffe1acd4640) at ./include/rocksdb/slice.h:73
#1  0x000000000055fd24 in rocksdb::ReadRecordFromWriteBatch(rocksdb::Slice*, char*, unsigned int*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) (result=0x7ffe1acd46d0, input=0x7ffe1acd4640) at ./util/coding.h:243
#2  0x000000000055fd24 in rocksdb::ReadRecordFromWriteBatch(rocksdb::Slice*, char*, unsigned int*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) (input=input@entry=0x7ffe1acd4640, tag=tag@entry=0x7ffe1acd463b "\001", column_family=column_family@entry=0x7ffe1acd463c, key=key@entry=0x7ffe1acd46d0, value=value@entry=0x7ffe1acd46f0, blob=0x7ffe1acd4710) at db/write_batch.cc:210
#3  0x0000000000586728 in rocksdb::ReadableWriteBatch::GetEntryFromDataOffset(unsigned long, rocksdb::WriteType*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) const (this=<optimized out>, data_offset=<optimized out>, type=type@entry=0x7ffe1acd46cc, Key=Key@entry=0x7ffe1acd46d0, value=value@entry=0x7ffe1acd46f0, blob=0x7ffe1acd4710) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:46
#4  0x0000000000586a88 in rocksdb::WriteBatchEntryComparator::operator()(rocksdb::WriteBatchIndexEntry const*, rocksdb::WriteBatchIndexEntry const*) const (this=0x78f7e0, entry1=0x9be11b8, entry2=0x9be0ee8) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:94
#5  0x0000000000585cfa in rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::FindGreaterOrEqual(rocksdb::WriteBatchIndexEntry* const&) const (this=0x790098, n=<optimized out>, key=@0x9be06b0: 0x9be0ee8) at ./db/skiplist.h:280
#6  0x0000000000585cfa in rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::FindGreaterOrEqual(rocksdb::WriteBatchIndexEntry* const&) const (this=this@entry=0x790098, key=@0x7ffe1acd47d0: 0x7ffe1acd47e0) at ./db/skiplist.h:297
#7  0x0000000000583771 in rocksdb::WriteBatchWithIndex::Rep::UpdateExistingEntryWithCfId(unsigned int, rocksdb::Slice const&) (target=@0x7ffe1acd47d0: 0x7ffe1acd47e0, this=<synthetic pointer>) at ./db/skiplist.h:247
#8  0x0000000000583771 in rocksdb::WriteBatchWithIndex::Rep::UpdateExistingEntryWithCfId(unsigned int, rocksdb::Slice const&) (key=..., this=<synthetic pointer>) at utilities/write_batch_with_index/write_batch_with_index.cc:331
#9  0x0000000000583771 in rocksdb::WriteBatchWithIndex::Rep::UpdateExistingEntryWithCfId(unsigned int, rocksdb::Slice const&) (this=this@entry=0x78f7c0, column_family_id=0, key=...) at utilities/write_batch_with_index/write_batch_with_index.cc:425
#10 0x0000000000583edd in rocksdb::WriteBatchWithIndex::Rep::AddOrUpdateIndex(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&) (key=..., column_family_id=<optimized out>, this=0x78f7c0) at utilities/write_batch_with_index/write_batch_with_index.cc:446
#11 0x0000000000583edd in rocksdb::WriteBatchWithIndex::Rep::AddOrUpdateIndex(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&) (key=..., column_family=0x0, this=0x78f7c0) at utilities/write_batch_with_index/write_batch_with_index.cc:415
#12 0x0000000000583edd in rocksdb::WriteBatchWithIndex::Rep::AddOrUpdateIndex(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&) (this=0x78f7c0, column_family=0x0, key=...) at utilities/write_batch_with_index/write_batch_with_index.cc:440
#13 0x0000000000580fc3 in rocksdb::TransactionBaseImpl::Put(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&) (this=0x78f6b0, column_family=0x0, key=..., value=...) at utilities/transactions/transaction_base.cc:246
#14 0x00000000004a724d in rocksdb::TransactionBaseImpl::Put(rocksdb::Slice const&, rocksdb::Slice const&) (this=<optimized out>, key=..., value=...) at ./utilities/transactions/transaction_base.h:101
#15 0x000000000040baed in main ()

GetForUpdate:

#4  0x00000000005868c1 in rocksdb::WriteBatchEntryComparator::CompareKey(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&) const (this=0x78f7e0, column_family=0, key1=..., key2=...) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:128
#5  0x00000000005869e6 in rocksdb::WriteBatchEntryComparator::operator()(rocksdb::WriteBatchIndexEntry const*, rocksdb::WriteBatchIndexEntry const*) const (this=0x78f7e0, entry1=0xc61c058, entry2=0xc61bf08) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:114
#6  0x0000000000585cfa in rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::FindGreaterOrEqual(rocksdb::WriteBatchIndexEntry* const&) const (this=0x790098, n=<optimized out>, key=@0xc61b710: 0xc61bf08) at ./db/skiplist.h:280
#7  0x0000000000585cfa in rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::FindGreaterOrEqual(rocksdb::WriteBatchIndexEntry* const&) const (this=0x790098, key=@0x7ffe1acd4578: 0x7ffe1acd4580) at ./db/skiplist.h:297
#8  0x0000000000586090 in rocksdb::WBWIIteratorImpl::SeekToLast() (target=@0x7ffe1acd4578: 0x7ffe1acd4580, this=<optimized out>) at ./db/skiplist.h:247
#9  0x0000000000586090 in rocksdb::WBWIIteratorImpl::SeekToLast() (this=0xc621880) at utilities/write_batch_with_index/write_batch_with_index.cc:321
#10 0x0000000000586c80 in rocksdb::WriteBatchWithIndexInternal::GetFromBatch(rocksdb::DBOptions const&, rocksdb::WriteBatchWithIndex*, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::MergeContext*, rocksdb::WriteBatchEntryComparator*, std::string*, bool, rocksdb::Status*) (options=..., batch=batch@entry=0x78f710, column_family=column_family@entry=0x76bdc0, key=..., merge_context=merge_context@entry=0x7ffe1acd4780, cmp=0x78f7e0, value=0x7ffe1acd4790, overwrite_key=true, s=0x7ffe1acd47a0) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:168
#11 0x0000000000583106 in rocksdb::WriteBatchWithIndex::GetFromBatchAndDB(rocksdb::DB*, rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::string*) (this=0x78f710, db=0x6b1600, read_options=..., column_family=0x76bdc0, key=..., value=0x7ffe1acd4910) at utilities/write_batch_with_index/write_batch_with_index.cc:683
#12 0x000000000057c227 in rocksdb::TransactionBaseImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::string*) (this=<optimized out>, read_options=..., column_family=<optimized out>, key=..., value=<optimized out>) at utilities/transactions/transaction_base.cc:169
#13 0x000000000057c177 in rocksdb::TransactionBaseImpl::GetForUpdate(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::string*) (this=this@entry=0x78f6b0, read_options=..., column_family=0x76bdc0, key=..., value=value@entry=0x7ffe1acd4910) at utilities/transactions/transaction_base.cc:178
#14 0x00000000004a7210 in rocksdb::TransactionBaseImpl::GetForUpdate(rocksdb::ReadOptions const&, rocksdb::Slice const&, std::string*) (this=0x78f6b0, options=..., key=..., value=0x7ffe1acd4910) at ./utilities/transactions/transaction_base.h:62
#15 0x000000000040ba8d in main ()
siying commented 8 years ago

@mdcallag yes, "-ag" will work. @yoshinorim any idea which of the two cases dominates? Even better would be running "perf -ag" and showing the call graph for the top items.