Open mdcallag opened 8 years ago
Inspired by https://github.com/facebook/mysql-5.6/issues/184
I assume it's a known issue that lots of modifications within single transaction uses lots of CPU time at WriteBatchWithIndex's SkipList, and I think it's reproducible by writing simple RocksDB client code.
This is a simplified RocksDB client code to reproduce very similar Perf profiles -- https://gist.github.com/yoshinorim/619985fd114f0c8917e9 .
$ time ./simple_example insert 1000000 1000000 /tmp/simple-example
$ time ./simple_example update 1000000 1000000 /tmp/simple-example
On my machine insert finished in ~9 seconds, update took ~25 seconds, and most of the time were spent by WriteBatch related functions.
21.99% simple_example [.] rocksdb::ReadRecordFromWriteBatch a
14.14% simple_example [.] rocksdb::WriteBatchEntryComparator::CompareKey a
14.14% simple_example [.] rocksdb::WriteBatchEntryComparator::operator() a
11.52% libc-2.20.so [.] __memcmp_sse4_1 a
8.90% simple_example [.] rocksdb::ReadableWriteBatch::GetEntryFromDataOffset
Whether it is known, this issue makes it easier for others to understand the problem even if we close it as "won't fix". I don't think it is as widely known as you and we have to document this for our users.
I am not ready to give up on this problem yet especially if we want to have broader adoption. Can someone figure out whether we can make the code paths more efficient?
There is a another problem. When the huge write batch is written into RocksDB there are ~5 second stalls on commits from concurrent sessions. That stall is a function of the write batch, so maybe this would be a 10 second commit stall for a 2M row write batch.
This was running in the background to measure concurrent commit performance. It gets stalled by the huge write batches.
bin/mysql -uroot -ppw mysqlslap -e "CREATE TABLE a2 (a int, b int);INSERT INTO a2 VALUES (1,1)"
while :; do time bin/mysqlslap -uroot -ppw --delimiter=";" --query="insert into a2 values (1,1)" --iterations=1000 ; done
@yoshinorim is there a way to get the full call stack in the perf output? I'm curious from which code path these functions are called.
you mean "perf record -ag ..."?
On Wed, Mar 2, 2016 at 6:44 PM, Siying Dong notifications@github.com wrote:
@yoshinorim https://github.com/yoshinorim is there a way to get the full call stack in the perf output? I'm curious from which code path these functions are called.
— Reply to this email directly or view it on GitHub https://github.com/facebook/mysql-5.6/issues/194#issuecomment-191551445.
Mark Callaghan mdcallag@gmail.com
@siying Here are example stack traces I got from the above simple-example bench. RocksDB revision was the following.
commit deb08b8226a6e54038c48d62394c585a94a7f181
Author: Omar Sandoval <osandov@fb.com>
Date: Wed Mar 2 10:34:14 2016 -0800
Put:
#0 0x000000000055fd24 in rocksdb::ReadRecordFromWriteBatch(rocksdb::Slice*, char*, unsigned int*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) (n=8, this=0x7ffe1acd4640) at ./include/rocksdb/slice.h:73
#1 0x000000000055fd24 in rocksdb::ReadRecordFromWriteBatch(rocksdb::Slice*, char*, unsigned int*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) (result=0x7ffe1acd46d0, input=0x7ffe1acd4640) at ./util/coding.h:243
#2 0x000000000055fd24 in rocksdb::ReadRecordFromWriteBatch(rocksdb::Slice*, char*, unsigned int*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) (input=input@entry=0x7ffe1acd4640, tag=tag@entry=0x7ffe1acd463b "\001", column_family=column_family@entry=0x7ffe1acd463c, key=key@entry=0x7ffe1acd46d0, value=value@entry=0x7ffe1acd46f0, blob=0x7ffe1acd4710) at db/write_batch.cc:210
#3 0x0000000000586728 in rocksdb::ReadableWriteBatch::GetEntryFromDataOffset(unsigned long, rocksdb::WriteType*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) const (this=<optimized out>, data_offset=<optimized out>, type=type@entry=0x7ffe1acd46cc, Key=Key@entry=0x7ffe1acd46d0, value=value@entry=0x7ffe1acd46f0, blob=0x7ffe1acd4710) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:46
#4 0x0000000000586a88 in rocksdb::WriteBatchEntryComparator::operator()(rocksdb::WriteBatchIndexEntry const*, rocksdb::WriteBatchIndexEntry const*) const (this=0x78f7e0, entry1=0x9be11b8, entry2=0x9be0ee8) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:94
#5 0x0000000000585cfa in rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::FindGreaterOrEqual(rocksdb::WriteBatchIndexEntry* const&) const (this=0x790098, n=<optimized out>, key=@0x9be06b0: 0x9be0ee8) at ./db/skiplist.h:280
#6 0x0000000000585cfa in rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::FindGreaterOrEqual(rocksdb::WriteBatchIndexEntry* const&) const (this=this@entry=0x790098, key=@0x7ffe1acd47d0: 0x7ffe1acd47e0) at ./db/skiplist.h:297
#7 0x0000000000583771 in rocksdb::WriteBatchWithIndex::Rep::UpdateExistingEntryWithCfId(unsigned int, rocksdb::Slice const&) (target=@0x7ffe1acd47d0: 0x7ffe1acd47e0, this=<synthetic pointer>) at ./db/skiplist.h:247
#8 0x0000000000583771 in rocksdb::WriteBatchWithIndex::Rep::UpdateExistingEntryWithCfId(unsigned int, rocksdb::Slice const&) (key=..., this=<synthetic pointer>) at utilities/write_batch_with_index/write_batch_with_index.cc:331
#9 0x0000000000583771 in rocksdb::WriteBatchWithIndex::Rep::UpdateExistingEntryWithCfId(unsigned int, rocksdb::Slice const&) (this=this@entry=0x78f7c0, column_family_id=0, key=...) at utilities/write_batch_with_index/write_batch_with_index.cc:425
#10 0x0000000000583edd in rocksdb::WriteBatchWithIndex::Rep::AddOrUpdateIndex(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&) (key=..., column_family_id=<optimized out>, this=0x78f7c0) at utilities/write_batch_with_index/write_batch_with_index.cc:446
#11 0x0000000000583edd in rocksdb::WriteBatchWithIndex::Rep::AddOrUpdateIndex(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&) (key=..., column_family=0x0, this=0x78f7c0) at utilities/write_batch_with_index/write_batch_with_index.cc:415
#12 0x0000000000583edd in rocksdb::WriteBatchWithIndex::Rep::AddOrUpdateIndex(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&) (this=0x78f7c0, column_family=0x0, key=...) at utilities/write_batch_with_index/write_batch_with_index.cc:440
#13 0x0000000000580fc3 in rocksdb::TransactionBaseImpl::Put(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&) (this=0x78f6b0, column_family=0x0, key=..., value=...) at utilities/transactions/transaction_base.cc:246
#14 0x00000000004a724d in rocksdb::TransactionBaseImpl::Put(rocksdb::Slice const&, rocksdb::Slice const&) (this=<optimized out>, key=..., value=...) at ./utilities/transactions/transaction_base.h:101
#15 0x000000000040baed in main ()
GetForUpdate:
#4 0x00000000005868c1 in rocksdb::WriteBatchEntryComparator::CompareKey(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&) const (this=0x78f7e0, column_family=0, key1=..., key2=...) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:128
#5 0x00000000005869e6 in rocksdb::WriteBatchEntryComparator::operator()(rocksdb::WriteBatchIndexEntry const*, rocksdb::WriteBatchIndexEntry const*) const (this=0x78f7e0, entry1=0xc61c058, entry2=0xc61bf08) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:114
#6 0x0000000000585cfa in rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::FindGreaterOrEqual(rocksdb::WriteBatchIndexEntry* const&) const (this=0x790098, n=<optimized out>, key=@0xc61b710: 0xc61bf08) at ./db/skiplist.h:280
#7 0x0000000000585cfa in rocksdb::SkipList<rocksdb::WriteBatchIndexEntry*, rocksdb::WriteBatchEntryComparator const&>::FindGreaterOrEqual(rocksdb::WriteBatchIndexEntry* const&) const (this=0x790098, key=@0x7ffe1acd4578: 0x7ffe1acd4580) at ./db/skiplist.h:297
#8 0x0000000000586090 in rocksdb::WBWIIteratorImpl::SeekToLast() (target=@0x7ffe1acd4578: 0x7ffe1acd4580, this=<optimized out>) at ./db/skiplist.h:247
#9 0x0000000000586090 in rocksdb::WBWIIteratorImpl::SeekToLast() (this=0xc621880) at utilities/write_batch_with_index/write_batch_with_index.cc:321
#10 0x0000000000586c80 in rocksdb::WriteBatchWithIndexInternal::GetFromBatch(rocksdb::DBOptions const&, rocksdb::WriteBatchWithIndex*, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::MergeContext*, rocksdb::WriteBatchEntryComparator*, std::string*, bool, rocksdb::Status*) (options=..., batch=batch@entry=0x78f710, column_family=column_family@entry=0x76bdc0, key=..., merge_context=merge_context@entry=0x7ffe1acd4780, cmp=0x78f7e0, value=0x7ffe1acd4790, overwrite_key=true, s=0x7ffe1acd47a0) at utilities/write_batch_with_index/write_batch_with_index_internal.cc:168
#11 0x0000000000583106 in rocksdb::WriteBatchWithIndex::GetFromBatchAndDB(rocksdb::DB*, rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::string*) (this=0x78f710, db=0x6b1600, read_options=..., column_family=0x76bdc0, key=..., value=0x7ffe1acd4910) at utilities/write_batch_with_index/write_batch_with_index.cc:683
#12 0x000000000057c227 in rocksdb::TransactionBaseImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::string*) (this=<optimized out>, read_options=..., column_family=<optimized out>, key=..., value=<optimized out>) at utilities/transactions/transaction_base.cc:169
#13 0x000000000057c177 in rocksdb::TransactionBaseImpl::GetForUpdate(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::string*) (this=this@entry=0x78f6b0, read_options=..., column_family=0x76bdc0, key=..., value=value@entry=0x7ffe1acd4910) at utilities/transactions/transaction_base.cc:178
#14 0x00000000004a7210 in rocksdb::TransactionBaseImpl::GetForUpdate(rocksdb::ReadOptions const&, rocksdb::Slice const&, std::string*) (this=0x78f6b0, options=..., key=..., value=0x7ffe1acd4910) at ./utilities/transactions/transaction_base.h:62
#15 0x000000000040ba8d in main ()
@mdcallag yes, "-ag" will work. @yoshinorim any idea which of the two cases is more? Even better would be running "perf -ag" and show the call graph for top items.
Summary:
Times per operation. These are from the 'time mysql ...' statements in the testcase. The MyRocks tests were run twice, first with the my.cnf settings listed below ("default") and then with rocksdb_skip_unique_check=1 ("skip").
Testcase
RocksDB my.cnf options: