facebook / mysql-5.6

Facebook's branch of the Oracle MySQL database. This includes MyRocks.
http://myrocks.io

Implement Prefetch Cache in MyRocks #705

Open yoshinorim opened 6 years ago

yoshinorim commented 6 years ago

This is a follow-up task from https://github.com/facebook/mysql-5.6/issues/200. (See Sergey's comment on https://github.com/facebook/mysql-5.6/issues/200#issuecomment-203674731 for details.)

This was a test that scanned 15GB uncompressed MyRocks/InnoDB tables (unfragmented primary key) where all data fit in the filesystem cache.

MyRocks: 2 min 12.62 sec (with readahead=16mb)
InnoDB: 1 min 50.36 sec
InnoDB with the prefetch cache disabled: 2 min 21.28 sec

mdcallag commented 6 years ago

What is a prefetch cache?

yoshinorim commented 6 years ago

Grep storage/innobase for MYSQL_FETCH_CACHE_SIZE, MYSQL_FETCH_CACHE_THRESHOLD, row_sel_prefetch_cache_init, row_sel_enqueue_cache_row_for_mysql, etc. The basic idea is to reduce the number of function calls needed to convert from the engine row format to the MySQL row format, by caching multiple rows and converting them in a batch.
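To make the batching idea concrete, here is a minimal Python sketch. It is not InnoDB's or MyRocks's actual code: `CountingEngine`, `fetch_batch`, `convert`, and `scan` are hypothetical names invented for illustration, and the toy engine only counts how often the scan has to call into it. The point is that filling a small cache per call means fewer calls for the same number of rows returned.

```python
from typing import Dict, Iterator, List, Tuple

EngineRow = Tuple[int, bytes]  # hypothetical "engine format" row


class CountingEngine:
    """Toy storage engine (hypothetical) that counts calls made into it."""

    def __init__(self, rows: List[EngineRow]) -> None:
        self.rows = rows
        self.fetch_calls = 0

    def fetch_batch(self, start: int, limit: int) -> List[EngineRow]:
        # One call into the engine returns up to `limit` rows.
        self.fetch_calls += 1
        return self.rows[start:start + limit]


def convert(row: EngineRow) -> Dict[str, object]:
    # Stand-in for converting an engine-format row to the server's row format.
    return {"id": row[0], "val": row[1].decode()}


def scan(engine: CountingEngine, batch_size: int) -> Iterator[Dict[str, object]]:
    """Full scan that fills a small cache per engine call, then serves rows from it."""
    pos = 0
    while True:
        batch = engine.fetch_batch(pos, batch_size)
        if not batch:
            return
        cache = [convert(r) for r in batch]  # convert the whole batch at once
        pos += len(batch)
        yield from cache                     # later rows come from the cache
        if len(batch) < batch_size:
            return                           # short batch: end of table
```

With `batch_size=1` this degenerates to one engine call per row, which is roughly what running with the prefetch cache disabled corresponds to in this toy model.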

mdcallag commented 6 years ago

Without looking at the code, my guess was that the InnoDB code is there to reduce the locking/access overhead of accessing the database page, and it isn't clear to me that RocksDB has such an overhead with iterators.

Tema commented 6 years ago

@yoshinorim I'd like to start by reproducing the test. #200 describes a much smaller test with only 10M. Did you run the very same test, only with a much greater number of rows? Also, where can I set readahead=16mb (or confirm that I have the same value), and how can I configure compressed/uncompressed tables?

Tema commented 6 years ago

@mdcallag is right. I built a flame graph using the perf tool for @yoshinorim's experiment and can clearly see that with prefetch disabled MySQL spends 10% more time in the #sel_restore_position_for_mysql(#buf_page_optimistic_get) method and 3% more in mtr_commit (mini-transaction commit).

The doc also mentions that the savings come from batching the data fetch.

mdcallag commented 6 years ago

I will soon have data to show the benefit of rocksdb_advise_random_on_open=0 (default is =1), although the feature as-is won't work for us (we need to set it per session, not for all sessions).

During full index scan tests I see page-at-a-time reads in iostat output, avgrq-sz == 16kb. RocksDB could be doing large IOs for full index scans, with either posix_fadvise hints or new code in RocksDB.
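For readers unfamiliar with the posix_fadvise hint mentioned above, here is a small Python sketch of the general technique, using `os.posix_fadvise` (Python's wrapper around the POSIX call, available on Unix). It is only an illustration of hinting sequential access before a front-to-back read. It does not show how RocksDB opens SST files, and `scan_file_sequential` and its `chunk_size` parameter are hypothetical names.

```python
import os


def scan_file_sequential(path: str, chunk_size: int = 1 << 20) -> int:
    """Read a file front to back after hinting sequential access.

    Returns the total number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # posix_fadvise is Unix-only; skip the hint where it is unavailable.
        if hasattr(os, "posix_fadvise"):
            # offset=0 and length=0 apply the advice to the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total
```

The hint is advisory: the kernel may enlarge its readahead window in response, but whether iostat then shows larger requests depends on the kernel and filesystem, so the effect would need to be measured rather than assumed.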