couchbase / forestdb

A Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie
Apache License 2.0
1.29k stars 172 forks source link

Improve MemoryPool locality #12

Closed arthurprs closed 7 years ago

arthurprs commented 8 years ago

Improve cache locality by using a stack for indexes and a continuous block of memory.

Although it now requires a multiplication (at least preserving this API) the operations on the underlining vector are faster and have better locality than the previous underlining deque.

hisundar commented 8 years ago

Thanks Arthur. We have added a simple unit test for this at https://github.com/couchbase/forestdb/blob/master/tests/unit/mempool_test.cc but what we observe is that with your patch, there is nearly a 2X slowdown compared to the existing implementation with std::queue. I am not sure if this slowdown is due to the multiplication overhead or from the underlying STL implementation or something else. I am on OS X with AppleClang 7.3.0.7030031 Please feel free to add an unit test if you find that the stack based implementation shows better behavior with locality.

arthurprs commented 8 years ago

That's very good, I wanted some tests while writing this. Although, I think they're not really good as 97+% of the cpu time is spend inside rand() and the fill pattern it's not deterministic.

I updated the code a bit (second commit in this PR) trying to make it more repeatable and a bit less artificial. I had to increase the iterations a lot to mask the noise from spawning/joining the threads and the fill amount shouldn't be too long otherwise it's impossible to measure what we really want to measure.

stack

➜  build git:(master) ✗ tests/unit/mempool_test
basic test with 500000 runs of 8 bins x 1048576bytes in 391647 us PASSED
multi-thread test with 8 threads each with 500000 runs of 8 bins x 1048576bytes in 1107800 usec PASSED

queue

➜  build git:(bebf491) ✗ tests/unit/mempool_test
basic test with 500000 runs of 8 bins x 1048576bytes in 437160 us PASSED
multi-thread test with 8 threads each with 500000 runs of 8 bins x 1048576bytes in 1199590 usec PASSED

This is probably the best case scenario for deque as most stdlib implementations will fit 8 items per node just fine and won't cause any malloc/free. Also, it hardly stress locality, as the benchmark quickly puts everything into L3 and L2.

hisundar commented 8 years ago

Thanks for the patch, Arthur! If this is your first code contribution to Couchbase, you'll need to fill out our Contributor License Agreement (http://review.couchbase.org/#/settings/agreements), which basically just declares that you created the code and are allowing us to release it under the Apache license.

arthurprs commented 8 years ago

I just filled it and rebased the patch.

arthurprs commented 8 years ago

Are you still interested in this? If yes I can rebase it and fix the conflicts.

hisundar commented 8 years ago

Yes, we are, sorry for the delay, Arthur. As it turns out I don't have the permission to merge pull requests. But I have notified the ones who do. Thanks again

arthurprs commented 8 years ago

I've rebased the commits on top of master.