basho / bitcask


Avoid file_server serialized functions #85

Closed · slfritchie closed this issue 10 years ago

slfritchie commented 11 years ago

For maximum parallelism, Bitcask should avoid using file I/O-related functions that end up calling the local file_server process.

For example, bitcask:needs_merge() calls filelib:is_file(), which is handled, and therefore serialized, by a single Erlang process: the registered name is file_server_2, implemented in the file_server module.

IIRC, the Hibari code uses some prim_file hackery to avoid sending file I/O to the local file_server_2 proc.
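For illustration, here is a minimal sketch (not the actual Hibari code) of a file-existence check that goes through prim_file instead of filelib:is_file/1, so concurrent callers do not queue up behind the file_server_2 process. The module name is hypothetical:

```erlang
%% Sketch: check whether a path exists without routing the request
%% through the registered file_server_2 process.  prim_file talks to
%% the file driver directly, so callers are not serialized behind a
%% single gen_server.
-module(bc_file_util).   %% hypothetical module name
-export([is_file/1]).

-include_lib("kernel/include/file.hrl").

-spec is_file(file:filename()) -> boolean().
is_file(Path) ->
    case prim_file:read_file_info(Path) of
        {ok, #file_info{type = regular}}   -> true;
        {ok, #file_info{type = directory}} -> true;  %% roughly filelib:is_file/1 semantics
        _                                  -> false
    end.
```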

slfritchie commented 11 years ago

For example, using Erlang tracing to watch entry & return of bitcask:needs_merge():

{2013,3,19} {3,13,26}: bitcask needs_merge: 113 msec
{2013,3,19} {3,14,18}: bitcask needs_merge: 116 msec
{2013,3,19} {3,14,18}: bitcask needs_merge: 114 msec
{2013,3,19} {3,14,23}: bitcask needs_merge: 101 msec
{2013,3,19} {3,15,40}: bitcask needs_merge: 104 msec
{2013,3,19} {3,15,40}: bitcask needs_merge: 106 msec
{2013,3,19} {3,16,4}: bitcask needs_merge: 121 msec
{2013,3,19} {3,16,40}: bitcask needs_merge: 101 msec
{2013,3,19} {3,17,18}: bitcask needs_merge: 108 msec
{2013,3,19} {3,17,18}: bitcask needs_merge: 108 msec
{2013,3,19} {3,18,40}: bitcask needs_merge: 115 msec
{2013,3,19} {3,18,40}: bitcask needs_merge: 118 msec
{2013,3,19} {3,19,26}: bitcask needs_merge: 145 msec
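The trace setup itself was not posted; a rough sketch of one way to collect such timings with the dbg tracer (assuming a reasonably modern OTP; everything beyond bitcask:needs_merge is illustrative) might look like this:

```erlang
%% Sketch: time bitcask:needs_merge calls with the dbg tracer.
%% The handler fun remembers the entry timestamp per calling process
%% and prints the elapsed time when the matching return_from arrives.
start_needs_merge_trace() ->
    Handler =
        fun({trace_ts, Pid, call, {bitcask, needs_merge, _Args}, TS}, Acc) ->
                maps:put(Pid, TS, Acc);
           ({trace_ts, Pid, return_from, {bitcask, needs_merge, _A}, _Res, TS}, Acc) ->
                case maps:take(Pid, Acc) of
                    {T0, Acc1} ->
                        Msec = timer:now_diff(TS, T0) div 1000,
                        io:format("bitcask needs_merge: ~p msec~n", [Msec]),
                        Acc1;
                    error ->
                        Acc
                end;
           (_Other, Acc) ->
                Acc
        end,
    {ok, _} = dbg:tracer(process, {Handler, #{}}),
    {ok, _} = dbg:p(all, [call, timestamp]),
    {ok, _} = dbg:tpl(bitcask, needs_merge, [{'_', [], [{return_trace}]}]).
```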
stefan-mees commented 11 years ago

We have the same behavior/issue in our cluster; that's also our explanation for why response times increase over time (the number of bitcask files per vnode grows over time).

The problem gets worse because bitcask:needs_merge() is scheduled for all vnodes at roughly the same time (a possible solution would be to schedule the first execution of needs_merge with a random offset per vnode; see the sketch below). You can see that the peaks in the response-time graph follow exactly the 3-minute interval of bitcask:needs_merge (http://puu.sh/35ezR.png).
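A hypothetical sketch of that jitter idea (the function and message names are illustrative, not the actual riak_kv/bitcask scheduling code):

```erlang
%% Sketch: spread the first merge check across the interval with a
%% random per-vnode offset, then fall back to the fixed interval.
-define(MERGE_CHECK_INTERVAL, timer:minutes(3)).

schedule_first_merge_check() ->
    Jitter = rand:uniform(?MERGE_CHECK_INTERVAL),   %% 1..180000 ms
    erlang:send_after(Jitter, self(), merge_check).

schedule_next_merge_check() ->
    erlang:send_after(?MERGE_CHECK_INTERVAL, self(), merge_check).
```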

We see the same behavior (blocked vnode processes, a huge increase in response times) when calling riak-admin vnode-status on a node that has a lot of bitcask files per vnode, which in the end leads to the same issue.

https://github.com/basho/bitcask/blob/07b24e80f6e7f1b3407eea198c4ea5260d6b2d99/src/bitcask.erl#L641 https://github.com/basho/bitcask/blob/07b24e80f6e7f1b3407eea198c4ea5260d6b2d99/src/bitcask.erl#L809

deniszh commented 11 years ago

We faced the same issue after upgrading to 1.3.1. I did a rolling restart today and, as if by magic, PUT time dropped from 150 ms to 2 ms! http://s15.postimg.org/3lrl036yz/put.png

Thanks for pointing in the right direction, Stefan. I hope this will be fixed soon; it's not very comfortable having to do rolling restarts...

stefan-mees commented 11 years ago

Could someone from @basho give this issue some love? We have to restart our production cluster every week... We are not 100% sure that this issue is the root cause of our response times increasing over time, but it looks that way to us.

Please see http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-May/012183.html and http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-May/012184.html

deniszh commented 11 years ago

It seems that our issue is (almost) gone after the first cluster restart. The current PUT 95th percentile has grown from 2000 µs to 2400 µs, but that looks OK. We still see big PUT values during peak load times... But it would be good to know that the issue is fixed :)

evanmcc commented 11 years ago

@stefan-mees I don't think that this is your issue, to be honest. I think what you're seeing there is the file monitor's total reduction count increasing over time (reductions are an Erlang work measurement) because it's a long-running process. 24M reductions isn't a huge number.
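(For reference, the accumulated reduction count of any long-running process can be read with process_info/2; file_server_2 below is just an example target, not necessarily the monitor process being discussed.)

```erlang
%% Example: read the accumulated reduction count of a registered process.
%% Returns {reductions, N}; N only ever grows for a long-running process,
%% so a large absolute value is expected.
erlang:process_info(whereis(file_server_2), reductions).
```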

I'll contact you via email so hopefully we can find out what's going on with your cluster.