allen8807 / memcached

Automatically exported from code.google.com/p/memcached

TOP_KEYS feature fixes #202

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
I did some real basic bench testing against the TOP_KEYS feature while
working on the benchmark stuff.

The basic results:

Normal:   400,000 gets/sec
TOP_KEYS: 200,000 gets/sec

8 threads were used. Increasing the thread count didn't increase performance.
It appears to be stacking on locks and spending a lot of time in sprintf.

Arguments I've heard about this being okay:

- memcached is so fast that it doesn't matter if you cut the capacity in half
- users desperately need this so it's worth cutting capacity in half

Arguments I have against it:

- there will be things we do in the future that will slow it down, and ideally
we won't put ourselves in a position where enabling all of these features at
once halves the capacity several times over. Performance reduction via in-line
features should be strictly limited to features that cannot be implemented
outside of the daemon (i.e., via tcpdump and a script/program).

That said, I'm all for shipping a script or C app with memcached for doing 
"topkeys-like" quick analysis.

- I dunno, that's basically it.

Approaches for fixing the issue:

- I'd prefer to tear it out and develop it in a branch, then add it back in 
during 1.6.1 if it's fixable, or relegate it to a module or engine extension 
and leave it out.

- The feature samples all keys and is enabled at start time via an environment
variable. The only user-friendly bit about the whole thing is the information
it eventually gives you. Neither math nor usability backs up why the feature is
the way it is; it should use a sampling rate (defaulting to 0, changeable at
runtime). One sample every 1,000+ commands on a busy server is probably 10x as
frequent as it needs to be to find the "top keys". (A rough sketch of what I
mean is at the end of this comment.)

- I *hate* this thing. This is maybe 1/3rd of the useful information you can
get out of a key stream, and it's not in any form that's extensible. If we
distribute another fast C-based app (possibly starting from perl or whatever),
you can find "top keys" via snapshots (the data only matters when you're
looking at it), and you can discern patterns from related top keys. The tools I
use to track down keys are often customized to look for common sections of keys
to see if particular features are going off-kilter.

So in short, either way you need a key-stream analysis method to actually get
useful information out of a running instance. Providing anything else without a
method of getting the full picture is just flatly misleading.
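
To make the sampling idea above concrete, here is a rough sketch of what a
runtime-tunable 1-in-N check in the command path could look like. The names
(key_sample_rate, should_sample_key, etc.) are made up for illustration and
aren't anything that exists in the tree:

#include <stdatomic.h>
#include <stdbool.h>

/* 0 means sampling is disabled; changeable at runtime without a restart. */
static _Atomic unsigned int key_sample_rate = 0;

/* Per-thread countdown, so the fast path needs no locking at all. */
static __thread unsigned int sample_countdown = 0;

/* Called once per command; cheap unless a sample is actually taken. */
static inline bool should_sample_key(void)
{
    unsigned int rate = atomic_load_explicit(&key_sample_rate,
                                             memory_order_relaxed);
    if (rate == 0)
        return false;
    if (sample_countdown > 0) {
        sample_countdown--;
        return false;
    }
    sample_countdown = rate;   /* take this sample, then skip N again */
    return true;
}

/* A stats/settings command handler could adjust the rate on the fly. */
static void set_key_sample_rate(unsigned int n)
{
    atomic_store_explicit(&key_sample_rate, n, memory_order_relaxed);
}

On the common path that is just a decrement and a compare, so it shouldn't
show up in a profile the way the per-key sprintf does.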

Original issue reported on code.google.com by dorma...@rydia.net on 15 Apr 2011 at 7:23

GoogleCodeExporter commented 9 years ago
As an alternative, including sFlow instrumentation has minimal overhead 
(approximately the cost of adding one more performance counter), but provides 
full details about keys and operations allowing top keys, missed keys etc. to 
be monitored.

sFlow reduces the overhead on the Memcached server by exporting a random sample 
of memcache operations to a remote collector. The collector receives samples 
from every server in the cluster and calculates top keys etc. This architecture 
is extremely scalable and flexible. It is only the collector that needs to be 
modified in order to calculate additional statistics (like top keys with a 
particular prefix and operation etc). Since sFlow is also used to monitor 
network traffic and server performance, the sFlow collector can put the 
information together to provide a comprehensive view of cluster performance.
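
To give a rough sense of the arithmetic (illustrative numbers, not from any
measurement): at a 1-in-5000 sampling rate, a key that appears 40 times in the
collected samples represents roughly 40 x 5000 = 200,000 operations, and the
sampling error on that estimate is on the order of 1/sqrt(40), about 16%,
which is more than accurate enough for ranking top keys.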

The patch needed to add sFlow to Memcached is in this ticket and could easily
be ported to the 1.6 branch:

http://code.google.com/p/memcached/issues/detail?id=157

For more information on how sFlow works, and the type of data you can get, see:

http://blog.sflow.com/search/label/Memcache

Original comment by peter.ph...@gmail.com on 15 Apr 2011 at 2:19

GoogleCodeExporter commented 9 years ago
Neil came by at the last hackathon, and I think we talked through how an engine 
similar to bucket engine[1] could be implemented to provide the sFlow 
extensibility.  I don't know what the current thoughts are on issue 157 but I 
think we were suggesting the engine as the best path.  

This whole thing is different from TOP_KEYS. TOP_KEYS works with the existing
protocol and clients without needing extra or external tools.

[1] https://github.com/membase/bucket_engine

Original comment by ingen...@gmail.com on 15 Apr 2011 at 6:05

GoogleCodeExporter commented 9 years ago
It's worth thinking about why measurements like TOP_KEYS are important. This 
type of measurement is there to help improve performance. If the TOP_KEYS 
function kills performance by 50%, it's hard to see the justification for 
turning it on since the feature provides limited data and is difficult to 
manage in an operational setting (it can't easily be enabled, disabled or 
reconfigured).

It is convenient to have the measurements calculated by the server and made 
available through the memcache protocol, but that convenience comes at a huge 
cost. Shifting the analysis away from the servers means that you get a great 
deal more flexibility, with minimal overhead on the server. For example, in 
addition to reporting TOP_KEYS, you can analyze sFlow data to report on top 
missed keys - very helpful for improving cache hit rates. Calculating 
additional metrics using sFlow involves no additional work on the servers, 
whereas each time you add an additional metric like TOP_KEYS on the server you 
cut performance by an additional 50%.

In any case, it is likely that you would use an external application to analyze 
the performance metrics and produce charts and reports regardless of whether 
memcache or sFlow is used to transport the metrics.

What is the overhead of inserting an extra engine in the chain? My concern 
would be that the cost of adding the instrumentation as a module might be high 
- reducing the value of the instrumentation. The optimal location for sFlow 
would be in the protocol engine where the counters are updated, since the sFlow 
hook in the performance path essentially involves maintaining one additional 
counter.

Original comment by peter.ph...@gmail.com on 15 Apr 2011 at 7:25

GoogleCodeExporter commented 9 years ago
Hi all,

Yes,  I showed up for the hackathon but was too lazy to stay all night and 
actually do the work :)

I guess I was hesitating because it wasn't clear if anyone was going to try it.
I didn't want to write an sFlow engine-shim just to commit it to the void.
If there is a real desire to see this problem solved,  and there is consensus 
that sFlow's "random-sampling with immediate forwarding" approach is the best 
way to do it,  then I'm happy to go ahead.  It certainly seems like there is 
now a clearer understanding of the need to do this without impacting 
performance,  so perhaps the time is right?

So to summarize, the questions are:
(1) "If I write this will you test it?" and
(2) "If it works great, will you bundle it with the default download?"

It may help to know that there are a number of freeware tools out there that
can receive and process sFlow in various ways, and there are also a number of
other sFlow agents that are free and open-source. Here are some examples:
http://ganglia.sourceforge.net
http://mod-sflow.googlecode.com
http://host-sflow.sourceforge.net
http://www.inmon.com/technology/sflowTools.php

(and of course there is also overwhelming support for this approach on the 
network equipment side:
http://sflow.org/products/network.php)

Anticipating where this may lead,  I think the big carrot is that down the line 
you may find you can remove the top-keys code from the default engine and clean 
up the critical path a little.

Thoughts?

Neil

Original comment by neil.mck...@gmail.com on 15 Apr 2011 at 8:51

GoogleCodeExporter commented 9 years ago
It's hard to predict what the adoption will be.  Having something entirely
optional, such that the code isn't even in the binary unless someone is
interested in it, is great, though.

Original comment by dsalli...@gmail.com on 16 Apr 2011 at 1:11

GoogleCodeExporter commented 9 years ago
Having sflow support would be nice (probably pretty great, even), but as an 
option to folks who want sflow. I think it's orthogonal to this thing.

topkeys has the usability right, but the math wrong.
sflow has the math (closer), but the usability wrong (for this particular need 
we have).

I imagined a similar feature a few years ago after talking with a former NDB 
guy and looking at varnish, which would somewhat intrinsically allow sflow and 
avoid tcpdump... It's stupid to describe it here, but in short it's that 
pattern where you write logs into a ringbuffer and allow listeners to stream 
the logs (+ a runtime tunable sampling ratio). So "topkeys" ends up being 
closer to what varnishlog/varnishhist/varnishtop are. I love those utilities so 
much I wanted them for memcached, but not enough to actually write the feature 
myself... yet...
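
For what it's worth, a stripped-down sketch of that ring-buffer pattern
(hypothetical names; nothing like this exists in memcached today):

#include <stdint.h>
#include <string.h>

#define LOG_RING_SLOTS 4096
#define LOG_KEY_MAX    250

struct log_record {
    uint64_t seq;        /* lets a slow listener detect records it missed */
    uint8_t  op;         /* get/set/delete/... */
    uint8_t  keylen;
    char     key[LOG_KEY_MAX];
};

static struct log_record log_ring[LOG_RING_SLOTS];
static uint64_t log_head;                      /* writer position */
static unsigned int log_sample_ratio = 1000;   /* runtime tunable: 1 in N */

/* Called from the command path; a real version would need per-thread rings
 * (or an atomic slot claim) since memcached has multiple worker threads. */
static void log_key_access(uint8_t op, const char *key, uint8_t keylen)
{
    static unsigned int countdown;

    if (log_sample_ratio == 0 || countdown++ % log_sample_ratio != 0)
        return;                                /* only record 1 in N ops */

    struct log_record *rec = &log_ring[log_head % LOG_RING_SLOTS];
    rec->op = op;
    rec->keylen = keylen;
    memcpy(rec->key, key, keylen);
    rec->seq = ++log_head;                     /* listeners skip seq gaps */
}

A varnishtop-style listener would attach, remember the last seq it saw, and
stream whatever shows up after it; the server never blocks on listeners.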

So all I can do is grandstand at someone who's already written a feature that's 
partway there? Please understand that I do feel pretty stupid pushing back on 
this already working thing. However, it smells like a customer feature and we 
*know* there are better ways of doing this. Is the only reason to keep it 
exactly the way it is because it's already done and you have customers who rely 
on it?

Remember it's very very hard to change something like this after the fact. I'd 
rather be in varnish's position at the end of the day.

Original comment by dorma...@rydia.net on 16 Apr 2011 at 7:19

GoogleCodeExporter commented 9 years ago
If I have understood correctly, I think you might want both features.  A
ring-buffer that clients can connect to and stream back from, using server-side
filtering and sampling, might be great for troubleshooting a single node.
However the sFlow feature is aimed more at continuous operational monitoring of 
a whole cluster.  If you have 1000 nodes in your cluster you wouldn't really 
want to open 1000 TCP connections from the monitoring client, so UDP logging 
makes more sense (especially if you are sampling anyway - lost packets are just 
an unintended adjustment to the sampling rate).  With UDP logging it's more 
natural to pack the fields into a structure than to send a delimited ASCII 
string,  so that's how you end up with the XDR encoding that sFlow uses.  It's 
easy enough to turn it back into an ASCII stream at the client if you want to.  
That's what the "sflowtool" freeware does.  I guess you could think of the UDP 
as an efficient mechanism to multiplex all the samples together from all the 
nodes.
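
To illustrate the "pack the fields into a structure" point, here is a made-up
layout for illustration only; real sFlow samples are XDR-encoded structures,
not this:

#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

struct mc_sample {               /* host-order for brevity; a real wire    */
    uint32_t op;                 /* format would convert to network order  */
    uint32_t status;             /* (XDR is big-endian)                    */
    uint32_t duration_us;
    uint32_t value_bytes;
    uint32_t keylen;
    char     key[250];
};

/* Fire one sampled operation at the collector. Lost datagrams just lower
 * the effective sampling rate, as noted above, so there is no retry. */
static ssize_t send_sample(int sock, const struct sockaddr_in *collector,
                           const struct mc_sample *s)
{
    size_t len = offsetof(struct mc_sample, key) + s->keylen;
    return sendto(sock, s, len, 0,
                  (const struct sockaddr *)collector, sizeof(*collector));
}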

I'm persuaded that there is at least some curiosity here,   so I'll have a go 
at adding the engine shim.  (Just finishing a module for nginx first...).   
Dustin persuaded me that the engine shim will only add one indirect function 
call to the critical path.  That's a lot more cycles than the "if(--skip==0)" 
we had before,  but my guess is that it still won't noticeably impact 
performance(?)   Perhaps we can revisit that in due course.

Original comment by neil.mck...@gmail.com on 17 Apr 2011 at 9:29

GoogleCodeExporter commented 9 years ago
OK!  When you have a moment please try this:

https://github.com/sflow-nhm/memcached

./configure --enable-sflow

I forked from the "engine" branch,  and added sFlow support.  There is no 
additional locking in the critical path, so this should have almost no impact 
on performance provided the sampling 1-in-N is chosen sensibly.   (Please test 
and confirm!)

In addition to cluster-wide top-keys, missed-keys etc., you also get
microsecond-resolution response-time measurements, the value size in bytes for
each sampled operation, and the layer-4 socket.  So the sFlow collector may
choose to correlate results by client IP/subnet/country as well as by cluster
node or any function of the sampled keys.

The logic is best described by the daemon/sflow_mc.h file where the steps are 
captured as macros so that they can be inserted in the right places in 
memcached.c with minimal source-code footprint.  The sflow_sample_test() fn is 
called at the beginning of each ascii or binary operation,  and it tosses a 
coin to decide whether to sample that operation.  If so,  it just records the 
start time.  At the end of the transaction,  if the start_time was set then the 
sFlow sample is encoded and submitted to be sent out.
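
In pseudo-C, the flow is roughly this (paraphrased with invented names; the
actual macros are in daemon/sflow_mc.h):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

struct conn_sample_state {
    bool     sampling;        /* was this operation picked by the coin toss? */
    uint64_t start_us;
};

static uint64_t now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Stand-in for the real encode-and-send step. */
static void submit_sample(const char *key, size_t keylen,
                          size_t value_bytes, uint64_t duration_us)
{
    (void)key; (void)keylen; (void)value_bytes; (void)duration_us;
}

/* Start of each ascii/binary operation: toss the 1-in-N coin and, only if
 * this operation is sampled, record the start time. */
static void sample_op_start(struct conn_sample_state *st, unsigned int rate_n)
{
    st->sampling = (rate_n > 0) && (rand() % rate_n == 0);
    if (st->sampling)
        st->start_us = now_us();
}

/* End of the operation: if it was sampled, compute the response time and
 * hand the record off to be encoded into an sFlow datagram and sent. */
static void sample_op_end(struct conn_sample_state *st, const char *key,
                          size_t keylen, size_t value_bytes)
{
    if (!st->sampling)
        return;
    submit_sample(key, keylen, value_bytes, now_us() - st->start_us);
    st->sampling = false;
}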

To configure for 1-in-5000,  edit /etc/hsflowd.auto to look like this:

rev_start=1
polling=30
sampling.memcache=5000
agentIP=10.211.55.4
collector=127.0.0.1 6343
rev_end=1

Insert the correct IP address for agentIP.

If you compile and run "sflowtool" from the sources,  you should see the ascii 
output:
http://www.inmon.com/technology/sflowTools.php

For more background and a simple example,  see here:
http://blog.sflow.com/2010/10/memcached-missed-keys.html

The periodic sFlow counter-export is not working yet (that's what the 
polling=30 setting is for).  I think the default-engine needs to implement the 
.get_stats_block API call before that will work.  Let me know if you want me to 
try adding that.

Best Regards,
Neil

P.S.  I did try to do this as an engine-shim, but the engine protocol is really
a different, internal protocol.  There was not a 1:1 correspondence with the
standard memcached operations.

Original comment by neil.mck...@gmail.com on 20 May 2011 at 2:43

GoogleCodeExporter commented 9 years ago
Trond's pulled this from engine-pu.  We'll do it better/externaler next time.

Original comment by dsalli...@gmail.com on 28 Sep 2011 at 8:16