apache / incubator-pagespeed-mod

Apache module for rewriting web pages to reduce latency and bandwidth.
http://modpagespeed.com
Apache License 2.0

Support AWS ElastiCache #806

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Amazon ElastiCache provides a managed memcached service, but due to its elastic 
nature the node list can vary over time, and for that purpose they provide an 
auto discovery endpoint. It would be nice if PageSpeed supported ElastiCache, 
instead of each of us designing a self-modifying dynamic configuration that 
monitors the autodiscovery endpoint, rebuilds nginx.conf upon a detected 
change, and restarts Nginx to bring the new node list into effect.

For more info: 
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/AutoDiscovery.html

(Originally reported as https://github.com/pagespeed/ngx_pagespeed/issues/544)

Original issue reported on code.google.com by jefftk@google.com on 21 Oct 2013 at 2:18

GoogleCodeExporter commented 9 years ago
There does not appear to be a C++ API to ElastiCache.  It would be nicer anyway 
if there were a memcached protocol proxy that could do this work.

Restarting nginx periodically to account for ElastiCache reconfigurations 
sounds cumbersome.

How often do the ElastiCache configurations change?

Original comment by jmara...@google.com on 21 Oct 2013 at 2:34

GoogleCodeExporter commented 9 years ago
Restarting nginx is the workaround, not the suggested fix.

It looks like what they want is called "ElastiCache Node Auto Discovery", and 
the doc is: 
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/AutoDiscovery.html#AutoDiscovery.ModifyApp

It looks like they have PHP and Java APIs.  I can't find the source code or 
protocol documentation, though.

Original comment by jefftk@google.com on 21 Oct 2013 at 2:49

GoogleCodeExporter commented 9 years ago
The autodiscovery is in fact very simple. I will provide you with documentation 
on the protocol (once I get it myself), but it's pretty simple, and once you 
get the list of nodes you just use the normal memcached protocol. I can donate 
a small cluster for you to develop and test with.

Original comment by nikolaynkolev on 21 Oct 2013 at 3:26

GoogleCodeExporter commented 9 years ago
I suspect that spontaneously changing the set of memcached nodes connected to a 
running apache or nginx server, which are both multi-process architectures, is 
going to be difficult.

The fastest way to solve this problem, IMO, is via a proxy that speaks normal 
memcached out one side, and connects to the elastic service on the other.

Original comment by jmara...@google.com on 21 Oct 2013 at 3:35

GoogleCodeExporter commented 9 years ago
There are proxies, but they increase latency. The whole point of adopting 
ElastiCache (and I'm sure other services may adopt the same autodiscovery 
protocol, too) is that you still talk to the nodes directly, without any 
overhead.

I did find information on the protocol. It's quite simple indeed. You send a 
"config get cluster\r\n" get command to the autodiscovery endpoint and then 
receive lines of data terminated by "END\r\n". The first line is probably some 
response identifier (I will find out in a bit), the second is the memcached 
version number. The third is a space-delimited list of <host>|<ip>|<port> 
entries, one per active memcached node.
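The exchange described above can be sketched as a small parser. This is only a sketch based on the format just described: the sample payload below is illustrative rather than captured from a real cluster, and the second line is treated here as a numeric configuration version, as the AWS auto-discovery docs describe it.

```python
# Sketch: parse the payload of a "config get cluster" reply into
# (config_version, [(host, ip, port), ...]). Assumes the three-line
# layout described above: CONFIG header, version number, node list.

def parse_cluster_config(response: str):
    lines = [ln for ln in response.split("\r\n") if ln and ln != "END"]
    # lines[0] is the CONFIG header, lines[1] the configuration version,
    # lines[2] the space-delimited <host>|<ip>|<port> node list.
    version = int(lines[1])
    nodes = []
    for entry in lines[2].split(" "):
        host, ip, port = entry.split("|")
        nodes.append((host, ip, int(port)))
    return version, nodes

# Illustrative sample payload (hostnames and byte count are made up):
sample = (
    "CONFIG cluster 0 136\r\n"
    "12\r\n"
    "cache-a.example.com|10.0.0.1|11211 cache-b.example.com|10.0.0.2|11211\r\n"
    "\r\n"
    "END\r\n"
)

version, nodes = parse_cluster_config(sample)
print(version, nodes)
```

Once the node list is parsed, each node speaks the ordinary memcached protocol.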

Original comment by nikolaynkolev on 21 Oct 2013 at 3:45

GoogleCodeExporter commented 9 years ago
A quick search for "memcached proxy" indicates this is a thing.  It looks like 
other folks want a proxy that does automated discovery for ElastiCache: 
https://forums.aws.amazon.com/message.jspa?messageID=420734#  

Original comment by jmaes...@google.com on 21 Oct 2013 at 3:47

GoogleCodeExporter commented 9 years ago
There's a good memcached/Redis proxy by Twitter 
(https://github.com/twitter/twemproxy), but it over-complicates your 
infrastructure and is too expensive for smaller projects. For bigger projects 
you need two of them for redundancy, and even then it cannot beat the direct 
approach on speed. Again, ElastiCache is a major-enough service to be 
supported, and the autodiscovery is a very simple memcached-protocol-compliant 
extension. I haven't done C++ programming recently, but maybe I can give it a 
try; the algorithm is pretty simple. You still maintain a list of memcached 
nodes; when you cannot connect to one, you refresh the list of nodes from the 
autodiscovery endpoint and retry, up to some retry limit. Of course, you 
refresh the list of nodes at startup, and you can also refresh every N seconds.
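A minimal sketch of that refresh-and-retry loop, assuming hypothetical `fetch_node_list` and `connect` helpers standing in for the auto-discovery query and a memcached client connection:

```python
# Sketch of the algorithm above: refresh the node list at startup, on a
# timer, and after any connection failure, with a bounded retry count.
# fetch_node_list and connect are hypothetical stand-ins, not a real API.
import time

REFRESH_INTERVAL_S = 60   # "refresh every N seconds"
MAX_RETRIES = 3

class NodeSet:
    def __init__(self, fetch_node_list):
        self.fetch = fetch_node_list
        self.nodes = self.fetch()              # refresh at startup
        self.last_refresh = time.monotonic()

    def maybe_refresh(self, force=False):
        if force or time.monotonic() - self.last_refresh > REFRESH_INTERVAL_S:
            self.nodes = self.fetch()
            self.last_refresh = time.monotonic()

    def get_with_retry(self, connect, key):
        for attempt in range(MAX_RETRIES):
            # After a failed attempt, force-refresh from autodiscovery.
            self.maybe_refresh(force=attempt > 0)
            node = self.nodes[hash(key) % len(self.nodes)]
            try:
                return connect(node).get(key)
            except ConnectionError:
                continue
        raise ConnectionError("all retries exhausted")

# Demo with fake helpers:
demo = NodeSet(lambda: ["10.0.0.1:11211", "10.0.0.2:11211"])

class FakeConn:
    def __init__(self, node):
        self.node = node
    def get(self, key):
        return f"value-of-{key}"

print(demo.get_with_retry(lambda node: FakeConn(node), "some-key"))
```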

Original comment by nikolaynkolev on 21 Oct 2013 at 4:05

GoogleCodeExporter commented 9 years ago
It would be useful to look at data: how much latency would a local proxy add?  
I think it should add < 1ms.

Also consider the multi-process architecture of Apache.  Each process -- and 
there can be dozens or more -- on each server will have to maintain connections 
to each backend.  I'm not sure how many backends an ElastiCache cluster would 
have, but this is an NxM equation, and we don't have total control over N (the 
number of apache or nginx processes) as it can grow with load.

From an implementation perspective we'd need to coordinate the new 
memcached-host-set between multiple processes, each with their own connections.

A proxy might also be able to improve performance at the protocol level.  It 
could batch together multiple GET requests coming from the different 
apache/nginx subprocesses and thus reduce round-trips to the memcached 
backends.
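The batching idea relies on the memcached text protocol's multi-key get: several pending single-key reads can be coalesced into one request line. A sketch of the request-building side (the key names are illustrative):

```python
# Sketch: coalesce pending GETs into one memcached multi-key request,
# turning N round-trips to a backend into one. The memcached text
# protocol accepts "get <key> <key> ...\r\n".

def batch_get_request(keys):
    return "get " + " ".join(keys) + "\r\n"

print(batch_get_request(["page:/index.html", "img:/logo.png"]))
# -> "get page:/index.html img:/logo.png\r\n"
```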

I'm not familiar with the twitter proxy.  In what way is it more expensive for 
smaller projects?

Original comment by jmara...@google.com on 21 Oct 2013 at 5:37

GoogleCodeExporter commented 9 years ago
I want to clarify the last comment about multiple processes, as it brings up 
another question.

I am wondering what happens to the cache semantics when the configuration 
changes.  It seems like this might be effectively a cache flush, depending on 
how the entries are sharded across memcached instances.  Spontaneous cache 
reconfiguration might have negative effects on performance.

See the remarks in the Apache Portable Runtime library documentation for 
apr_memcache_add_server at 
http://apr.apache.org/docs/apr/2.0/group___a_p_r___util___m_c.html#ga9d6381d9d9f8f83f4fa5450cc8066590

    Adding servers is not thread safe, and should be done once at startup.
    Warning:
    Changing servers after startup may cause keys to go to different servers.

Looks like a proxy is the only way we could get this to work at all in 
Apache/nginx (where we also use the Apache memcache interface).  And even with 
a proxy, a cache reconfiguration will force a flush if we are not careful.  The 
proxy could in theory shard using a better algorithm than apr_memcache, so 
that resizing does not evict all the entries.
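A quick illustration of why a server-set change can act like a cache flush: with naive modulo sharding, most keys map to a different server once the server count changes (the hash function and key names here are arbitrary):

```python
# Sketch: count how many keys land on a different server when the
# server count changes under naive hash(key) % N sharding.
import hashlib

def shard(key, num_servers):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_servers

keys = [f"key-{i}" for i in range(10000)]
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of keys moved going from 4 to 5 servers")
```

For a uniform hash, only keys with hash % 20 < 4 keep the same shard, so roughly 80% of entries move and effectively become cache misses.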

What's the use-case for dynamically changing the hash configuration?  How 
frequently would this occur?

Original comment by jmara...@google.com on 21 Oct 2013 at 5:58

GoogleCodeExporter commented 9 years ago
I wasn't aware that PageSpeed has a separate instance per httpd process. My 
use case is, in fact, Nginx; that is where I originally filed the feature 
request with Jeff, not realizing that this functionality actually lives in 
PSOL.

The change can occur either planned (sizing your cache cluster up or down) or 
unplanned, when AWS rebuilds unhealthy nodes.

Original comment by nikolaynkolev on 21 Oct 2013 at 6:28

GoogleCodeExporter commented 9 years ago
I think for the purposes of this feature, nginx and Apache are identical.  We 
are using the same C library in either case, and changing the server set is not 
thread-safe after startup.

I think a proxy is necessary to deal with the dynamically changing server-sets 
by sharding using a consistent hash function: 
http://en.wikipedia.org/wiki/Consistent_hashing.
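A minimal consistent-hash ring sketch showing the property in question: keys map to the first node clockwise on a hash ring, so adding one node only remaps the keys that fall in its arc. The virtual-node count and helper names are illustrative, not taken from twemproxy or PSOL.

```python
# Sketch of consistent hashing: each node gets many virtual points on a
# ring; a key is owned by the first point at or after hash(key). Adding
# a node remaps only ~1/(n+1) of the keys instead of nearly all of them.
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=100):
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(replicas))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"key-{i}" for i in range(10000)]
before = Ring(["a", "b", "c", "d"])
after = Ring(["a", "b", "c", "d", "e"])
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print(f"{moved / len(keys):.0%} of keys moved after adding one node")
```

Going from 4 to 5 nodes this moves roughly 20% of keys, versus roughly 80% under modulo sharding, which is why a consistent hash makes a live reconfiguration survivable.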

I am curious to know whether the Twitter proxy does this.  Then it would just 
be a question of teaching it the Amazon protocol.

Original comment by jmara...@google.com on 21 Oct 2013 at 7:25

GoogleCodeExporter commented 9 years ago
> for the purposes of this feature, nginx and Apache are identical

With nginx the NxM problem is a 1xM problem because there's only one process, 
so not as bad.

Original comment by jefftk@google.com on 22 Oct 2013 at 2:03

GoogleCodeExporter commented 9 years ago
Sorry, I'm wrong.  Nginx generally has multiple worker processes, each with a 
connection to each memcache.  So it's still NxM, though with typically a very 
low N.

Original comment by jefftk@google.com on 22 Oct 2013 at 2:43