cityindex-attic / logsearch

[unmaintained] A development environment for ELK
Apache License 2.0

Analyze/Implement Auto Scaling for managed queue (Redis) #269

Closed · sopel closed this issue 10 years ago

sopel commented 10 years ago

This has been extracted from #149: We currently use Redis for the inbound log queue; however, it is not designed for clustering at all:

With Redis, a dataset cannot grow beyond a single master server. Sophisticated users sometime shard their datasets on the application level, but this can get very complex and error prone. The open source community is working on “Redis Cluster”, which is designed to overcome the Redis scale-out limitation. However, there is not yet a release date for this technology. Moreover, it is already known that when available, Redis Cluster will not support many popular Redis commands or data types.
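The "shard their datasets on the application level" approach mentioned in the quote could be sketched roughly like this (a hypothetical illustration; the shard labels are made up, and a real deployment would hold actual Redis connections instead of address strings):

```python
# Hypothetical sketch of application-level sharding across several
# independent Redis masters. Each key is hashed to pick one shard, so
# no single master has to hold the whole dataset. This is exactly the
# kind of logic that becomes complex and error-prone once resharding,
# failover and multi-key operations enter the picture.
import hashlib

SHARDS = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]  # made-up nodes

def shard_for(key: str, shards=SHARDS) -> str:
    """Deterministically map a key to one Redis master."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

Note that this scheme breaks down for commands spanning multiple keys, which is one reason the quote warns that even Redis Cluster will not support many popular commands or data types.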

A related discussion in #169 has been correctly summarized by @mrdavidlaing in https://github.com/cityindex/logsearch/issues/169#issuecomment-23907109:

What we really want is a managed queue rather than a managed cache (irrespective of the underlying technology)

However, this piece of our infrastructure seems to be performing really well currently, so I think it's pretty low priority.

:question: Given the goal at hand, it might be time to reconsider the latter and switch to an actual queue?

Otherwise the conclusion would be to still skip this tier for now and attend others first (i.e. Elasticsearch).

dpb587 commented 10 years ago

I think @mrdavidlaing's comment, as I understood it, is being taken out of context. His comment there was against moving to AWS ElastiCache's Redis implementation, not necessarily against Redis in general: ElastiCache is not fault tolerant and would lose any queued messages during an outage, whereas our current implementation has the AOF to recover from.
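For context, the AOF-based recovery mentioned here relies on Redis persistence settings along these lines in `redis.conf` (a sketch; the actual values in our deployment may differ):

```conf
# Enable the append-only file so queued messages survive a restart
appendonly yes
# fsync once per second: at most ~1s of messages at risk on a crash
appendfsync everysec
# Directory holding the AOF; pointing this at a persistent volume
# would also let the file survive instance termination
dir /var/lib/redis
```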

Another idea is to see if there's some sort of round-robin Redis proxy that would make it easier to shard data without having to mess with logstash, on either the shipping or the parsing side.

That being said, an easy, cluster-friendly alternative piece of software might very well be helpful to us, although none specifically come to mind yet.

mrdavidlaing commented 10 years ago

@dpb587 - no, @sopel has my intent correct. I think that if we're going to switch out this component, we should switch to something that is specifically designed to be a queue (e.g., RabbitMQ).

However, our single node redis is working brilliantly currently, and it can survive a restart due to the use of the AOF, so I don't see any short term advantage to moving away from redis just to get better clustering / fault tolerance.

dpb587 commented 10 years ago

My mistake; thanks for clarifying. I hadn't thought of RabbitMQ; I hear about it frequently but haven't personally had the opportunity to work with it yet.

sopel commented 10 years ago

@dpb587 - circling back to the Auto Scaling and fault tolerance topics: You correctly point out the AOF backing here (further detailed in https://github.com/cityindex/logsearch/issues/169#issuecomment-23895572), but I'm afraid we are not actually using that properly right now:

Put another way, given the following deployment instructions:

  1. From EC2/Instances, find the active redis instance and [stop] it.
  2. From EC2/Instances, wait for active redis [instance state] to become [stopped].

How do we ensure that Redis is only stopped after all messages have been consumed?

dpb587 commented 10 years ago

You make a great point. Currently our deployment assumes that we don't have an active backlog (or at least a very marginal one, whose messages do indeed get lost). I ordered the deployment instructions such that redis gets terminated first to prevent us from receiving more messages, then the logstash group to ensure their shutdown flushes their in-memory messages on to elasticsearch, then finally elasticsearch.

Using an EBS volume would ensure we do not lose any messages at all. We should probably investigate that route - an EBS volume for the AOF isn't something I'd really considered; until now I've just accepted that we might lose up to ~100 messages during a full cluster deploy.

mrdavidlaing commented 10 years ago

We could prevent the message loss by disabling inbound redis messages at the firewall level, letting logstash "drain" the redis queue, and then stopping the redis server.
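The drain-then-stop step could be sketched like this (a hypothetical helper, not existing tooling; `get_queue_len` stands in for something like `redis-cli LLEN logstash` and is injected here so the logic can be shown without a live Redis server):

```python
# Sketch of "drain before stopping": after blocking new inbound
# messages at the firewall, poll the queue length until the logstash
# consumers have emptied it; only then is it safe to stop Redis.
import time

def wait_for_drain(get_queue_len, poll_interval=1.0, timeout=300.0) -> bool:
    """Return True once the queue is empty, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_queue_len() == 0:
            return True
        time.sleep(poll_interval)
    return False
```

With this in place, the deployment steps would become: block the firewall port, call `wait_for_drain(...)`, and stop the Redis instance only if it returned `True`.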

mrdavidlaing commented 10 years ago

Ideally we should be in a position to do rolling updates to prevent data loss, which would require a clustered queue solution. E.g., Queue A + Queue B form a cluster. When we update, we remove Queue A from the pool and all traffic fails over to Queue B. We update Queue A and rejoin it; then we retire Queue B, upgrade it, and rejoin it.

When we really start worrying about zero downtime deploys I think this is the architecture we should use.

sopel commented 10 years ago

@dpb587 has summarized our requirements from a high-level PoV above:

That being said, an easy, cluster-friendly alternative piece of software might very well be helpful to us, although none specifically come to mind yet.

I think Apache Kafka might be exactly that piece of software (see #273 for the evaluation).

mrdavidlaing commented 10 years ago

I think Apache Kafka should definitely be on our shortlist; although we need to consider whether logstash can push / pull from it.

sopel commented 10 years ago

Cool, so let's handle these questions via #273 indeed.

I guess what I'm trying to promote here is that many of the challenges of operating a distributed, fault-tolerant cluster in general, and large-scale message-processing pipelines in particular, are already solved, and solved better, elsewhere (as much as this is obviously a very hot topic just entering the main stage).

What we are doing with great success is assembling and tailoring available high-quality components for our specific use case, and ideally generalizing that again in turn :)

sopel commented 10 years ago

So now that I've seeded my currently preferred mid-term architecture/solution, it's time to circle back to @mrdavidlaing's assessment:

However, our single node redis is working brilliantly currently; and can survive a restart due to the use of AOT so I don't see any short term advantage to moving away from redis just to get better clustering / fault tolerance.

As outlined, I think we should consider addressing the EBS handling, which is also a precondition for auto scaling that single-instance tier as it is (for spot usage, automatic recovery and vertical scaling, i.e. not for scaling out).

sopel commented 10 years ago

In case we decide not to do this right now and accept the occasional minor loss of data on deployments (or handle it manually as suggested), we should probably icebox this issue until #273 is resolved, insofar as it seems we cannot do much about Redis clustering and fault tolerance without reinventing wheels already produced in much higher quality elsewhere.

dpb587 commented 10 years ago

I've never heard of Kafka, but skimming the docs it looks valuable. Unfortunately, it doesn't seem like logstash supports it as an input. Logstash does, however, support rabbitmq...
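For reference, the queue tier is wired into logstash via an input block, so swapping the component would mostly mean swapping this configuration (a sketch of the Redis input we use today; host and key values here are placeholders, not our actual settings):

```conf
input {
  redis {
    host      => "127.0.0.1"
    data_type => "list"     # consume from a Redis list (BLPOP-style)
    key       => "logstash" # placeholder queue key
  }
}
```

A rabbitmq input block would have the same overall shape, which is what makes RabbitMQ an easy drop-in candidate from logstash's point of view.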

mrdavidlaing commented 10 years ago

w.r.t. our immediate needs, changing our queue component won't save us much money or give us much scalability. So I vote for iceboxing this until after we have the ES nodes sorted out (since they are the bulk - 70% - of our costs, and also our current performance bottleneck).

sopel commented 10 years ago

Moved to Icebox due to focus on #270.

sopel commented 10 years ago

Closed as Won't Fix in favor of migrating to Kafka, see #273 and https://github.com/logsearch/logsearch-boshrelease/issues/73.