cloud-gov / cg-atlas

Repository hosting issues and artifacts related to operations of the cloud.gov platform

Size govcloud production logsearch deployment #181

Open jmcarp opened 7 years ago

jmcarp commented 7 years ago

Since we don't have many tenants on govcloud yet, our logsearch deployment is smaller than on east, with only two data nodes (east is currently using ten). If the east deployment needs ten data nodes, we'll probably want to add more nodes to govcloud before asking more tenants to migrate.

I'm guessing this might be interesting to @datn, and I'm guessing @LinuxBozo or @sharms was involved in setting up the original cluster.

Acceptance criteria:

cnelson commented 7 years ago

This likely needs some attention in the near term.

Today we had an outage when the Elasticsearch master began reporting out-of-memory errors. We bumped its instance size to quadruple the available RAM as a temporary fix. However, even with the upgraded instance, bringing the cluster back from a bad state still took 1.5 hours to return to green, which IMO is far too long for a production system.
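For anyone watching a future recovery, a minimal sketch of polling the cluster health API until it goes green (Python; the endpoint and timing values are placeholders, not our actual deployment settings):

```python
# Minimal sketch: poll the Elasticsearch cluster health API until it reports
# "green". The endpoint and timing values are placeholders for illustration.
import time
import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

def wait_for_green(poll_interval=30, max_wait=2 * 60 * 60):
    """Poll _cluster/health until status is green or max_wait seconds elapse."""
    deadline = time.time() + max_wait
    while time.time() < deadline:
        health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()
        print(f"status={health['status']} unassigned_shards={health['unassigned_shards']}")
        if health["status"] == "green":
            return True
        time.sleep(poll_interval)
    return False

if __name__ == "__main__":
    wait_for_green()
```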

LinuxBozo commented 7 years ago

@cnelson master, as in a single one? IIRC there is currently a 3-master cluster in e/w. So yeah, between that and the data nodes, we should revisit promptly.
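If it helps confirm the current topology, the _cat/nodes API shows each node's role and which node is the elected master; a quick sketch (the endpoint is a placeholder, not the actual govcloud address):

```python
# Sketch: list node roles and the elected master via the _cat/nodes API.
# The endpoint is a placeholder, not the actual govcloud logsearch address.
import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

resp = requests.get(
    f"{ES_URL}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,master"},
    timeout=10,
)
# One line per node; a "*" in the master column marks the elected master.
print(resp.text)
```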

mogul commented 7 years ago

This issue was about the sizing in particular. I think this is what @rogeruiz is working on this week, right? In which case, this should be In Progress... Moving it there now.

cnelson commented 7 years ago

I put together a bare-bones calculator which may be helpful when thinking about the various bits of data that go into making sizing decisions for Elasticsearch, some of which we can control (index strategy, sharding strategy, instance sizing) and some of which we cannot (retention requirements, volume of data per day).
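The linked calculator has the real numbers; as a rough illustration of the arithmetic involved (every input below is a placeholder example, not a measurement from our cluster):

```python
# Back-of-envelope Elasticsearch sizing sketch. Every input here is a
# placeholder example; the calculator linked above has the real inputs.

daily_volume_gb = 50           # raw log volume indexed per day (placeholder)
index_overhead = 1.1           # indexing overhead multiplier (placeholder)
replicas = 1                   # replica copies per primary shard
retention_days = 30            # retention requirement (not something we control)
target_shard_size_gb = 30      # keep shards well under ~50 GB for recovery speed
usable_disk_per_node_gb = 800  # usable disk per data node after headroom (placeholder)

# Total storage the cluster must hold at steady state.
total_storage_gb = daily_volume_gb * index_overhead * (1 + replicas) * retention_days

# With daily indices, how many primary shards should each index get?
shards_per_index = max(1, round(daily_volume_gb * index_overhead / target_shard_size_gb))

# How many data nodes does that storage require (disk only)?
data_nodes = -(-total_storage_gb // usable_disk_per_node_gb)  # ceiling division

print(f"total storage: {total_storage_gb:.0f} GB")
print(f"primary shards per daily index: {shards_per_index}")
print(f"data nodes needed (disk only): {data_nodes:.0f}")
```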

After playing with this for a bit, I think we should do the following to size our cluster appropriately and fix the stability issues we've been seeing:

cnelson commented 7 years ago

We are moving forward with the reindexing steps described above:

@rogeruiz and @rememberlenny are working on an upstream PR to expose indexing strategy as an option.

@jmcarp and @cnelson are working on determining the fastest way to re-index and starting that process.
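For reference while we time the options: if the cluster version supports it, the server-side _reindex API is one candidate. A minimal sketch (the endpoint and index names are placeholders for illustration):

```python
# Sketch: kick off a server-side reindex from one index to another using the
# _reindex API (available in newer Elasticsearch versions). The endpoint and
# index names are placeholders for illustration.
import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

body = {
    "source": {"index": "logstash-2017.01.01"},  # placeholder source index
    "dest": {"index": "logs-app-2017.01.01"},    # placeholder destination index
}

# wait_for_completion=false returns a task ID so the reindex runs in the background.
resp = requests.post(
    f"{ES_URL}/_reindex",
    params={"wait_for_completion": "false"},
    json=body,
    timeout=30,
)
print(resp.json())  # e.g. {"task": "<node>:<id>"}, which can be checked via the tasks API
```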

cnelson commented 7 years ago

"We have alerts that represent our heuristics for when and how to scale up in the future"

I'm unsure what kind of alerts we need / want for this AC.

If we are concerned about cluster health, I think we have those already: we'll be alerted if CPU/disk/memory goes above thresholds on our data nodes.

If we are concerned about needing to adjust our sharding strategy as our log volume increases over time to maintain search performance, perhaps we should add some timeouts to the queries in check-logs.sh and alert if we don't receive a response within that window?

That would be an indicator that our log volume is growing, and that we need to up the number of shards per index going forward.
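Roughly the shape of what I mean, sketched in Python for clarity (check-logs.sh would do the equivalent with whatever tooling it already uses; the endpoint, index pattern, query, and threshold below are placeholders):

```python
# Sketch of the timed-query idea: run a representative search and alert if it
# doesn't come back within a threshold. The endpoint, index pattern, query, and
# threshold are placeholders, not what check-logs.sh actually runs.
import requests

ES_URL = "http://localhost:9200"   # placeholder endpoint
QUERY_TIMEOUT_SECONDS = 10         # placeholder threshold for "search is getting slow"

def check_search_latency():
    try:
        resp = requests.post(
            f"{ES_URL}/logstash-*/_search",  # placeholder index pattern
            json={"size": 0, "query": {"match_all": {}}},
            timeout=QUERY_TIMEOUT_SECONDS,
        )
        resp.raise_for_status()
        print(f"OK: search completed in {resp.json()['took']} ms")
        return 0
    except requests.exceptions.Timeout:
        print(f"ALERT: search took longer than {QUERY_TIMEOUT_SECONDS}s; "
              "time to revisit shards per index")
        return 2

if __name__ == "__main__":
    raise SystemExit(check_search_latency())
```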

Thoughts?

cnelson commented 7 years ago

After discussing yesterday, we've come up with the following plan for better alerting on when the cluster needs to scale:

Accept this story as-is to make WIP room to start on https://github.com/18F/cg-product/issues/673, which we need to complete before we add any new functionality to logsearch so that we can iterate on it safely without risk of downtime.

Once that's completed, we've decided that we will get the best data on real-world query performance by using New Relic to monitor traffic from actual users, and as a bonus we'd get logsearch response times on Statuspage for free: https://github.com/18F/cg-product/issues/693