cloudfoundry / diego-notes

Diego Notes

Tuning Health Checks #31

Closed emalm closed 8 years ago

emalm commented 8 years ago

I've just added https://github.com/cloudfoundry-incubator/diego-dev-notes/blob/master/proposals/tuning-health-checks.md as an initial proposal of some ideas about how we can tune health-checking to put less load on the system, generate fewer extraneous logs, and be more configurable for end users. Please look through it and use this issue for discussion!

onsi commented 8 years ago

On the whole LGTM. As I was reading the doc, a timeline of work came to mind that seemed to make sense:

  1. Set up a small repro environment to reproduce the evac thrashing issue. I suspect a single cell where we attempt to simultaneously launch and health-check ~40 containers should suffice.
  2. Change the health-check timeout to 2 seconds and evaluate whether or not it helps.
  3. Spike out the NetCheckAction and evaluate whether or not it helps (rough sketch of the idea just below this list).
  4. Implement the NetCheckAction with backwards compatibility.
  5. Improve logging.
  6. Make the health-check timing configurable.
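Re: items 2 and 3, here's a rough sketch of what the native check could boil down to on the cell, assuming the NetCheckAction is essentially a TCP dial against the container's health-check port with the 2-second timeout from item 2 (names, IP, and port here are illustrative, not the actual Diego models or API):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkPort reports whether something is accepting TCP connections on
// host:port within the timeout, without exec'ing a healthcheck process
// inside the container. Illustrative only; not the real executor action.
func checkPort(host string, port int, timeout time.Duration) error {
	addr := net.JoinHostPort(host, fmt.Sprintf("%d", port))
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return fmt.Errorf("health check failed for %s: %v", addr, err)
	}
	return conn.Close()
}

func main() {
	// Hypothetical container IP and port, just to show the call shape.
	if err := checkPort("10.244.16.4", 8080, 2*time.Second); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("healthy")
	}
}
```

The point being that the cell process does the dial itself instead of spinning up a per-check process in the container, which is where I'd expect the load and log savings to come from.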

WDYT, @ematpl?

Also: I think item 6 is not MVP. Probably fine to drop it, especially since actually exposing it will entail plumbing up through CC and the CLI.

emalm commented 8 years ago

Yep, that matches my thinking about the story sequencing. I suspect the existing unhealthy-check duration configurability will help accommodate different instance densities, with the expectation that, at the current status quo, the duration needs to grow roughly linearly with density (constant TBD). That could be an interesting parameter to tune across a range of values once we have a consistent reproduction. Agreed that the per-LRP configurability is not MVP at all, but I had been mulling it over and wanted to capture it somewhere.
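To make the "grows linearly with density" expectation concrete, this is roughly the relationship I have in mind. It's a hypothetical helper, not anything in the rep today, and both constants are placeholders to tune once we have the repro environment:

```go
package main

import (
	"fmt"
	"time"
)

// startTimeout is a hypothetical sizing heuristic: the unhealthy-check
// duration grows linearly with the number of containers a cell may be
// starting at once. Both constants are placeholders to be tuned.
func startTimeout(inFlightStarts int) time.Duration {
	const base = 30 * time.Second    // assumed floor for a lightly loaded cell
	const perStart = 2 * time.Second // assumed marginal cost per in-flight start
	return base + time.Duration(inFlightStarts)*perStart
}

func main() {
	fmt.Println(startTimeout(40)) // ~40 simultaneous starts from the repro -> 1m50s
}
```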

onsi commented 8 years ago

We're going to need an app restart (cf push/cf restart) in order to regenerate the LRPs and get the new behavior we need.

I know it's crossing the streams and ugly but... perhaps we could implement a BBS migration that migrates LRPs in the CF domain that have a process health check that matches the existing port check to use the new netcheck action?
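Very roughly, and with stand-in types because I don't want to pretend this is the real BBS model code, the migration would look for the stock port-check monitor and swap it. The `cf-apps` domain name, the `/tmp/lifecycle/healthcheck` path, and the `-port=` argument matching are the heuristics I'd expect us to use, but treat all of them as assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-ins for the real BBS model types; the actual migration would operate
// on the BBS's own DesiredLRP records, not on these hypothetical structs.
type RunSpec struct {
	Path string
	Args []string
}

type NetCheck struct {
	Port uint32
}

type LRP struct {
	Domain     string    // assumed CF app domain name is "cf-apps"
	MonitorRun *RunSpec  // current process-based health check, if any
	MonitorNet *NetCheck // proposed native port check
	Port       uint32
}

// migrateMonitor swaps the stock port-check process for the native check on
// CF-domain LRPs whose monitor looks exactly like the default healthcheck
// invocation. Anything custom is left alone.
func migrateMonitor(lrp *LRP) {
	if lrp.Domain != "cf-apps" || lrp.MonitorRun == nil {
		return
	}
	isStockCheck := strings.HasSuffix(lrp.MonitorRun.Path, "/healthcheck") &&
		len(lrp.MonitorRun.Args) > 0 &&
		strings.HasPrefix(lrp.MonitorRun.Args[0], "-port=")
	if isStockCheck {
		lrp.MonitorRun = nil
		lrp.MonitorNet = &NetCheck{Port: lrp.Port}
	}
}

func main() {
	lrp := &LRP{
		Domain:     "cf-apps",
		Port:       8080,
		MonitorRun: &RunSpec{Path: "/tmp/lifecycle/healthcheck", Args: []string{"-port=8080"}},
	}
	migrateMonitor(lrp)
	fmt.Printf("run=%v net=%v\n", lrp.MonitorRun, lrp.MonitorNet)
}
```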

emalm commented 8 years ago

Oof, that's crossing a lot of boundaries. This issue exposes a deficiency in CC's Diego backend (including nsync): because the LRP process-guid corresponds exactly to the CF app's guid and version, there's no way for the app's LRP to change without recreating the LRP in Diego.

If they were decoupled, the backend could migrate the LRP definition transparently without end-user or even CC intervention, potentially even in a blue-green-deploy manner (scale up and down gradually to ensure no more than 1 extra instance during the transition). This is potentially beneficial for many other similar situations that may arise, such as changing internal CC endpoints for binary resources and versions of the lifecycle binaries, as well as other changes to the Diego models used to construct LRPs.

Of course, that decoupling comes at a cost: for CC to address the LRPs corresponding to an app, it either needs to record the actual process-guids for the LRPs itself, or provide additional metadata to Diego to retrieve them later (although to do so efficiently, Diego would likely need to build that into its relational model).
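For reference, the coupling I'm describing is roughly the following. This is a hypothetical helper; as I understand it, the real derivation lives in the CC/nsync recipe-building path:

```go
package main

import "fmt"

// processGuid sketches the coupling described above: the LRP's identity is a
// pure function of the CF app guid and version. Because of that, Diego can't
// mint a replacement LRP under a fresh guid without CC losing track of the
// app's instances.
func processGuid(appGuid, appVersion string) string {
	return appGuid + "-" + appVersion
}

func main() {
	// Made-up guids, just to show the shape of the derived process-guid.
	fmt.Println(processGuid("0f4d6c2e-app-guid", "9b1e7a3c-app-version"))
}
```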

If an operator really needs this issue corrected for all existing CF-app LRPs on the system, they could ask end users to do that restart themselves, or they could incrementally delete LRPs from Diego itself and wait for the nsync-bulker to repopulate them (effectively doing that restart on the users' behalf, perhaps timed right before a bulk cycle to minimize app downtime).

emalm commented 8 years ago

Update: we spiked out the native NetCheckAction, and it didn't seem to have a significant effect. We'll keep this in mind for future work, but it's not likely to be a priority. We've instead found that spreading out in-flight container starts more evenly has a much more significant effect on evacuation performance, and in general makes it harder to bog an individual cell down with new work in any sizable deployment. Closing this out.
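For anyone landing on this issue later: "spreading out in-flight container starts" essentially means capping or staggering how many container creates a cell runs at once, so an evacuation can't dump them all on a cell simultaneously. A minimal sketch of that idea, with made-up names and limits rather than the actual rep code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// startContainers caps in-flight container starts with a simple semaphore so
// a burst of placements can't slam a cell with dozens of simultaneous
// creates. The limit and stagger interval are illustrative only.
func startContainers(guids []string, maxInFlight int, stagger time.Duration) {
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for _, guid := range guids {
		sem <- struct{}{} // blocks once maxInFlight starts are in progress
		wg.Add(1)
		go func(g string) {
			defer func() { <-sem; wg.Done() }()
			fmt.Println("starting container for", g)
			time.Sleep(stagger) // placeholder for the actual create + health check
		}(guid)
	}
	wg.Wait()
}

func main() {
	guids := []string{"lrp-1", "lrp-2", "lrp-3", "lrp-4"}
	startContainers(guids, 2, 100*time.Millisecond)
}
```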