cloudfoundry / cf-deployment

The canonical open source deployment manifest for Cloud Foundry

Resiliency to a single degraded Availability Zone #939

Open 46bit opened 3 years ago

46bit commented 3 years ago

This issue isn't about a bug in cf-deployment; it's to discuss how to handle a common incident that CF operators face.

What is this issue about?

Several Cloud Foundry users have had outages when a single Availability Zone experiences a partial failure. Incidents like degraded networking are far more common in the Cloud than a complete outage.

Cloud Foundry is engineered to run in multiple AZs, but not to handle a single degraded AZ. When an AZ is degraded rather than fully down, Cloud Foundry will keep directing new app instances and new web requests into it. Those requests will be slow or fail. This makes Cloud Foundry partly down for its users, and right now there are few good options for responding.

What can CF operators do right now?

Neither of the options we're aware of is very good.

Very slow: you can edit the CF manifest and run a new BOSH deploy that places no VMs in the affected AZ. This is far too slow, as the BOSH deploy could take an entire day on the largest CF platforms.

Slow/manual: you can manually block all network traffic into the degraded Availability Zone, for instance using firewall rules. This is the approach being used by GOV.UK PaaS. It has the advantage of being very simple, but it's not automated or fast enough for SAP's needs.

What do you propose?

At SAP, we think the best solution is for each VM to monitor its own health. For instance, an operator could configure a list of network checks; if too many of the checks fail, the VM would drain itself and kill the BOSH agent. This could also be part of Diego, and trigger a call to Rep's evacuate endpoint.

This solution can cope with more than just degraded AZs, as it would also drain individual degraded servers (e.g. failing racks).

Badly chosen checks could make cells drain themselves unnecessarily and lead to CF downtime. This would probably be an optional feature, so the CF operator would be able to choose good network resources to check (e.g. a combination of the CF API, S3, etc.).
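
To make the shape of the proposal concrete, here is a minimal sketch of such a per-cell checker. It is not the actual SAP implementation: the endpoints, thresholds, and drain action are all illustrative assumptions.

```go
// Illustrative sketch only, not the actual SAP "Runtime Evacuation" release:
// the endpoints, thresholds, and drain hook below are all assumptions.
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

// Endpoints an operator might configure: a mix of resources the cell
// must be able to reach to be useful (CF API, blobstore, ...).
var checks = []string{
	"https://api.cf.example.internal/v2/info", // hypothetical CF API endpoint
	"https://s3.eu-west-1.amazonaws.com",      // hypothetical blobstore endpoint
}

const (
	failuresPerRound  = 2                // failed checks that mark a round as bad
	consecutiveRounds = 3                // bad rounds in a row before acting
	interval          = 30 * time.Second // time between rounds
)

// roundFailed reports whether enough checks failed in a single round.
func roundFailed(client *http.Client) bool {
	failed := 0
	for _, url := range checks {
		resp, err := client.Get(url)
		if err != nil {
			failed++
			continue
		}
		resp.Body.Close()
	}
	return failed >= failuresPerRound
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	badRounds := 0
	for range time.Tick(interval) {
		if roundFailed(client) {
			badRounds++
		} else {
			badRounds = 0
		}
		if badRounds >= consecutiveRounds {
			log.Println("network looks degraded; draining this cell")
			// Placeholder for the real action, e.g. calling Rep's evacuation
			// endpoint or stopping the BOSH agent. The script path is made up.
			if err := exec.Command("/var/vcap/jobs/evacuation-agent/bin/drain").Run(); err != nil {
				log.Printf("drain failed: %v", err)
			}
			return
		}
	}
}
```

The hard part is tuning the checks and thresholds so that a genuinely degraded cell drains quickly, while a platform-wide blip (e.g. the CF API itself being briefly unavailable) can't make every cell drain at once.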

Tag your pair, your PM, and/or team!

Working on this with @h0nlg at SAP. Briefly talked about this with @rkoster and @AP-Hunt.

rkoster commented 3 years ago

Would like to add that we have seen customer cases where similar failures were caused by degraded disk performance.

risicle commented 3 years ago

We were looking into this area a month or so ago, and I desperately wanted to find a way of observing CF's already-existing healthchecks, either through monitoring or logs, instead of adding yet another set of canaries for this. I was hopeful when I spotted the detected-missing-cells logs, but of course they don't include AZ information, and when I looked into whether it was possible to add that, I realized it wouldn't be that easy. From my own notes:

Once the BBS detects a cell as potentially missing, the zone information is already gone - it came from the entry in Locket. The table it has of ActualLRPs doesn't keep zone information.

So we did end up deploying another set of healthcheck canaries for this purpose.

AP-Hunt commented 3 years ago

As @46bit says, the GOV.UK PaaS approach is to block all traffic to the AZ (we're in AWS, so it's done with a network ACL). In testing, this appeared to work sufficiently well, and we saw Cloud Foundry correctly redistribute tenant applications running in the affected AZ onto cells in the unaffected AZs.
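
For anyone curious what "block all traffic" amounts to in practice: it boils down to inserting low-numbered deny rules into the network ACL covering that AZ's subnets. A rough sketch with the AWS SDK for Go v2 follows; the ACL ID is a placeholder and this is not our actual tooling.

```go
// Rough sketch of isolating one AZ by adding deny-all entries to the
// network ACL covering that AZ's subnets. The ACL ID is a placeholder;
// this is not the actual GOV.UK PaaS automation.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// One ingress and one egress deny rule, numbered below all existing
	// rules so they take precedence. Ingress and egress are numbered
	// independently, so rule 1 is free for both.
	for _, egress := range []bool{false, true} {
		_, err := client.CreateNetworkAclEntry(ctx, &ec2.CreateNetworkAclEntryInput{
			NetworkAclId: aws.String("acl-0123456789abcdef0"), // placeholder ACL for the degraded AZ
			RuleNumber:   aws.Int32(1),
			Protocol:     aws.String("-1"), // all protocols
			RuleAction:   types.RuleActionDeny,
			Egress:       aws.Bool(egress),
			CidrBlock:    aws.String("0.0.0.0/0"),
		})
		if err != nil {
			log.Fatalf("failed to add deny rule (egress=%v): %v", egress, err)
		}
	}
	log.Println("AZ isolated; remove the two rules to restore traffic once it recovers")
}
```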

We've also gone one step further and automated the process of removing the AZ in Bosh via our pipelines (specifically, we apply an ops file which removes the AZ from every instance group). The goal here is to restore the level of capacity we had before the AZ outage by spreading it over the remaining AZs, so that we don't run into resource contention problems when 100% of the platform load is placed on 66% of the capacity.

We've identified a couple of problems so far, but we think we're not really in a position to solve them:

If I could wave a magic wand today and get a solution instantly, I think I'd like it to lie with Bosh. It would be a very nice capability for it to:

  1. Aggregate health information over an AZ similar to the way @46bit is proposing
  2. Raise an alarm via a metric if an AZ appears to be degraded, so that an operator can make a decision
  3. Have an API method for temporarily overriding the AZs defined in the manifest, and have Bosh immediately redistribute the VMs that were in the removed AZ over the remaining AZs

I say I think it should lie with Bosh because I think it'd be nice to have these capabilities for non-cf-deployment stuff like the application autoscaler too.

h0nIg commented 3 years ago

@AP-Hunt we have seen a lot of cases where the EC2 API was overloaded or unreachable, and AWS does not guarantee any kind of free capacity during such a large-scale event. Therefore I would conclude that you cannot rely 100% on respawning BOSH VMs. Instead, you should overprovision before the incident happens and simply evacuate the workload to the remaining AZs during the degradation. If the AZ is healthy again according to the local health check, the Rep process (or a process next to Rep) can let the Diego cell receive workload again. GoRouters should be covered by the hyperscaler load balancer's health checks: if they are slow, remove the GoRouter from load balancing.

46bit commented 3 years ago

Not to worry—they maintain roughly the right amount of spare capacity.

jpalermo commented 3 years ago

If something like this existed, it feels like it should live in the HealthMonitor. Agents already send alerts to the HealthMonitor so this feels like it would fit well alongside that existing functionality.

The HealthMonitor also already has a plugin interface. You can co-locate a job that includes a bosh-monitor binary, and the HealthMonitor will call that job with JSON on stdin that contains "something". I've never really looked at it, but it may be useful here (or maybe not).
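
Purely to sketch the shape such a plugin could take (I haven't checked what that JSON actually contains, so every field name below is a guess), it would be something along these lines:

```go
// Sketch of a co-located HealthMonitor plugin binary. The real payload the
// HealthMonitor writes to stdin hasn't been verified here; every field name
// below is hypothetical and only illustrates the shape such a plugin could take.
package main

import (
	"encoding/json"
	"log"
	"os"
)

// alert is a hypothetical shape for what the HealthMonitor might pass.
type alert struct {
	Deployment string `json:"deployment"`
	Job        string `json:"job"`
	AZ         string `json:"az"`
	Title      string `json:"title"`
	Severity   int    `json:"severity"`
}

func main() {
	var a alert
	if err := json.NewDecoder(os.Stdin).Decode(&a); err != nil {
		log.Fatalf("could not parse alert: %v", err)
	}
	// A real plugin would aggregate alerts per AZ over time and only then
	// raise an alarm or trigger an operator-defined action.
	if a.Severity >= 3 && a.AZ != "" {
		log.Printf("alert in AZ %s (%s/%s): %s", a.AZ, a.Deployment, a.Job, a.Title)
	}
}
```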

One of the reasons something like this hasn't been built in the past is it's likely not to solve most of the problems, and is certainly not a silver bullet for partially degraded AZs.

The whole reason the "meltdown" trigger exists in the HealthMonitor is that IaaSes typically aren't happy about being asked to do things when they're already in a broken state. The idea of ignoring the IaaS and just draining the jobs does work around a good chunk of those problems. It would be pretty simple for the HealthMonitor to just bosh stop the instances, which should trigger all the normal BOSH lifecycle events.

Actually detecting problems is sort of a nightmare though. We're looking for problems that the Agent can accurately detect from inside the VM, but that wouldn't prevent the HealthMonitor and the VM from communicating. I don't think there is a one size fits all solution for every use case, so we'd need some way to have a runtime config with a job that the agent knows how to call to ask it to check the health maybe? Or some other mechanism for configuring the agent so it knows who to ask for health info...

I'm a bit skeptical of the idea of automatically rebalancing the workloads onto the remaining AZs. The main reason to use AZs is for HA. But if you need the full capacity of all of your AZs to be able to maintain your workloads, you're not really HA. If you are using AZs for HA, your system should be able to work fine with one of the AZs totally dead.

AP-Hunt commented 3 years ago

I think you're totally right about there not being a one size fits all solution. That's why I'd vote for Bosh being able to raise an alarm or change a metric value. That would allow operators to respond appropriately for their situation (be that automatically, or with some manual intervention, or a mix of the two).

I also wouldn't vote for automatic re-balancing, because that isn't right for everyone either. It would work for GOV.UK PaaS because we run enough capacity in 2 AZs to cover a missing AZ, but it's a better experience overall (e.g. lower demand on each cell) if we're able to run 100% capacity in those two AZs while the third recovers.

I personally wouldn't be fussed if the implementation of rebalancing existed within Bosh, or if it was up to operators to change their manifests to remove the AZ.

Benjamintf1 commented 3 years ago

I know one thing that has been tossed around is the idea of a full HTTP/readiness check, rather than only the local healthiness checks on the BOSH VMs. If such a (large, to be honest) change were implemented, we would get a lot of interesting outcomes. One of those outcomes would be putting individual app instances behind some level of IaaS-level network healthiness check. This would of course not reschedule instances to other AZs, solve the problem with the AZ itself, or reschedule Diego cells, but in the case of an AZ network failure it would probably start to remove app instances in that AZ from routing tables across the board. (That said, I think you might also see problems where your load balancer still gets served to and then redirects to other AZs, perhaps while still experiencing network difficulties.)

Either way, I think this problem is actually quite complex and intersects with other areas. There's probably a great variety of techniques that can be applied, and I think we need some upper-level (perhaps working group or higher) set of understandings or plans in order to approach some of this "correctly" or "completely", or even "to the satisfaction of this particular problem". Specifically, we'd do well to rethink what "high availability" means, which steps are expected to be manual and which automatic, and within what parameters those automatic steps should act.

46bit commented 2 years ago

Thanks for your feedback everyone.

It sounds like BOSH could be evolved to natively support solving problems like this, or even have it added as a plugin. That would be quite neat, but SAP doesn't have a highly available BOSH. Neither does GOV.UK PaaS. As I understand it, HA BOSH isn't widely used by anyone. That makes it quite a bad place to solve issues like this: there's a 1/N chance that BOSH itself is affected.

At SAP we've been working on an agent and BOSH release named Runtime Evacuation. It'll be deployed on each Diego cell, monitor network performance, and drain Rep if the network appears to be badly compromised. The critical challenge is to avoid creating new issues (e.g. all the cells deciding to switch off at once), so to start with we're going to disable it from taking action and just monitor the data for a while.

Hopefully this can be open sourced; I think we're looking into it.

I'm a bit skeptical of the idea of automatically rebalancing the workloads onto the remaining AZs. The main reason to use AZs is for HA. But if you need the full capacity of all of your AZs to be able to maintain your workloads, you're not really HA. If you are using AZs for HA, your system should be able to work fine with one of the AZs totally dead.

A totally dead AZ is easy to deal with, but it's also very rare. Much more common are degraded AZs, with slower responses and higher error rates. Those are a bit of a nightmare: CF won't route traffic away from an AZ unless it's completely dead. Both SAP and GOV.UK PaaS have had situations where roughly a third of traffic was having major problems even though 2 of 3 AZs were perfectly healthy.

jpalermo commented 2 years ago

A totally dead AZ is easy to deal with, but it's also very rare. Much more common are degraded AZs, with slower responses and higher error rates. Those are a bit of a nightmare: CF won't route traffic away from an AZ unless it's completely dead. Both SAP and GOV.UK PaaS have had situations where roughly a third of traffic was having major problems even though 2 of 3 AZs were perfectly healthy.

I've seen that 1/3rd degraded traffic before too and we've yet to find a spot that feels good to build a solution into. There's always a tricky mix of "It can solve this particular problem, but will actually make this other problem worse".

So something that's able to solve even some of those degraded-AZ problems, while not doing harm in other situations, would be amazing.