Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API

Design: Manual control for failover mechanics #3488

Open ealsur opened 1 year ago

ealsur commented 1 year ago

Customer-related issues: https://github.com/Azure/azure-cosmos-dotnet-v3/issues/2811

Problem statement

Availability and latency are often a tradeoff. Sometimes, in order to achieve higher availability, we need to fail the operation over to another region, which increases the end-to-end latency. On other occasions, we might prefer to fail fast and take the availability hit, but be able to react faster (maybe decide to route traffic to another application region altogether).

There are cases where failing over to another region is the only alternative and there is no other choice, as per https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-sdk-availability; such cases (like regions being removed or server-side failovers) are handled by the SDKs.

There are, however, some gray areas where SDKs can make best-effort decisions, but the logic behind that best effort might not be universal or apply to all customers.

One such area is Timeouts.

We know timeouts can happen due to an array of possible causes:

  1. Client-side resource exhaustion
  2. Network infrastructure
  3. Transient Cosmos DB backend health being impacted

From the SDK perspective, we only see point 1 (and we can actually evaluate only a subset of it, namely CPU; we cannot evaluate things like SNAT port exhaustion). We cannot know whether there are network issues or whether there is a transient health problem on the endpoint.

In a distributed system, we know timeouts can and will happen (https://docs.microsoft.com/en-us/azure/cosmos-db/sql/conceptual-resilient-sdk-applications#timeouts-and-connectivity-related-failures-http-408503), and our SDKs retry under the assumption that these timeouts are transient in nature. We also know that as long as the volume of these timeouts does not affect P99, they are within the bounds of what is expected, but that doesn't mean we cannot react and provide higher availability when possible.
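
For illustration, this is roughly what surfacing such a timeout looks like at the application layer today: once the SDK's own retries are exhausted, the operation fails with a 408 or 503 that the application can observe. This is only a sketch; MyItem and the read call are placeholders, not part of this proposal.

using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public record MyItem(string id); // placeholder document type

public static class TimeoutObservation
{
    // Sketch: observe the 408/503 results that surface after the SDK's own retries.
    public static async Task<MyItem> ReadOrThrowAsync(Container container, string id, string partitionKey)
    {
        try
        {
            ItemResponse<MyItem> response = await container.ReadItemAsync<MyItem>(id, new PartitionKey(partitionKey));
            return response.Resource;
        }
        catch (CosmosException ex) when (
            ex.StatusCode == HttpStatusCode.RequestTimeout ||
            ex.StatusCode == HttpStatusCode.ServiceUnavailable)
        {
            // Application-level decision point: retry, fall back, or surface the error.
            // The diagnostics show which endpoints/regions were attempted.
            Console.WriteLine(ex.Diagnostics.ToString());
            throw;
        }
    }
}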

Existing SDK mechanics

(The following assumes https://github.com/Azure/azure-cosmos-dotnet-v3/issues/3467 is completed to align the HTTP and TCP connectivity stacks.)

When a timeout occurs, the SDK will retry the operation on the current endpoint/region if the operation is retryable (see the gap below), and if local retries are exhausted, it might attempt a cross-region retry if the configuration allows it. Assuming the cross-region retry succeeds, this provides availability at the cost of latency. The region is not marked as unavailable, on the assumption that these timeouts are transient, so a subsequent request will target the original region that experienced the timeout and potentially succeed (or time out too, depending on the nature of the cause).

Users can opt out of cross-region retries if they desire; it is then their responsibility to handle all failures in the application layer.
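
For reference, a sketch of how those knobs look today on CosmosClientOptions (endpoint, key, and region names are placeholders): ApplicationPreferredRegions drives the cross-region fallback order, and LimitToEndpoint opts the client out of any cross-region behavior.

using Microsoft.Azure.Cosmos;

// Sketch: cross-region retries follow the preferred-region order configured on the client.
CosmosClient clientWithFallback = new CosmosClient(
    "https://<account>.documents.azure.com:443/",
    "<auth-key>",
    new CosmosClientOptions
    {
        ApplicationPreferredRegions = new[] { Regions.WestUS2, Regions.EastUS2 }
    });

// Sketch: opting out entirely; the client only talks to the provided endpoint and the
// application layer becomes responsible for handling all failures itself.
CosmosClient clientNoFallback = new CosmosClient(
    "https://<account>.documents.azure.com:443/",
    "<auth-key>",
    new CosmosClientOptions
    {
        LimitToEndpoint = true
    });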

The region is not marked as unavailable due to the unclear nature of timeouts. Timeouts are simply not a deterministic signal, compared to any of the known cases.

When a region is marked as unavailable, it remains unavailable for 5 minutes (assuming the region is still present in the account), after which the SDK will attempt to reach it again; if it fails with any of the known errors, the SDK will mark it as unavailable once more.

Where is the gap?

There are certain operations that are not retried across regions, among the critical are:

  1. Write timeouts on Single Master accounts
  2. Address resolution timeouts

For case 1, because there are no other write regions, there is no recovery if all write operations are timing out; even if the region were marked as unavailable, as long as the account information says the write region is the same, the SDK can only route to that endpoint.

For case 2, however, if fetching addresses from a region consistently fails with timeouts, marking that region as unavailable could allow the SDK to route requests to another region (assuming there is one). The SDK would be able to fetch the addresses from another region and continue there.

The challenge for case 2, though, is that applying a heuristic in the SDK such as "if we detect X timeouts across Y minutes, then mark the region as unavailable" does not work. Each customer workload is unique: a rule might work well for some customers but not for others, false positives can be just as impactful, and a heuristic that does not fit the customer workload simply won't help. For example, a rule of 10K timeouts every 1 minute will never kick in for a customer performing 1 request per minute where that single request is very important, and the same rule for a customer doing 300M requests every minute does not even represent a 1% failure rate.

Proposed solution strategy

What are the signals that customers can perceive?

Using these signals and their own telemetry, customers could define their own heuristics that apply to their business case ("if we see X timeouts in Y minutes") and decide to redirect the SDK traffic to the next available region as a temporary mitigation.

client.MarkCurrentRegionUnavailable();
// or defining for how long the region should be avoided
client.MarkCurrentRegionUnavailable(TimeSpan.FromMinutes(5)); // example duration
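
A minimal sketch of how this could be consumed, assuming the MarkCurrentRegionUnavailable overloads proposed above existed (they do not today), with placeholder threshold and window values that each workload would tune:

using System;
using Microsoft.Azure.Cosmos;

// Sketch of a customer-owned heuristic ("X timeouts in Y minutes") driving the proposed API.
// MarkCurrentRegionUnavailable is the surface proposed above, not an existing SDK method.
public class RegionTimeoutCircuit
{
    private readonly CosmosClient client;
    private readonly int timeoutThreshold;   // placeholder: tuned per workload
    private readonly TimeSpan window;        // placeholder: tuned per workload
    private int timeoutCount;
    private DateTime windowStart = DateTime.UtcNow;

    public RegionTimeoutCircuit(CosmosClient client, int timeoutThreshold, TimeSpan window)
    {
        this.client = client;
        this.timeoutThreshold = timeoutThreshold;
        this.window = window;
    }

    // Called by the application whenever its own telemetry observes a timeout (408/503).
    public void RecordTimeout()
    {
        DateTime now = DateTime.UtcNow;
        if (now - this.windowStart > this.window)
        {
            this.windowStart = now;
            this.timeoutCount = 0;
        }

        if (++this.timeoutCount >= this.timeoutThreshold)
        {
            // Temporary mitigation: steer traffic to the next available region for a bounded time.
            this.client.MarkCurrentRegionUnavailable(TimeSpan.FromMinutes(5));
            this.timeoutCount = 0;
        }
    }
}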

When providing this type of "power" it is key:

kanshiG commented 1 year ago

Super thanks for creating this concept. Let us consider 3-4 regions and our burden of consistency (with strong consistency we can push back: you have chosen consistency over availability), where you should explicitly take down the impacted region. We should also consider customers doing forced failovers by themselves versus in the code, or customers moving the traffic out of the impacted region when the region is offlined.

FabianMeiswinkel commented 1 year ago

+1, we have an ask from external customers for the Java SDK to provide guidance and a public API to build a health state model that can be used in Kubernetes to identify the health of the SDK as well. It is not exactly the same, because the purpose of that ask is more to identify bad client-side nodes, but it is a similar problem: to build the health model you would need to put different statistics in perspective. For example, if pod A is seeing high CPU, is the right mitigation to take it out on the assumption that it is in a bad health state, or are all other pods in a similar situation, in which case taking pods out would harm rather than help? Of course, no SDK can build these health models itself, but we should think about what information to provide in the public surface area and create samples/best practices on how it could be used to build such health models (in addition to some trigger to allow forcing failovers, of course).
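
A rough sketch of how such a health signal could look on the application side today, assuming the application maintains its own counters (the SDK does not expose a health model yet, and the threshold below is a placeholder), using the standard ASP.NET Core IHealthCheck abstraction:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Rough sketch only: a Kubernetes-facing health check fed by counters the application
// maintains itself (for example, incremented when it observes 408/503 from the SDK).
public class CosmosTimeoutHealthCheck : IHealthCheck
{
    private long recentTimeouts;

    public void RecordTimeout() => Interlocked.Increment(ref this.recentTimeouts);

    public void ResetWindow() => Interlocked.Exchange(ref this.recentTimeouts, 0);

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        long observed = Interlocked.Read(ref this.recentTimeouts);
        return Task.FromResult(
            observed >= 100 // placeholder threshold; a real model would compare against the fleet
                ? HealthCheckResult.Degraded($"{observed} Cosmos DB timeouts in the current window")
                : HealthCheckResult.Healthy());
    }
}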