Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API

Design: Manual control for failover mechanics #3488

Open ealsur opened 1 year ago

ealsur commented 1 year ago

Customer-related issues: https://github.com/Azure/azure-cosmos-dotnet-v3/issues/2811

Problem statement

Availability and latency are often a tradeoff. Sometimes, in order to achieve higher availability, we need to fail the operation over to another region, which increases the end-to-end latency. On other occasions, we might prefer to fail fast and take the availability hit, but be able to react faster (maybe decide to route traffic to another application region altogether).

There are cases where failing over to another region is the only alternative and there is no other choice, as per https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-sdk-availability; such cases (like regions being removed or server-side failovers) are handled by the SDKs.

There are, however, some gray areas where SDKs can make best-effort decisions, but the logic behind that best effort might not be universal or apply to all customers.

One such area is Timeouts.

We know timeouts can happen due to an array of possible causes:

  1. Client-side resource exhaustion
  2. Network infrastructure
  3. Transient Cosmos DB backend health being impacted

From the SDK perspective, we only see point 1 (and we can actually evaluate only a subset of it, namely CPU; we cannot evaluate things like SNAT port exhaustion). We cannot know whether there are network issues or whether there is a transient health problem on the endpoint.

In a distributed system, we know timeouts can and will happen (https://docs.microsoft.com/en-us/azure/cosmos-db/sql/conceptual-resilient-sdk-applications#timeouts-and-connectivity-related-failures-http-408503), and our SDKs retry under the assumption that these timeouts are transient in nature. We also know that as long as the volume of these timeouts does not affect P99, they are within the bounds of what is expected, but that doesn't mean we cannot react and provide higher availability when possible.
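
For illustration, this is roughly what surfacing such a timeout looks like at the application layer today: once the SDK's own retries are exhausted, the operation fails with a 408 or 503 that the application can observe. This is only a sketch; MyItem and the read call are placeholders, not part of this proposal.

using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public record MyItem(string id); // placeholder document type

public static class TimeoutObservation
{
    // Sketch: observe the 408/503 results that surface after the SDK's own retries.
    public static async Task<MyItem> ReadOrThrowAsync(Container container, string id, string partitionKey)
    {
        try
        {
            ItemResponse<MyItem> response = await container.ReadItemAsync<MyItem>(id, new PartitionKey(partitionKey));
            return response.Resource;
        }
        catch (CosmosException ex) when (
            ex.StatusCode == HttpStatusCode.RequestTimeout ||
            ex.StatusCode == HttpStatusCode.ServiceUnavailable)
        {
            // Application-level decision point: retry, fall back, or surface the error.
            // The diagnostics show which endpoints/regions were attempted.
            Console.WriteLine(ex.Diagnostics.ToString());
            throw;
        }
    }
}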

Existing SDK mechanics

(The following assumes https://github.com/Azure/azure-cosmos-dotnet-v3/issues/3467 is completed to align the HTTP and TCP connectivity stacks.)

When a timeout occurs, the SDK will retry the operation on the current endpoint/region if the operation is retryable (see the gap below), and if local retries are exhausted, it might attempt a cross-region retry if the configuration allows it. Assuming the cross-region retry succeeds, this provides availability at the cost of latency. The region is not marked as unavailable, on the assumption that these timeouts are transient, so a subsequent request will target the original region that experienced the timeout and potentially succeed (or time out too, depending on the nature of the cause).

Users can opt out of cross-region retries if they desire; it is then their responsibility to handle all failures in the application layer.
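
For reference, a sketch of how those knobs look today on CosmosClientOptions (endpoint, key, and region names are placeholders): ApplicationPreferredRegions drives the cross-region fallback order, and LimitToEndpoint opts the client out of any cross-region behavior.

using Microsoft.Azure.Cosmos;

// Sketch: cross-region retries follow the preferred-region order configured on the client.
CosmosClient clientWithFallback = new CosmosClient(
    "https://<account>.documents.azure.com:443/",
    "<auth-key>",
    new CosmosClientOptions
    {
        ApplicationPreferredRegions = new[] { Regions.WestUS2, Regions.EastUS2 }
    });

// Sketch: opting out entirely; the client only talks to the provided endpoint and the
// application layer becomes responsible for handling all failures itself.
CosmosClient clientNoFallback = new CosmosClient(
    "https://<account>.documents.azure.com:443/",
    "<auth-key>",
    new CosmosClientOptions
    {
        LimitToEndpoint = true
    });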

The region is not marked as unavailable due to the unclear nature of timeouts. Timeouts are simply not a deterministic signal, compared to any of the known cases.

When a region is marked as unavailable, it remains unavailable for 5 minutes (assuming the region is still present in the account), after which the SDK will attempt to reach it again; if it fails with any of the known errors, the SDK will mark it as unavailable once more.

Where is the gap?

There are certain operations that are not retried across regions, among the critical are:

  1. Write timeouts on Single Master accounts
  2. Address resolution timeouts

For case 1, because there are no other write regions, there is no recovery if all write operations are timing out; even if the region were marked as unavailable, as long as the account information says the write region is the same, the SDK can only route to that endpoint.

For case 2, however, if fetching addresses from a region consistently fails with timeouts, marking that region as unavailable could allow the SDK to route requests to another region (assuming there is one). The SDK would be able to fetch the addresses from another region and continue there.

The challenge for case 2, though, is that applying a heuristic in the SDK such as "if we detect X timeouts across Y minutes, then mark the region as unavailable" does not work. Each customer workload is unique: a rule might work well for some customers but not for others, false positives can be just as impactful, and a heuristic that does not fit the customer workload simply won't help. For example, a rule of 10K timeouts every 1 minute will never kick in for a customer performing 1 request per minute where that single request is very important, and the same rule for a customer doing 300M requests every minute does not even represent a 1% failure rate.

Proposed solution strategy

What are the signals that customers can perceive?

Using these signals and their own telemetry, customers could define their own heuristics that apply to their business case ("if we see X timeouts in Y minutes") and decide to redirect the SDK traffic to the next available region as a temporary mitigation.

client.MarkCurrentRegionUnavailable();
// or defining for how long the region should be avoided
client.MarkCurrentRegionUnavailable(TimeSpan.FromMinutes(5)); // example duration
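
A minimal sketch of how this could be consumed, assuming the MarkCurrentRegionUnavailable overloads proposed above existed (they do not today), with placeholder threshold and window values that each workload would tune:

using System;
using Microsoft.Azure.Cosmos;

// Sketch of a customer-owned heuristic ("X timeouts in Y minutes") driving the proposed API.
// MarkCurrentRegionUnavailable is the surface proposed above, not an existing SDK method.
public class RegionTimeoutCircuit
{
    private readonly CosmosClient client;
    private readonly int timeoutThreshold;   // placeholder: tuned per workload
    private readonly TimeSpan window;        // placeholder: tuned per workload
    private int timeoutCount;
    private DateTime windowStart = DateTime.UtcNow;

    public RegionTimeoutCircuit(CosmosClient client, int timeoutThreshold, TimeSpan window)
    {
        this.client = client;
        this.timeoutThreshold = timeoutThreshold;
        this.window = window;
    }

    // Called by the application whenever its own telemetry observes a timeout (408/503).
    public void RecordTimeout()
    {
        DateTime now = DateTime.UtcNow;
        if (now - this.windowStart > this.window)
        {
            this.windowStart = now;
            this.timeoutCount = 0;
        }

        if (++this.timeoutCount >= this.timeoutThreshold)
        {
            // Temporary mitigation: steer traffic to the next available region for a bounded time.
            this.client.MarkCurrentRegionUnavailable(TimeSpan.FromMinutes(5));
            this.timeoutCount = 0;
        }
    }
}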

When providing this type of "power" it is key:

kanshiG commented 1 year ago

Super thanks for creating this concept. Let us consider 3-4 regions and our burden of consistency (with strong consistency we can push back: you have chosen consistency over availability), where you should explicitly take down the impacted region. We should also consider customers doing forced failovers by themselves versus in the code, or customers moving the traffic out of the impacted region when the region is offlined.

FabianMeiswinkel commented 1 year ago

+1, we have an ask from external customers for the Java SDK to provide guidance and a public API to build a health state model that can be used in Kubernetes to identify the health of the SDK as well. It is not exactly the same, because the purpose of that ask is more to identify bad client-side nodes, but it is a similar problem: to build the health model you would need to put different statistics in perspective. For example, if pod A is seeing high CPU, is the right mitigation to take it out on the assumption that it is in a bad health state, or are all other pods in a similar situation, in which case taking pods out would harm rather than help? Of course, no SDK can build these health models itself, but we should think about what information to provide in the public surface area and create samples/best practices on how it could be used to build such health models (in addition to some trigger to allow forcing failovers, of course).
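
A rough sketch of how such a health signal could look on the application side today, assuming the application maintains its own counters (the SDK does not expose a health model yet, and the threshold below is a placeholder), using the standard ASP.NET Core IHealthCheck abstraction:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Rough sketch only: a Kubernetes-facing health check fed by counters the application
// maintains itself (for example, incremented when it observes 408/503 from the SDK).
public class CosmosTimeoutHealthCheck : IHealthCheck
{
    private long recentTimeouts;

    public void RecordTimeout() => Interlocked.Increment(ref this.recentTimeouts);

    public void ResetWindow() => Interlocked.Exchange(ref this.recentTimeouts, 0);

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        long observed = Interlocked.Read(ref this.recentTimeouts);
        return Task.FromResult(
            observed >= 100 // placeholder threshold; a real model would compare against the fleet
                ? HealthCheckResult.Degraded($"{observed} Cosmos DB timeouts in the current window")
                : HealthCheckResult.Healthy());
    }
}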