Scope of work section
Suggestion:
The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to achieve higher availability for strong consistency by implementing per-partition automatic failover for single-region write accounts. The premise is that if communication to a partition, either server or master, meets the criteria for per-partition automatic failover, then the SDK will automatically try to promote the next available read region to a write region.
...
Strong consistency is supported in this iteration of development because it guarantees reads from the most recent committed version of an item. Although other consistency levels are pretermitted for now, there are plans to support them in the future.
I would reword this. For example:
The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to be able to respond correctly to a backend partition failing over for single-region write accounts in order to achieve higher availability. When a backend partition, either server or master, fails over, the SDK needs to detect this condition automatically and redirect subsequent write requests to the new write region for the partition. Per-partition automatic failover is initially rolled out only for select Strong Consistency accounts, but will later be available for all consistency levels.
Criteria for per-partition automatic failover
Q: What do you mean by "Pretermitted HTTP sub statuses"? Are these HTTP status codes for which we explicitly do not want to fail over?
More generally, the Cosmos DB Core team proposes a solution where we should always attempt to retry any error in a different region (based on a region priority list), possibly after first retrying on a different in-region replica, UNLESS the error is a very specific one which clearly indicates that we shouldn't retry in a different region (e.g., a split, where the backend returns 410.1002). One rationale for this approach is that the backend can't possibly know about every possible error condition, and there are many errors the backend is not in control of (the network stack and Service Fabric, to name just two). I also spoke to Fabian about this and he agrees; a couple of points from my discussion with him:
As for refreshing partition state, we propose that this is not done for PPAF. Once we establish that a partition has failed over (e.g., from region A to region B), the SDK should add an override for that partition (pk range); this override remains in place until we see failures in the failover region (e.g., 403.3 in the case of a "clean" failover), in which case we will retry regions again in priority order. Once we successfully establish a new region for the pk range, we either clear the override or add a new region as the override. What this means is that there is no need to talk to the PPAF "Fault Tolerant Store" (CASPaxos store); this is preferred given that the CASPaxos store is not scaled to handle high traffic loads, and furthermore, retries on a different replica and/or region are cheap. I spoke to Fabian about this as well and we are largely aligned here; one concern he raised is that it will add more "speculative" attempts, which may have an impact on customer RU consumption, for example; however, the fact that the customer needs to explicitly opt in to PPAF largely alleviates this.
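A minimal sketch of that override bookkeeping, assuming hypothetical type and member names (PartitionLevelOverrides, ResolveWriteEndpoint, and so on are illustrative, not the SDK's actual internals):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch of the per-partition override table described above.
public sealed class PartitionLevelOverrides
{
    // pkRangeId -> endpoint of the region the partition failed over to.
    private readonly ConcurrentDictionary<string, Uri> overrides = new();

    // Ordered by account/application preference; index 0 is the default write region.
    private readonly IReadOnlyList<Uri> regionPriorityList;

    public PartitionLevelOverrides(IReadOnlyList<Uri> regionPriorityList)
        => this.regionPriorityList = regionPriorityList;

    // Route to the override if one exists, otherwise to the default write region.
    public Uri ResolveWriteEndpoint(string pkRangeId)
        => this.overrides.TryGetValue(pkRangeId, out Uri overrideEndpoint)
            ? overrideEndpoint
            : this.regionPriorityList[0];

    // Called when a request to 'failedEndpoint' met the PPAF criteria
    // (e.g., a 403.3 indicating the region no longer accepts writes for the partition).
    // Returns the next region to try, or null when the priority list is exhausted.
    public Uri OnWriteForbidden(string pkRangeId, Uri failedEndpoint)
    {
        for (int i = 0; i < this.regionPriorityList.Count; i++)
        {
            if (this.regionPriorityList[i] == failedEndpoint
                && i + 1 < this.regionPriorityList.Count)
            {
                Uri next = this.regionPriorityList[i + 1];
                this.overrides[pkRangeId] = next; // stays until we fail in the new region too
                return next;
            }
        }

        return null;
    }

    // Called once writes succeed against the default write region again.
    public void ClearOverride(string pkRangeId)
        => this.overrides.TryRemove(pkRangeId, out _);
}
```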
Best practices such as retrying with exponential back-off should be employed. As for retrying writes, we should only do this on very specific error codes where we can be assured that the backend rejected the write (403.3, for example).
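To make the two comments above concrete, here is a hedged sketch of the implied retry shape; BackendException, the method names, and the initial back-off value are all assumptions, with only the 403.3 rule taken from the discussion above:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

// Assumed, simplified exception shape carrying the backend status/sub status.
public sealed class BackendException : Exception
{
    public HttpStatusCode StatusCode { get; init; }
    public int SubStatusCode { get; init; }
}

// Illustrative retry loop: cross-region retries in priority order with
// exponential back-off, and write retries permitted only on errors proving
// the backend rejected the write (403.3 WriteForbidden).
public static class PpafRetryPolicy
{
    private const int WriteForbidden = 3; // 403.3

    public static async Task<T> ExecuteAsync<T>(
        Func<Uri, Task<T>> sendAsync,
        IReadOnlyList<Uri> regionsInPriorityOrder,
        bool isWrite)
    {
        TimeSpan backoff = TimeSpan.FromMilliseconds(100);
        Exception lastError = new InvalidOperationException("No regions to try.");

        foreach (Uri region in regionsInPriorityOrder)
        {
            try
            {
                return await sendAsync(region);
            }
            catch (BackendException ex)
            {
                lastError = ex;

                // Writes may only be retried when the backend provably rejected them.
                bool backendRejectedWrite =
                    ex.StatusCode == HttpStatusCode.Forbidden &&
                    ex.SubStatusCode == WriteForbidden;

                if (isWrite && !backendRejectedWrite)
                {
                    throw;
                }

                await Task.Delay(backoff);
                backoff += backoff; // exponential back-off: 100ms, 200ms, 400ms, ...
            }
        }

        throw lastError;
    }
}
```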
Based on our discussion, @mikaelhoral-microsoft: does this apply to just the SDK, or also to the Routing Gateway? Because the Routing Gateway follows the same criteria as we do, looking for specific status/sub status codes.
Applies equally to Routing Gateway
As discussed, in "No per-partition automatic failover cases" let's point to https://msdata.visualstudio.com/CosmosDB/_workitems/edit/2475521
We are tracking SDK backlog items here: https://msdata.visualstudio.com/CosmosDB/_workitems/edit/2484362. These came out of PPAF testing conducted by the Backend team.
Purpose statement
This document proposes enhancing the Cosmos DB experience by achieving even higher availability through per-partition automatic failover.
Description: The Microsoft Azure Cosmos DB .NET SDK Version 3 plans to support per-partition automatic failover for data-plane and control-plane operations, which are requested against server and master partitions respectively, for strong consistency. There will be a separate document to address the Java SDK. It must also be understood that this scope reaches over to the Compute Gateway and Cassandra over Compute.
Tasks
Stakeholders
Resources
Out of scope
Scope of work
The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to achieve higher availability for strong consistency by implementing per-partition automatic failover for single-region write accounts. The premise is that if communication to a partition, either server or master, meets the criteria for per-partition automatic failover, then the SDK will automatically try to promote the next available read region to a write region. This must also work for the Cassandra API over Compute. Although other Cosmos DB APIs are not supported initially, the shared code base between the SDK client and Compute should make them work automatically.
Strong consistency is supported in this iteration of development because it guarantees reads from the most recent committed version of an item. Although other consistency levels are pretermitted for now, there are plans to support them in the future.
[insert visual aid here]
Before we continue, for clarity and level-setting: a server partition is a partition used for data-plane document operations. A master partition is a partition used for control-plane (database, collection, etc.) and metadata (account) operations, for building and accessing the location, collection, and partition key range caches. Next, we will talk about the criteria for per-partition automatic failover.
Criteria for per-partition automatic failover
It is important to note that other sub status codes exist for these statuses. There are offline conversations happening with collaborating teams to determine whether we need to expand our criteria for per-partition automatic failover or continue to pretermit them, but as of this moment, this is the complete list. The pretermitted HTTP statuses are listed below.
HTTP statuses
Modes, operations, and HTTP statuses
Pretermitted HTTP sub statuses
Service Unavailable (503)
  InsufficientBindablePartitions (1007)
  ComputeFederationNotFound (1012)
  OperationPaused (9001)
  ServiceIsOffline (9002)
  InsufficientCapacity (9003)
  ServerGenerated503 (21008)
Forbidden (403)
  ProvisionLimitReached (1005)
  DatabaseAccountNotFound (1008)
  RedundantCollectionPut (1009)
  SharedThroughputDatabaseQuotaExceeded (1010)
  SharedThroughputOfferGrowNotNeeded (1011)
  PartitionKeyQuotaOverLimit (1014)
  SharedThroughputDatabaseCollectionCountExceeded (1019)
  SharedThroughputDatabaseCountExceeded (1020)
  ComputeInternalError (1021)
  ThroughputCapQuotaExceeded (1028)
  InvalidThroughputCapValue (1029)
  RbacOperationNotSupported (5300)
  RbacUnauthorizedMetadataRequest (5301)
  RbacUnauthorizedNameBasedDataRequest (5302)
  RbacUnauthorizedRidBasedDataRequest (5303)
  RbacRidCannotBeResolved (5304)
  RbacMissingUserId (5305)
  RbacMissingAction (5306)
  RbacRequestWasNotAuthorized (5400)
  NspInboundDenied (5307)
  NspAuthorizationFailed (5308)
  NspNoResult (5309)
  NspInvalidParam (5310)
  NspInvalidEvalResult (5311)
  NspNotInitiated (5312)
  NspOperationNotSupported (5313)
Request Timeout (408)
It should also be noted that since certain Gone (410) HTTP statuses and sub statuses are converted to Service Unavailable (503), they are eligible for per-partition automatic failover, while others are not. Please refer to SdkDesign for more information.
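To make the criteria concrete, here is a small hypothetical helper, assuming that "pretermitted" means excluded from failover (as the question earlier in this thread suggests) and that the three statuses grouped above (503, 403, 408) are the qualifying ones; only a few pairs from the lists are shown:

```csharp
using System.Collections.Generic;
using System.Net;

// Hypothetical helper for the criteria above; not an SDK API.
public static class PerPartitionFailoverCriteria
{
    private static readonly HashSet<(HttpStatusCode Status, int SubStatus)> Pretermitted = new()
    {
        (HttpStatusCode.ServiceUnavailable, 1007),  // InsufficientBindablePartitions
        (HttpStatusCode.ServiceUnavailable, 9002),  // ServiceIsOffline
        (HttpStatusCode.ServiceUnavailable, 21008), // ServerGenerated503
        (HttpStatusCode.Forbidden, 1005),           // ProvisionLimitReached
        (HttpStatusCode.Forbidden, 5400),           // RbacRequestWasNotAuthorized
        // ... remaining pairs from the lists above omitted for brevity
    };

    // 503, 403, and 408 responses qualify unless their sub status is pretermitted.
    public static bool MeetsFailoverCriteria(HttpStatusCode status, int subStatus)
    {
        bool eligibleStatus = status is HttpStatusCode.ServiceUnavailable
            or HttpStatusCode.Forbidden
            or HttpStatusCode.RequestTimeout;

        return eligibleStatus && !Pretermitted.Contains((status, subStatus));
    }
}
```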
Current base architecture
Currently, the SDK supports per-partition automatic failover to other regions in a couple of ways that give us anywhere from limited to optimal support. Please refer to #2395.
The first is the control-plane metadata account information that is requested over HTTP via the global account endpoint. If the SDK client is cold, meaning it is being initialized for the first time, it has access to the regions/locations that were managed and configured at the account level. If the SDK client is hot, meaning it has already been initialized by a previous request, it has access to the regions/locations cached in the LocationCache, which avoids making further HTTP requests to the gateway endpoint. Certain triggers invoke a refresh.
The second is ApplicationPreferredRegions on CosmosClientOptions, which is set at design time by the customer within the SDK client.
When both of these are available to leverage, the SDK client gives you the most optimal form of per-partition automatic failover when the failure criteria are met. It is also important to note that the EnablePartitionLevelFailover boolean flag must be set to true on CosmosClientOptions in order for the per-partition automatic failover logic to be executed, as sketched below. Having just one or the other gives the SDK client limited per-partition automatic failover support. Having neither gives the SDK client no per-partition automatic failover support, which we will talk about next.
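For reference, a minimal configuration sketch with both levers set; the account endpoint and key are placeholders, and EnablePartitionLevelFailover may only be available in certain (preview) SDK versions:

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

// Both levers described above: a preferred-region list plus the opt-in flag
// that activates the per-partition automatic failover logic.
CosmosClient client = new CosmosClient(
    accountEndpoint: "https://myaccount.documents.azure.com:443/", // placeholder
    authKeyOrResourceToken: "<auth-key>",                          // placeholder
    clientOptions: new CosmosClientOptions
    {
        ApplicationPreferredRegions = new List<string> { Regions.WestUS2, Regions.EastUS2 },
        EnablePartitionLevelFailover = true, // PPAF logic is skipped when false
    });
```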
Here is a more detailed breakdown and analysis of the current baseline architecture.
No per-partition automatic failover cases
For those cases where the SDK client is cold, the criteria for per-partition automatic failover are met while attempting to request control-plane metadata (account) information to access regions/locations, and the customer has not set ApplicationPreferredRegions on CosmosClientOptions, there is no per-partition automatic failover support; this will usually result in an online support call or a manual failover. For clarity and level-setting, a manual failover is when a read region is intentionally and manually promoted to a write region via the Azure Portal, and the default/preferred write region that is offline is demoted to a read region when it comes back online. To learn more, please refer to High Availability.
[insert visual aid here]
Proposed solution
It would be advantageous to enhance per-partition automatic failover within the SDK client by introducing DNS TXT records that are both configured and managed by the current ARM management workflow. More on this below. The Routing Gateway team has already adopted this as a solution and is currently responsible for creating the DNS TXT records. It is up to the SDK team to enhance the SDK client to leverage these DNS TXT records in the event that there is no way to access account information from the gateway endpoints. The DNS TXT records will include the other regional account names, which the SDK client can iterate and cache once a successful request has been achieved. Next, we will talk about the two most reasonable solutions for querying DNS TXT records within the SDK client (a query sketch appears under "Shading DNS client inside of SDK client" below).
For clarity and level-setting, DNS TXT records are a type of Domain Name System (DNS) record in text format that contain information about a domain.
[insert visual aid here]
Branch
https://github.com/Azure/azure-cosmos-dotnet-v3/tree/users/philipthomas-MSFT/per-partition-failover-dns-query-txt-records
DNS TXT record
Key (Global database account endpoint)
Value
Configuration and management
Shading DNS client inside of SDK client
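As a sketch of what this option could look like, the query below uses the open-source DnsClient.NET library (a shaded copy would live inside the SDK client); the record name and helper are hypothetical, and the key/value layout follows the DNS TXT record stub above:

```csharp
using System.Linq;
using System.Threading.Tasks;
using DnsClient;           // open-source DnsClient.NET NuGet package
using DnsClient.Protocol;  // TxtRecords() extension

public static class AccountTxtRecordReader
{
    // Queries the TXT record keyed by the global database account endpoint
    // and returns the regional account names it carries, ready for the SDK
    // client to iterate and cache.
    public static async Task<string[]> GetRegionalAccountNamesAsync(string globalAccountDnsName)
    {
        var lookup = new LookupClient();
        IDnsQueryResponse response = await lookup.QueryAsync(globalAccountDnsName, QueryType.TXT);

        return response.Answers
            .TxtRecords()
            .SelectMany(record => record.Text) // a TXT record may hold several strings
            .ToArray();
    }
}

// Hypothetical usage:
// string[] regionalAccounts =
//     await AccountTxtRecordReader.GetRegionalAccountNamesAsync("myaccount.documents.azure.com");
```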
Shading DNS client inside hosted federated server
Further below is a larger, exhaustive table of DNS solutions, each of which, in one way or another, has more pros than cons.
Open-source software
Performance
Security
Areas of impact
Supportability
Client telemetry
Distributed tracing
TBD
Diagnostic logging
Sample Diagnostics
Testing
Use cases/scenarios
Please use Gherkin syntax (Given, When, and Then); a sample sketch follows the critical-path list below.
Critical paths where there is no per-partition automatic failover support in the current baseline architecture
Cold SDK client, explicit data-plane operation, implicit control-plane metadata (account) point of failure
Cold SDK client, explicit control-plane operation, implicit control-plane metadata (account) point of failure
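For instance, the first critical path could be sketched along these lines (the steps and wording are placeholders, not the final scenario):

```gherkin
Feature: Per-partition automatic failover for a cold SDK client

  Scenario: Cold SDK client, explicit data-plane operation, implicit control-plane metadata (account) point of failure
    Given a cold SDK client with no ApplicationPreferredRegions configured
    And EnablePartitionLevelFailover is set to true
    And the global account endpoint is unreachable
    When the client issues a data-plane document operation
    And the implicit control-plane metadata (account) request meets the per-partition automatic failover criteria
    Then the client resolves the regional account endpoints from the DNS TXT record
    And the operation is retried against the next region in priority order
```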
Unit (Gated pipeline)
Emulator (Gated pipeline)
Performance/Benchmarking (Gated pipeline)
Security/Penetration (Gated pipeline)
DNS solution comparison matrix