Scope of work section
Suggestion:
The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to achieve higher availability for strong consistency by implementing per-partition automatic failover for single-region write accounts. The premise is that if communication to a partition, either server or master, meets the criteria for per-partition automatic failover, then the SDK will automatically try to promote the next available read region to a write region.
...
Strong consistency is supported in this iteration of development because it guarantees reads from the most recent committed version of an item. Although other consistency levels are pretermitted for now, there are plans to support them in the future.
I would reword this. For example:
The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to be able to respond correctly to a backend partition failing over for single-region write accounts in order to achieve higher availability. When a backend partition, either server or master, fails over, the SDK needs to detect this condition automatically and redirect subsequent write requests to the new write region for the partition. Per-partition automatic failover is initially rolled out only for select Strong Consistency accounts, but will later be available for all consistency levels.
Criteria for per-partition automatic failover
Q: What do you mean by "Pretermitted HTTP sub statuses"? Are these HTTP status codes for which we explicitly do not want to fail over?
More generally, the Cosmos DB Core team proposes a solution where we should always attempt to retry any error in a different region (based on a region priority list), possibly after first retrying on a different in-region replica, UNLESS the error is a very specific one which clearly indicates that we shouldn't retry in a different region (e.g., a split, where the backend returns 410.1002). One rationale for this approach is that the backend can't possibly know about every possible error condition, and there are many errors the backend is not in control of (the network stack and Service Fabric, to name just two). I also spoke to Fabian about this and he agrees; a couple of points from my discussion with him:
As for refreshing partition state, we propose that this is not done for PPAF. Once we establish that a partition has failed over (e.g., from region A to region B), the SDK should add an override for that partition (pk range); this override remains in place until we see failures in the failover region (e.g., 403.3 in the case of a "clean" failover), in which case we will retry regions again in priority order. Once we successfully establish a new region for the pk range, we either clear the override or add a new region as the override. What this means is that there is no need to talk to the PPAF "Fault Tolerant Store" (CASPaxos store); this is preferred given that the CASPaxos store is not scaled to handle high traffic loads, and furthermore, retries on a different replica and/or region are cheap. I spoke to Fabian about this as well and we are largely aligned here; one concern he raised is that it will add more "speculative" attempts, which may have an impact on customer RU consumption, for example; however, the fact that the customer needs to explicitly opt in to PPAF largely alleviates this.
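A minimal sketch of that override bookkeeping, assuming hypothetical type and member names (PartitionLevelOverrides, ResolveWriteEndpoint, and so on are illustrative, not the SDK's actual internals):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch of the per-partition override table described above.
public sealed class PartitionLevelOverrides
{
    // pkRangeId -> endpoint of the region the partition failed over to.
    private readonly ConcurrentDictionary<string, Uri> overrides = new();

    // Ordered by account/application preference; index 0 is the default write region.
    private readonly IReadOnlyList<Uri> regionPriorityList;

    public PartitionLevelOverrides(IReadOnlyList<Uri> regionPriorityList)
        => this.regionPriorityList = regionPriorityList;

    // Route to the override if one exists, otherwise to the default write region.
    public Uri ResolveWriteEndpoint(string pkRangeId)
        => this.overrides.TryGetValue(pkRangeId, out Uri overrideEndpoint)
            ? overrideEndpoint
            : this.regionPriorityList[0];

    // Called when a request to 'failedEndpoint' met the PPAF criteria
    // (e.g., a 403.3 indicating the region no longer accepts writes for the partition).
    // Returns the next region to try, or null when the priority list is exhausted.
    public Uri OnWriteForbidden(string pkRangeId, Uri failedEndpoint)
    {
        for (int i = 0; i < this.regionPriorityList.Count; i++)
        {
            if (this.regionPriorityList[i] == failedEndpoint
                && i + 1 < this.regionPriorityList.Count)
            {
                Uri next = this.regionPriorityList[i + 1];
                this.overrides[pkRangeId] = next; // stays until we fail in the new region too
                return next;
            }
        }

        return null;
    }

    // Called once writes succeed against the default write region again.
    public void ClearOverride(string pkRangeId)
        => this.overrides.TryRemove(pkRangeId, out _);
}
```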
Best practices such as retrying with exponential back-off should be employed. As for retrying writes, we should only do this on very specific error codes where we can be assured that the backend rejected the write (403.3, for example).
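To make the two comments above concrete, here is a hedged sketch of the implied retry shape; BackendException, the method names, and the initial back-off value are all assumptions, with only the 403.3 rule taken from the discussion above:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

// Assumed, simplified exception shape carrying the backend status/sub status.
public sealed class BackendException : Exception
{
    public HttpStatusCode StatusCode { get; init; }
    public int SubStatusCode { get; init; }
}

// Illustrative retry loop: cross-region retries in priority order with
// exponential back-off, and write retries permitted only on errors proving
// the backend rejected the write (403.3 WriteForbidden).
public static class PpafRetryPolicy
{
    private const int WriteForbidden = 3; // 403.3

    public static async Task<T> ExecuteAsync<T>(
        Func<Uri, Task<T>> sendAsync,
        IReadOnlyList<Uri> regionsInPriorityOrder,
        bool isWrite)
    {
        TimeSpan backoff = TimeSpan.FromMilliseconds(100);
        Exception lastError = new InvalidOperationException("No regions to try.");

        foreach (Uri region in regionsInPriorityOrder)
        {
            try
            {
                return await sendAsync(region);
            }
            catch (BackendException ex)
            {
                lastError = ex;

                // Writes may only be retried when the backend provably rejected them.
                bool backendRejectedWrite =
                    ex.StatusCode == HttpStatusCode.Forbidden &&
                    ex.SubStatusCode == WriteForbidden;

                if (isWrite && !backendRejectedWrite)
                {
                    throw;
                }

                await Task.Delay(backoff);
                backoff += backoff; // exponential back-off: 100ms, 200ms, 400ms, ...
            }
        }

        throw lastError;
    }
}
```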
Based on our discussion, @mikaelhoral-microsoft: does this apply to just the SDK, or also to the Routing Gateway? Because the Routing Gateway follows the same criteria as we do, looking for specific status/sub status codes.
Applies equally to Routing Gateway
As discussed, in "No per-partition automatic failover cases" let's point to https://msdata.visualstudio.com/CosmosDB/_workitems/edit/2475521
We are tracking SDK backlog items here: https://msdata.visualstudio.com/CosmosDB/_workitems/edit/2484362. These came out of PPAF testing conducted by the Backend team.
Purpose statement
This document proposes enhancing the Cosmos DB experience by achieving even higher availability through per-partition automatic failover.
Description: The Microsoft Azure Cosmos DB .NET SDK Version 3 plans to support per-partition automatic failover for data-plane and control-plane operations, which are requested against server and master partitions respectively, for strong consistency. There will be a separate document to address the Java SDK. It must also be understood that this scope reaches over to the Compute Gateway and Cassandra over Compute.
Tasks
Stakeholders
Resources
Out of scope
Scope of work
The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to achieve higher availability for strong consistency by implementing per-partition automatic failover for single-region write accounts. The premise is that if communication to a partition, either server or master, meets the criteria for per-partition automatic failover, then the SDK will automatically try to promote the next available read region to a write region. This must also work for the Cassandra API over Compute. Although other Cosmos DB APIs are not supported initially, the shared code base between the SDK client and Compute should make them work automatically.
Strong consistency is supported in this iteration of development because it guarantees reads from the most recent committed version of an item. Although other consistency levels are pretermitted for now, there are plans to support them in the future.
[insert visual aid here]
Before we continue, for clarity and level-setting: a server partition is a partition used for data-plane document operations. A master partition is a partition used for control-plane (database, collection, etc.) and metadata (account) operations, for building and accessing the location, collection, and partition key range caches. Next, we will talk about the criteria for per-partition automatic failover.
Criteria for per-partition automatic failover
It is important to note that other sub status codes exist for these statuses. There are offline conversations happening with collaborating teams to determine whether we need to expand our criteria for per-partition automatic failover or continue to pretermit them, but as of this moment, this is the complete list. The pretermitted HTTP statuses are listed below.
HTTP statuses
Modes, operations, and HTTP statuses
Pretermitted HTTP sub statuses
Service Unavailable (503)
  InsufficientBindablePartitions (1007)
  ComputeFederationNotFound (1012)
  OperationPaused (9001)
  ServiceIsOffline (9002)
  InsufficientCapacity (9003)
  ServerGenerated503 (21008)
Forbidden (403)
  ProvisionLimitReached (1005)
  DatabaseAccountNotFound (1008)
  RedundantCollectionPut (1009)
  SharedThroughputDatabaseQuotaExceeded (1010)
  SharedThroughputOfferGrowNotNeeded (1011)
  PartitionKeyQuotaOverLimit (1014)
  SharedThroughputDatabaseCollectionCountExceeded (1019)
  SharedThroughputDatabaseCountExceeded (1020)
  ComputeInternalError (1021)
  ThroughputCapQuotaExceeded (1028)
  InvalidThroughputCapValue (1029)
  RbacOperationNotSupported (5300)
  RbacUnauthorizedMetadataRequest (5301)
  RbacUnauthorizedNameBasedDataRequest (5302)
  RbacUnauthorizedRidBasedDataRequest (5303)
  RbacRidCannotBeResolved (5304)
  RbacMissingUserId (5305)
  RbacMissingAction (5306)
  RbacRequestWasNotAuthorized (5400)
  NspInboundDenied (5307)
  NspAuthorizationFailed (5308)
  NspNoResult (5309)
  NspInvalidParam (5310)
  NspInvalidEvalResult (5311)
  NspNotInitiated (5312)
  NspOperationNotSupported (5313)
Request Timeout (408)
It should also be noted that since certain Gone (410) HTTP statuses and sub statuses are converted to Service Unavailable (503), they are eligible for per-partition automatic failover, while others are not. Please refer to SdkDesign for more information.
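To make the criteria concrete, here is a small hypothetical helper, assuming that "pretermitted" means excluded from failover (as the question earlier in this thread suggests) and that the three statuses grouped above (503, 403, 408) are the qualifying ones; only a few pairs from the lists are shown:

```csharp
using System.Collections.Generic;
using System.Net;

// Hypothetical helper for the criteria above; not an SDK API.
public static class PerPartitionFailoverCriteria
{
    private static readonly HashSet<(HttpStatusCode Status, int SubStatus)> Pretermitted = new()
    {
        (HttpStatusCode.ServiceUnavailable, 1007),  // InsufficientBindablePartitions
        (HttpStatusCode.ServiceUnavailable, 9002),  // ServiceIsOffline
        (HttpStatusCode.ServiceUnavailable, 21008), // ServerGenerated503
        (HttpStatusCode.Forbidden, 1005),           // ProvisionLimitReached
        (HttpStatusCode.Forbidden, 5400),           // RbacRequestWasNotAuthorized
        // ... remaining pairs from the lists above omitted for brevity
    };

    // 503, 403, and 408 responses qualify unless their sub status is pretermitted.
    public static bool MeetsFailoverCriteria(HttpStatusCode status, int subStatus)
    {
        bool eligibleStatus = status is HttpStatusCode.ServiceUnavailable
            or HttpStatusCode.Forbidden
            or HttpStatusCode.RequestTimeout;

        return eligibleStatus && !Pretermitted.Contains((status, subStatus));
    }
}
```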
Current base architecture
Currently, the SDK supports per-partition automatic failover to other regions in a couple of ways that give us anywhere from limited to optimal support. Please refer to #2395.
The first is the control-plane metadata account information that is requested over HTTP via the global account endpoint. If the SDK client is cold, meaning it is being initialized for the first time, it has access to the regions/locations that were managed and configured at the account level. If the SDK client is hot, meaning it has already been initialized by a previous request, it has access to the regions/locations cached in the LocationCache, which avoids making further HTTP requests to the gateway endpoint. Certain triggers invoke a refresh.
The second is ApplicationPreferredRegions on CosmosClientOptions, which is set at design time by the customer within the SDK client.
When both of these are available to leverage, the SDK client gives you the most optimal form of per-partition automatic failover when the failure criteria are met. It is also important to note that the EnablePartitionLevelFailover boolean flag must be set to true on CosmosClientOptions in order for the per-partition automatic failover logic to be executed, as sketched below. Having just one or the other gives the SDK client limited per-partition automatic failover support. Having neither gives the SDK client no per-partition automatic failover support, which we will talk about next.
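For reference, a minimal configuration sketch with both levers set; the account endpoint and key are placeholders, and EnablePartitionLevelFailover may only be available in certain (preview) SDK versions:

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

// Both levers described above: a preferred-region list plus the opt-in flag
// that activates the per-partition automatic failover logic.
CosmosClient client = new CosmosClient(
    accountEndpoint: "https://myaccount.documents.azure.com:443/", // placeholder
    authKeyOrResourceToken: "<auth-key>",                          // placeholder
    clientOptions: new CosmosClientOptions
    {
        ApplicationPreferredRegions = new List<string> { Regions.WestUS2, Regions.EastUS2 },
        EnablePartitionLevelFailover = true, // PPAF logic is skipped when false
    });
```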
Here is a more detailed breakdown and analysis of the current baseline architecture.
No per-partition automatic failover cases
For those cases where the SDK client is cold, the criteria for per-partition automatic failover are met while attempting to request control-plane metadata (account) information to access regions/locations, and the customer has not set ApplicationPreferredRegions on CosmosClientOptions, there is no per-partition automatic failover support; this will usually result in an online support call or a manual failover. For clarity and level-setting, a manual failover is when a read region is intentionally and manually promoted to a write region via the Azure Portal, and the default/preferred write region that is offline is demoted to a read region when it comes back online. To learn more, please refer to High Availability.
[insert visual aid here]
Proposed solution
It would be advantageous to enhance per-partition automatic failover within the SDK client by introducing DNS TXT records that are both configured and managed by the current ARM management workflow. More on this below. The Routing Gateway team has already adopted this as a solution and is currently responsible for creating the DNS TXT records. It is up to the SDK team to enhance the SDK client to leverage these DNS TXT records in the event that there is no way to access account information from the gateway endpoints. The DNS TXT records will include the other regional account names, which the SDK client can iterate and cache once a successful request has been achieved. Next, we will talk about the two most reasonable solutions for querying DNS TXT records within the SDK client (a query sketch appears under "Shading DNS client inside of SDK client" below).
For clarity and level-setting, DNS TXT records are a type of Domain Name System (DNS) record in text format that contain information about a domain.
[insert visual aid here]
Branch
https://github.com/Azure/azure-cosmos-dotnet-v3/tree/users/philipthomas-MSFT/per-partition-failover-dns-query-txt-records
DNS TXT record
Key (Global database account endpoint)
Value
Configuration and management
Shading DNS client inside of SDK client
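As a sketch of what this option could look like, the query below uses the open-source DnsClient.NET library (a shaded copy would live inside the SDK client); the record name and helper are hypothetical, and the key/value layout follows the DNS TXT record stub above:

```csharp
using System.Linq;
using System.Threading.Tasks;
using DnsClient;           // open-source DnsClient.NET NuGet package
using DnsClient.Protocol;  // TxtRecords() extension

public static class AccountTxtRecordReader
{
    // Queries the TXT record keyed by the global database account endpoint
    // and returns the regional account names it carries, ready for the SDK
    // client to iterate and cache.
    public static async Task<string[]> GetRegionalAccountNamesAsync(string globalAccountDnsName)
    {
        var lookup = new LookupClient();
        IDnsQueryResponse response = await lookup.QueryAsync(globalAccountDnsName, QueryType.TXT);

        return response.Answers
            .TxtRecords()
            .SelectMany(record => record.Text) // a TXT record may hold several strings
            .ToArray();
    }
}

// Hypothetical usage:
// string[] regionalAccounts =
//     await AccountTxtRecordReader.GetRegionalAccountNamesAsync("myaccount.documents.azure.com");
```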
Shading DNS client inside hosted federated server
Further below is a larger, exhaustive table of DNS solutions, each of which, in one way or another, has more pros than cons.
Open-source software
Performance
Security
Areas of impact
Supportability
Client telemetry
Distributed tracing
TBD
Diagnostic logging
Sample Diagnostics
Testing
Use cases/scenarios
Please use Gherkin syntax (Given, When, and Then); a sample sketch follows the critical-path list below.
Critical paths where there is no per-partition automatic failover support in the current baseline architecture
Cold SDK client, explicit data-plane operation, implicit control-plane metadata (account) point of failure
Cold SDK client, explicit control-plane operation, implicit control-plane metadata (account) point of failure
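For instance, the first critical path could be sketched along these lines (the steps and wording are placeholders, not the final scenario):

```gherkin
Feature: Per-partition automatic failover for a cold SDK client

  Scenario: Cold SDK client, explicit data-plane operation, implicit control-plane metadata (account) point of failure
    Given a cold SDK client with no ApplicationPreferredRegions configured
    And EnablePartitionLevelFailover is set to true
    And the global account endpoint is unreachable
    When the client issues a data-plane document operation
    And the implicit control-plane metadata (account) request meets the per-partition automatic failover criteria
    Then the client resolves the regional account endpoints from the DNS TXT record
    And the operation is retried against the next region in priority order
```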
Unit (Gated pipeline)
Emulator (Gated pipeline)
Performance/Benchmarking (Gated pipeline)
Security/Penetration (Gated pipeline)
DNS solution comparison matrix