Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

[Internal] Researching capabilities to support PPAF for .NET, Java SDK and Compute Gateway #3499

Closed philipthomas-MSFT closed 4 months ago

philipthomas-MSFT commented 1 year ago

Purpose statement

This is a document to enhance the Cosmos DB experience by achieving even higher availability.

Description: The Microsoft Azure Cosmos DB .NET SDK Version 3 plans to support per-partition automatic failover for data-plane and control-plane operations, which are requested from server and master partitions respectively, for accounts using strong consistency. There will be a separate document to address the Java SDK. It must also be understood that this scope extends to Compute Gateway and Cassandra over Compute.

Tasks

Stakeholders

Resources

Out of scope

Scope of work

The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to achieve higher availability for strong consistency by implementing per-partition automatic failover for single-region write accounts. The premise is that if communication with a partition, either server or master, meets the criteria for per-partition automatic failover, then the SDK will automatically try to promote the next available read region to a write region. This must also work for Cassandra API over Compute. Although other Cosmos DB APIs are not supported initially, the code base shared between the SDK client and Compute should automagically work for them.

Strong consistency is supported in this iteration of development due to its guaranteed reads from the most recent committed version of an item. Although other consistency levels are pretermitted, there will be plans to support them in the future.

[insert visual aid here]

Before we continue, for clarity and level-setting, a server partition is a partition that is used for data-plane document operations. A master partition is a partition used for control-plane (database, collection, etc.) and metadata (account) operations for building and accessing the location, collection and partition key range caches. Next, we will talk about the criteria for per-partition automatic failover.

Criteria for per-partition automatic failover

It is important to note that other sub status codes exist for these statuses. Offline conversations are happening with collaborating teams to determine whether we need to expand our criteria for per-partition automatic failover or continue to pretermit them, but as of this moment, this is the complete list. I will list the pretermitted HTTP statuses below.

It should also be noted that, since certain Gone (410) HTTP statuses and sub statuses are converted to Service Unavailable (503), they are eligible for per-partition automatic failover while others are not. Please refer to SdkDesign for more information.

Current base architecture

Currently, we support per-partition automatic failover to regions in a couple of ways, which give us anywhere from limited to optimal support for successful per-partition automatic failover. Please refer to #2395.

The first is the control-plane metadata (account) information that is requested over HTTP via the global account endpoint. If the SDK client is cold, meaning it is being initialized for the first time, the SDK client has access to the regions/locations that were managed and configured at the account level. If the SDK client is hot, meaning it has already been initialized on a previous request, the SDK client has access to the regions/locations cached in LocationCache, which avoids making further HTTP requests to the gateway endpoint. There are some triggers that invoke a refresh.

The second is the ApplicationPreferredRegions on CosmosClientOptions, which the customer sets at design time within the SDK client.

When both of these are available, the SDK client gives you the most optimal form of per-partition automatic failover when the failure criteria are met. It is also important to note that the EnablePartitionLevelFailover boolean flag must be set to true on CosmosClientOptions in order for the per-partition automatic failover logic to be executed. Having just one or the other gives the SDK client limited per-partition automatic failover support. Having neither gives the SDK client no per-partition automatic failover support, and we will talk about that next.
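As a rough illustration of the three support levels described above, here is a minimal, language-neutral sketch in Python. It is not SDK code: `ClientState` and `failover_support` are hypothetical names, and the only inputs modeled are the ones this section mentions (the EnablePartitionLevelFailover flag, account regions from the gateway or LocationCache, and ApplicationPreferredRegions).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientState:
    """Hypothetical snapshot of the inputs this section describes."""
    # Mirrors the EnablePartitionLevelFailover flag on CosmosClientOptions.
    enable_partition_level_failover: bool
    # True when account regions are known (fetched from the gateway or cached).
    has_account_regions: bool
    # True when the customer set ApplicationPreferredRegions.
    has_preferred_regions: bool


def failover_support(state: ClientState) -> str:
    """Return 'none', 'limited' or 'optimal' as described in the text."""
    if not state.enable_partition_level_failover:
        # Without the flag the failover logic is never executed.
        return "none"
    sources = int(state.has_account_regions) + int(state.has_preferred_regions)
    return {0: "none", 1: "limited", 2: "optimal"}[sources]
```

For example, a cold client that cannot reach the account endpoint and has no ApplicationPreferredRegions evaluates to `"none"`, which is the case the next section covers.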

Here is a more detailed breakdown and analysis of the current baseline architecture.

No per-partition automatic failover cases

For those cases where the SDK client is cold and the criteria for per-partition automatic failover are met while attempting to request control-plane metadata (account) information to access regions/locations, and the customer has not set ApplicationPreferredRegions on CosmosClientOptions, there is no per-partition automatic failover support, which will usually result in an online support call or a manual failover. For clarity and level-setting, a manual failover is when a read region is intentionally and manually promoted to a write region via the Azure Portal, and the default/preferred write region that is offline is demoted to a read region when it comes back online. To learn more, please refer to High Availability.

[insert visual aid here]

Proposed solution

It would be advantageous to enhance per-partition automatic failover within the SDK client by introducing DNS TXT records that are both configured and managed by the current ARM management workflow. More on this below. The routing gateway team has already adopted this as a solution and is currently responsible for creating the DNS TXT records. It is up to the SDK team to enhance the SDK client to leverage these DNS TXT records in the event that there is no way to access account information from the gateway endpoints. The DNS TXT records will include other regional account names that the SDK client can iterate over and cache once a successful request has been achieved. Next, we will talk about the 2 most reasonable solutions for querying DNS TXT records within the SDK client.

For clarity and level-setting, DNS TXT records are a type of Domain Name System (DNS) record in text format, which contain information about your domain.

[insert visual aid here]

Branch

https://github.com/Azure/azure-cosmos-dotnet-v3/tree/users/philipthomas-MSFT/per-partition-failover-dns-query-txt-records

DNS TXT record

Key (Global database account endpoint)

testaccount.srd.documents.azure.com

Value

{
    "domainName": "documents.azure.com",
    "globalDatabaseAccountName": "testaccount",
    "orderedRegionalAccountNames": [
        "testaccount-wus",
        "testaccount-eus",
        "testaccount-scus"
    ]
}
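
To make the record shape concrete, the following is a small, language-neutral sketch in Python that parses a TXT value like the one above into an ordered list of candidate regional endpoints. It is only an illustration: the function name `regional_endpoints` is hypothetical, and the endpoint format `https://<regionalAccountName>.<domainName>/` is an assumption that is not specified in this document.

```python
import json

# The TXT record value shown above, as the raw string a DNS query would return.
SAMPLE_TXT_VALUE = """
{
    "domainName": "documents.azure.com",
    "globalDatabaseAccountName": "testaccount",
    "orderedRegionalAccountNames": [
        "testaccount-wus",
        "testaccount-eus",
        "testaccount-scus"
    ]
}
"""


def regional_endpoints(txt_value: str) -> list:
    """Build candidate regional endpoints, preserving the record's order."""
    record = json.loads(txt_value)
    domain = record["domainName"]
    # Assumed endpoint shape: https://<regionalAccountName>.<domainName>/
    return [
        f"https://{name}.{domain}/"
        for name in record["orderedRegionalAccountNames"]
    ]


endpoints = regional_endpoints(SAMPLE_TXT_VALUE)
```

Keeping the list in record order matters because the field is named `orderedRegionalAccountNames`, which suggests the order expresses failover priority.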

Configuration and management

Shading DNS client inside of SDK client

Shading DNS client inside hosted federated server

Further below is a larger, "exhaustive" table of DNS solutions, each of which, in one way or another, has more pros than cons.

Open-source software

Performance

Security

Areas of impact

Supportability

Client telemetry
Sample Diagnostics
  
    {
        "Summary": {
            "DirectCalls": {
                "(201, 0)": 1
            }
        },
        "name": "CreateItemAsync",
        "start datetime": "2023-06-08T14:58:45.537Z",
        "duration in milliseconds": 0.3899,
        "data": {
            "Client Configuration": {
                "Client Created Time Utc": "2023-06-08T14:58:45.1839346Z",
                "MachineId": "hashedMachineName:25bbdc53-3a51-4190-8877-5eafa4f5e7ac",
                "NumberOfClientsCreated": 1,
                "NumberOfActiveClients": 1,
                "ConnectionMode": "Direct",
                "User Agent": "cosmos-netstandard-sdk/3.34.0|1|X64|Microsoft Windows 10.0.22621|.NET 6.0.16|L|F 00000010|",
                "ConnectionConfig": {
                    "gw": "(cps:50, urto:10, p:False, httpf: True)",
                    "rntbd": "(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)",
                    "other": "(ed:False, be:False)"
                },
                "ConsistencyConfig": "(consistency: Strong, prgns:[East US, West US], apprgn: )",
                "ProcessorCount": 12
            }
        },
        "children": [
            {
                "name": "ItemSerialize",
                "duration in milliseconds": 0.0313
            },
            {
                "name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
                "duration in milliseconds": 0.3051,
                "children": [
                    {
                        "name": "Get Collection Cache",
                        "duration in milliseconds": 0.0004
                    },
                    {
                        "name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
                        "duration in milliseconds": 0.248,
                        "data": {
                            "System Info": {
                                "systemHistory": [
                                    {
                                        "dateUtc": "2023-06-08T14:58:45.1271989Z",
                                        "cpu": 3.442,
                                        "memory": 32641284.0,
                                        "threadInfo": {
                                            "isThreadStarving": "no info",
                                            "availableThreads": 32764,
                                            "minThreads": 12,
                                            "maxThreads": 32767
                                        },
                                        "numberOfOpenTcpConnection": 0
                                    }
                                ]
                            }
                        },
                        "children": [
                            {
                                "name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
                                "duration in milliseconds": 0.2427,
                                "children": [
                                    {
                                        "name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
                                        "duration in milliseconds": 0.2367,
                                        "children": [
                                            {
                                                "name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler",
                                                "duration in milliseconds": 0.2353,
                                                "children": [
                                                    {
                                                        "name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
                                                        "duration in milliseconds": 0.1958,
                                                        "data": {
                                                            "Client Side Request Stats": {
                                                                "Id": "AggregatedClientSideRequestStatistics",
                                                                "ContactedReplicas": [
                                                                    {
                                                                        "Count": 1,
                                                                        "Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859499990p/"
                                                                    },
                                                                    {
                                                                        "Count": 1,
                                                                        "Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859470000s/"
                                                                    },
                                                                    {
                                                                        "Count": 1,
                                                                        "Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859471110s/"
                                                                    },
                                                                    {
                                                                        "Count": 1,
                                                                        "Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859472220s/"
                                                                    }
                                                                ],
                                                                "RegionsContacted": [],
                                                                "FailedReplicas": [],
                                                                "AddressResolutionStatistics": [],
                                                                "StoreResponseStatistics": [
                                                                    {
                                                                        "ResponseTimeUTC": "2023-06-08T14:58:45.5379391Z",
                                                                        "ResourceType": "Document",
                                                                        "OperationType": "Create",
                                                                        "LocationEndpoint": "https://testserviceunavailableexceptionscenarioasync-westus.documents.azure.com/",
                                                                        "StoreResult": {
                                                                            "ActivityId": "5abbc286-b504-431f-85f5-00f595a9644c",
                                                                            "StatusCode": "Created",
                                                                            "SubStatusCode": "Unknown",
                                                                            "LSN": 58593,
                                                                            "PartitionKeyRangeId": "1",
                                                                            "GlobalCommittedLSN": 58593,
                                                                            "ItemLSN": -1,
                                                                            "UsingLocalLSN": false,
                                                                            "QuorumAckedLSN": -1,
                                                                            "SessionToken": null,
                                                                            "CurrentWriteQuorum": -1,
                                                                            "CurrentReplicaSetSize": -1,
                                                                            "NumberOfReadRegions": -1,
                                                                            "IsValid": true,
                                                                            "StorePhysicalAddress": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859499990p/",
                                                                            "RequestCharge": 0,
                                                                            "RetryAfterInMs": null,
                                                                            "BELatencyInMs": null,
                                                                            "transportRequestTimeline": null,
                                                                            "TransportException": null
                                                                        }
                                                                    }
                                                                ]
                                                            }
                                                        }
                                                    }
                                                ]
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    }
                ]
            },
            {
                "name": "Response Serialization",
                "duration in milliseconds": 0.0283
            }
        ]
    }
  
 

Testing

Use cases/scenarios

Please use Gherkin syntax (Given, When and Then)

Critical paths where there is no per-partition automatic failover support in the current baseline architecture

Cold SDK client, explicit data-plane operation, implicit control-plane metadata (account) point of failure

Given a customer wants to create a new item (CreateItemAsync) using a cold Microsoft Azure Cosmos DB .NET SDK client,
    And the CosmosClientOptions has EnablePartitionLevelFailover set to true,
    And the ConnectionMode is set to Direct or Gateway mode,
    And the CosmosClientOptions does not have ApplicationPreferredRegions set,
When the SDK client attempts to request control-plane metadata (account) information from the global database account gateway endpoint,
    And the response is a ServiceUnavailable/Unknown, Forbidden/WriteForbidden, or RequestTimeout HTTP status,
Then the SDK client will query DNS to get regional account names from the DNS TXT record based on the global database account endpoint,
    And the SDK client will iterate over, validate and attempt to communicate with each regional account name in the DNS TXT record until a read region is found online,
    And the SDK client will promote that read region to primary write region,
    And the SDK client will cache the primary write region,
    And the original write region will demote to a read region once it is online.

Cold SDK client, explicit control-plane operation, implicit control-plane metadata (account) point of failure

Given a customer wants to create a new collection (CreateCollectionAsync) using a cold Microsoft Azure Cosmos DB .NET SDK client,
    And the CosmosClientOptions has EnablePartitionLevelFailover set to true,
    And the ConnectionMode is set to Direct or Gateway mode,
    And the CosmosClientOptions does not have ApplicationPreferredRegions set,
When the SDK client attempts to request control-plane metadata (account) information from the global database account gateway endpoint,
    And the response is a ServiceUnavailable/Unknown, Forbidden/WriteForbidden, or RequestTimeout HTTP status,
Then the SDK client will query DNS to get regional account names from the DNS TXT record based on the global database account endpoint,
    And the SDK client will iterate over, validate and attempt to communicate with each regional account name in the DNS TXT record until a read region is found online,
    And the SDK client will promote that read region to primary write region,
    And the SDK client will cache the primary write region,
    And the original write region will demote to a read region once it is online.
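
The Then steps shared by both scenarios can be sketched as follows. This is a language-neutral Python illustration under stated assumptions, not the SDK's implementation: `is_eligible`, `pick_write_region` and the `probe` callback are hypothetical names, and the status strings are simply the ones named in the When step.

```python
from typing import Callable, Optional, Sequence

# (status, sub status) pairs named in the When step of the scenarios.
ELIGIBLE_PAIRS = {
    ("ServiceUnavailable", "Unknown"),
    ("Forbidden", "WriteForbidden"),
}


def is_eligible(status: str, sub_status: str = "") -> bool:
    """True when the account request failed in a way the scenarios cover."""
    return status == "RequestTimeout" or (status, sub_status) in ELIGIBLE_PAIRS


def pick_write_region(
    ordered_names: Sequence[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first regional account name that responds, in record order.

    `probe` stands in for "validate and attempt to communicate"; how the
    real client would do this is outside the scope of this sketch.
    """
    for name in ordered_names:
        if probe(name):
            return name
    return None


# Usage: the first region is offline, so the second one is selected.
names = ["testaccount-wus", "testaccount-eus", "testaccount-scus"]
online = {"testaccount-eus", "testaccount-scus"}
selected = pick_write_region(names, lambda name: name in online)
```

The caller would then cache `selected` as the primary write region, as the last steps of each scenario describe. A `None` result means no region in the record was reachable.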

Unit (Gated pipeline)

Emulator (Gated pipeline)

Performance/Benchmarking (Gated pipeline)

Security/Penetration (Gated pipeline)

DNS solution comparison matrix

  A: Reference DnsClient.NET OSS libraries in SDK
  B: Use native API in SDK
  C: Shade DnsClient.NET OSS in SDK
  D: Ask DNS team to add query capabilities
Pros
  • Community support to address any issues and/or bugs
  • SDK team does not need to be a DNS expert
  • DNS querying is hidden from CX
  • Code is owned by the SDK team and would require a certain level of expertise
  • Can write code using iphlpapi.dll interop, specifically GetAdaptersAddresses and GetNetworkParams
  • DNS querying is hidden from CX
  • Code can exist within the Azure Cosmos DB .NET SDK
  • Code is owned by the SDK team and would require a certain level of expertise
  • No need to write own code to support DNS querying
  • DNS querying is hidden from CX
  • DNS querying is supported by the DNS team's existing library and not the SDK team, and can be leveraged for other initiatives that require this type of capability
  • No need to write own code to support DNS querying
  • DNS querying is hidden from CX
Cons
  • Dependency requires Azure Central approval
  • SDK team would have to become SMEs for DNS
  • This would be treated as a new project initiative
  • Must meet multiple cross functional requirements and standards (Performance, Security, operating system Supportability, etc.)
  • We would need to duplicate this across all Azure Cosmos DB SDKs
  • Any enhancements and bug fixes applied to the originating DnsClient.NET repo would need to be applied to the Azure Cosmos DB .NET SDK manually so staying fresh and up to date would be problematic
  • Java and other Azure Cosmos DB SDKs would need separate implementations and could not leverage DnsClient.NET for DNS querying for SRV records
  • If the DNS team agreed to this, we are forced into the availability to development and timeline to deliver which would drastically affect the timeline and milestones for this project initiative
 Notes DnsClient Github OS   DnsClient Github OS LookupClient  
 
  E: Add endpoint to ToolsFederation
  F: Add endpoint to Compute Gateway
  G: Callback Function
  H: Plugin Architecture
Pros
  • DNS querying is supported by the DNS team's existing library and not the SDK team, and can be leveraged for other initiatives that require this type of capability.
  • The implementation can be shared across all Azure Cosmos DB SDKs (.NET, Java, etc.)
  • The implementation can be shared across all Azure Cosmos DB SDKs (.NET, Java, etc.)
  • The signature and output would be consistent
  • The implementation can be shared across all Azure Cosmos DB SDKs (.NET, Java, etc.)
  • The signature and output would be consistent
  • Can use plugin using AssemblyDependencyResolver
Cons
  • It is considered a bad practice to perform DNS queries that are not resolved locally because of the options that system administrators could configure for DNS
    • looking for documentation to support this claim
  • It is considered a bad practice to perform DNS queries that are not resolved locally because of the options that system administrators could configure for DNS
    • looking for documentation to support this claim
  • Each CX is responsible for implementing "how" the DNS querying would work based on code the CX has to write
  • Strict guidelines would be difficult to enforce
  • The work required to actually achieve the SRV record is totally up to the CX
  • Supportability is less than desirable. I can foresee many support incidents coming to the SDK team
  • We would need to duplicate this across all Azure Cosmos DB SDKs
  • Each CX is responsible for implementing "how" the DNS querying would work based on code the CX has to write
  • Strict guidelines would be difficult to enforce
  • The work required to actually achieve the SRV record is totally up to the CX
  • Supportability is less than desirable. I can foresee many support incidents coming to the SDK team
  • We would need to duplicate this across all Azure Cosmos DB SDKs
 Notes
  E: Hosted on the ToolsFederation (same as the ClientTelemetry endpoint), owned by Pedro Balaguer. Need to negotiate who will do the work. Short term. Regional endpoints.
  F: Hosted on Compute Gateway, owned by Dinesh. Long term. Regional endpoints.
  G: We would have to give the customer some direction on how to write code to query DNS using a callback function.
  H: We would have to give the customer some direction on how to write code to query DNS using an IoC plugin architecture.
 
### Tasks
- [ ] https://github.com/Azure/azure-cosmos-dotnet-v3/issues/3978
- [ ] https://github.com/Azure/azure-cosmos-dotnet-v3/issues/3981
- [ ] https://github.com/Azure/azure-cosmos-dotnet-v3/issues/4181
- [ ] https://github.com/Azure/azure-cosmos-dotnet-v3/issues/4236
- [ ] https://github.com/Azure/azure-cosmos-dotnet-v3/issues/3977
mikaelhoral-microsoft commented 1 year ago

Scope of work section

Suggestion:

The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to achieve higher availability for strong consistency by implementing per-partition automatic failover for both single-region write accounts. The premise is that if communications to a partition, either server or master, meets the criterion for per-partition automatic failover, then the SDK will automatically try to promote the next available read region to a write region.

...

Strong consistency is supported in this iteration of development due to its guaranteed reads from the most recent committed version of an item. Although other consistency levels are pretermitted, there will be plans to support them in the future.

I would reword this. For example:

The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to be able to correctly respond to a backend partition failing over for single-region write accounts in order to achieve higher availability. Upon a backend partition - either server or master - failover, the SDK needs to automatically detect this condition and redirect subsequent write requests to the new write region for the partition. Per-partition automatic failover is initially rolled out only for select Strong Consistency accounts, but will later be available for all consistency levels.

mikaelhoral-microsoft commented 1 year ago

Criteria for per-partition automatic failover

Q: What do you mean by "Pretermitted HTTP sub statuses"? Are these HTTP status codes for which we explicitly do not want to failover?

mikaelhoral-microsoft commented 1 year ago

More generally, the Cosmos DB Core team proposes a solution where we always attempt to retry any error in a different region (based on a region priority list), possibly after first retrying on a different in-region replica, UNLESS the error is a very specific one that clearly indicates we shouldn't retry in a different region (e.g. a split where the backend returns 410.1002). One rationale for this approach is that the Backend can't possibly know about every possible error condition, and there are many errors the Backend has no control over (the network stack and Service Fabric, to name just a few). I also spoke to Fabian about this and he agrees; a couple of points from my discussion with him:

As for refreshing partition state, we propose that this is not done for PPAF. Once we establish that a partition has failed over (e.g., from region A to B), the SDK should add an override for that partition (pk range). This override remains in place until we see failures in the failover region (e.g. 403.3 in the case of a "clean" failover), in which case we retry regions again in priority order. Once we successfully establish a new region for the pk range, we either clear the override or add the new region as the override. What this means is that there is no need to talk to the PPAF "Fault Tolerant Store" (CASPaxos store); this is preferred given that the CASPaxos store is not scaled to handle high traffic loads, and furthermore retries on a different replica and/or region are cheap. I spoke to Fabian about this as well and we are largely aligned here. One concern he raised is that it will add more "speculative" attempts, which may have an impact on customer RU consumption, for example; however, the fact that the customer needs to explicitly opt in to PPAF largely alleviates this.
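
A minimal sketch of the override bookkeeping proposed above, in language-neutral Python. All names here (`PartitionRegionOverrides`, `try_region`) are hypothetical, and one detail is an assumption rather than something stated in this thread: the override is cleared when the partition lands back on the highest-priority region and replaced otherwise.

```python
from typing import Callable, Dict, Optional, Sequence


class PartitionRegionOverrides:
    """Tracks, per pk range, which region currently accepts requests."""

    def __init__(self, regions_by_priority: Sequence[str]) -> None:
        self._regions = list(regions_by_priority)
        self._overrides: Dict[str, str] = {}

    def region_for(self, pk_range_id: str) -> str:
        # No override means the highest-priority region is used.
        return self._overrides.get(pk_range_id, self._regions[0])

    def on_failure(
        self,
        pk_range_id: str,
        try_region: Callable[[str], bool],
    ) -> Optional[str]:
        """After a failure in the current region, retry in priority order."""
        failed = self.region_for(pk_range_id)
        for region in self._regions:
            if region == failed or not try_region(region):
                continue
            if region == self._regions[0]:
                # Assumption: back on the default region, so clear the override.
                self._overrides.pop(pk_range_id, None)
            else:
                self._overrides[pk_range_id] = region
            return region
        return None  # No region accepted the request; overrides are unchanged.
```

The override is purely client-side state, which is what removes the need to query the fault tolerant store on each failover.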

Best practices such as trying with exponential back-off should be employed. As for retrying writes, we should only do this on very specific error codes where we can be assured that the backend rejected the writes (403.3 for example).
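
To illustrate the retry rules discussed in this comment, here is a short, language-neutral Python sketch. It is not taken from the SDK or the backend: the function names are hypothetical, the two code sets contain only the examples mentioned above (410.1002 and 403.3), and the back-off parameters are placeholder values.

```python
# Errors that clearly indicate a cross-region retry would not help.
# Only the example from this comment is listed; the real set may be larger.
NO_CROSS_REGION_RETRY = {(410, 1002)}

# Errors where the backend is known to have rejected the write.
# Only the example from this comment is listed.
WRITE_RETRY_SAFE = {(403, 3)}


def should_retry_in_other_region(status: int, sub_status: int, is_write: bool) -> bool:
    """Retry by default, except for excluded errors and unsafe write retries."""
    if (status, sub_status) in NO_CROSS_REGION_RETRY:
        return False
    if is_write:
        # Writes are retried only when the rejection is certain.
        return (status, sub_status) in WRITE_RETRY_SAFE
    return True


def backoff_delays(base_ms: int = 100, factor: int = 2,
                   max_ms: int = 5000, attempts: int = 5) -> list:
    """Exponential back-off schedule in milliseconds, capped at max_ms."""
    return [min(max_ms, base_ms * factor ** i) for i in range(attempts)]
```

A production policy would likely add jitter to the delays; it is left out here so the schedule stays deterministic and easy to read.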

philipthomas-MSFT commented 1 year ago

Based on our discussion, @mikaelhoral-microsoft: does this apply to just the SDK, or also to Routing Gateway? Routing Gateway follows the same criteria as we do, looking for specific status/substatus codes.

mikaelhoral-microsoft commented 1 year ago

Applies equally to Routing Gateway

mikaelhoral-microsoft commented 1 year ago

As discussed, in "No per-partition automatic failover cases" let's point to https://msdata.visualstudio.com/CosmosDB/_workitems/edit/2475521

mikaelhoral-microsoft commented 1 year ago

We are tracking SDK backlog items here: https://msdata.visualstudio.com/CosmosDB/_workitems/edit/2484362. These are from PPAF testing conducted by Backend team.