Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
741 stars 494 forks source link

CosmosClientOptions.Request is not respected. #4521

Open AchoArnold opened 5 months ago

AchoArnold commented 5 months ago

We are continuously addressing and improving the SDK, if possible, make sure the problem persist in the latest SDK version.

Describe the bug

The CosmosClientOptions.RequestTimeout property of the cosmosDB SDK is not respected. Even when I set the timeout to something really small like 1 millisecond, my cosmosDB operations are still successfully but I expect them to fail.

To Reproduce

I created code here which you can reproduce the issue. I expected this code to throw an exception since the timeout is set to 1 millisecond but it doesn't throw an exception when I do operations like cosmosClient.GetDatabase("arnold-db"); it also doesn't throw an exception when I do an insert query wait container.CreateItemAsync(item);

using Microsoft.Azure.Cosmos;
using Newtonsoft.Json;

namespace Experiments;

public static class RequestTimeoutTest
{
    private const string cosmosDBConnectionString = "";

    private record Item
    {
        [JsonProperty("id")]
        public string Id { get; set; } = string.Empty;

        [JsonProperty("pk")]
        public string PartitionKey { get; set; } = string.Empty;

        public string Name { get; set; } = string.Empty;
    }

    public static async Task Run()
    {
        var options = new CosmosClientOptions
        {
            ConnectionMode = ConnectionMode.Gateway,
            RequestTimeout = TimeSpan.FromMilliseconds(1),

            // Even if I remove this `HttpClientFactory` the SDK doesn't throw timeout exceptions
            HttpClientFactory = () => new HttpClient
            {
                Timeout = TimeSpan.FromMilliseconds(1)
            }
        };

        var cosmosClient = new CosmosClient(cosmosDBConnectionString, options);
        var database = cosmosClient.GetDatabase("arnold-db");

        var container = database.GetContainer("arnold-container");

        var item = new Item
        {
            PartitionKey = Guid.NewGuid().ToString(),
            Id = Guid.NewGuid().ToString(),
            Name = "Arnold"
        };

        await container.CreateItemAsync(item);
    }
}

Expected behavior A clear and concise description of what you expected to happen.

I expected the cosmosDB SDK to throw an exception after the timeout which is set to 1 millisecond both on the

Actual behavior

The cosmosDB SDK performs all my queries without throwing any exception even though the queries go above the 1 millisecond timeout which I set.

Environment summary SDK Version: 3.40.0 (latest) OS Version (e.g. Windows, Linux, MacOSX): Windows, Linux

Additional context Add any other context about the problem here (for example, complete stack traces or logs).

For debugging the issue, I changed the comosDB SDK to point to a server with an infinite wait period I could see that the SDK is trying to use a 65000 millisecond timeout in the http client.|

Unhandled exception. Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: 13295886-c458-46ac-bf9d-7107b7c07f28; Reason: (GatewayStoreClient Request Timeout. Start Time UTC:31-May-24 16:01:33; Total Duration:65025.016 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 13295886-c458-46ac-bf9d-7107b7c07f28;);
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Microsoft.Azure.Cosmos.CosmosHttpClientCore.ExecuteHttpHelperAsync(HttpRequestMessage requestMessage, ResourceType resourceType, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.CosmosHttpClientCore.SendHttpHelperAsync(Func`1 createRequestMessageAsync, ResourceType resourceType, HttpTimeoutPolicy timeoutPolicy, IClientSideRequestStatistics clientSideRequestStatistics, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.GatewayStoreClient.InvokeAsync(DocumentServiceRequest request, ResourceType resourceType, Uri physicalAddress, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.GatewayStoreModel.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Microsoft.Azure.Cosmos.ResponseMessage.EnsureSuccessStatusCode()
   at Microsoft.Azure.Cosmos.CosmosResponseFactoryCore.ProcessMessage[T](ResponseMessage responseMessage, Func`2 createResponse)
   at Microsoft.Azure.Cosmos.CosmosResponseFactoryCore.CreateItemResponse[T](ResponseMessage responseMessage)
   at Microsoft.Azure.Cosmos.ContainerCore.CreateItemAsync[T](T item, ITrace trace, Nullable`1 partitionKey, ItemRequestOptions requestOptions, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](String containerName, String databaseName, OperationType operationType, ITrace trace, Func`2 task, Func`2 openTelemetry, String operationName, RequestOptions requestOptions)
   at Microsoft.Azure.Cosmos.ClientContextCore.OperationHelperWithRootTraceAsync[TResult](String operationName, String containerName, String databaseName, OperationType operationType, RequestOptions requestOptions, Func`2 task, Func`2 openTelemetry, TraceComponent traceComponent, TraceLevel traceLevel)
   at Experiments.CosmosDb.Run(ServiceProvider serviceProvider) in C:\Users\arnoldacho\RiderProjects\TestApp\Experiments\CosmosDb.cs:line 51
   at Program.<Main>$(String[] args) in C:\Users\arnoldacho\RiderProjects\TestApp\Experiments\Program.cs:line 26
   at Program.<Main>(String[] args)
--- Cosmos Diagnostics ---{"Summary":{"GatewayCalls":{"(200, 0)":1,"(0, 0)":1}},"name":"CreateItemAsync","start datetime":"2024-05-31T16:01:32.993Z","duration in milliseconds":65689.9361,"data":{"Client Configuration":{"Client Created Time Utc":"2024-05-31T16:01:32.9239267Z","MachineId":"hashedMachineName:05f7e03b-03c7-35b5-a66a-afaf89d20c92","NumberOfClientsCreated":1,"NumberOfActiveClients":1,"ConnectionMode":"Gateway","User Agent":"cosmos-netstandard-sdk/3.40.0|1|X64|Microsoft Windows 10.0.22631|.NET 6.0.31|N|F 00000010|","ConnectionConfig":{"gw":"(cps:50, urto:0, p:False, httpf: True)","rntbd":"(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)","other":"(ed:False, be:False)"},"ConsistencyConfig":"(consistency: NotSet, prgns:[], apprgn: )","ProcessorCount":16}},"children":[{"name":"ItemSerialize","duration in milliseconds":96.5716},{"name":"Get PkValue From Stream","duration in milliseconds":514.0507,"children":[{"name":"Get Collection Cache","duration in milliseconds":194.0417,"children":[{"name":"Waiting for Initialization of client to complete","duration in milliseconds":192.9011}]},{"name":"Read Collection","duration in milliseconds":272.188,"data":{"Client Side Request Stats":{"Id":"AggregatedClientSideRequestStatistics","ContactedReplicas":[],"RegionsContacted":[],"FailedReplicas":[],"AddressResolutionStatistics":[],"StoreResponseStatistics":[],"HttpResponseStats":[{"StartTimeUTC":"2024-05-31T16:01:33.3251588Z","DurationInMs":230.072,"RequestUri":"https://[REDACTED].documents.azure.com/dbs/arnold-db/colls/arnold-container","ResourceType":"Collection","HttpMethod":"GET","ActivityId":"e937ed4e-108f-4b38-81d8-8d223f7cb1a3","StatusCode":"OK"}]}}}]},{"name":"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler","duration in milliseconds":65059.8002,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler","duration in milliseconds":65050.3937,"data":{"System Info":{"systemHistory":[{"dateUtc":"2024-05-31T16:01:43.2252194Z","cpu":2.125,"memory":34379288.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.1599,"availableThreads":32766,"minThreads":16,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-05-31T16:01:53.2405967Z","cpu":1.648,"memory":34486652.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.1172,"availableThreads":32766,"minThreads":16,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-05-31T16:02:03.2551958Z","cpu":2.720,"memory":34394436.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0473,"availableThreads":32766,"minThreads":16,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-05-31T16:02:13.2695657Z","cpu":2.857,"memory":34189348.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.206,"availableThreads":32766,"minThreads":16,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-05-31T16:02:23.2828256Z","cpu":1.404,"memory":34239112.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0577,"availableThreads":32766,"minThreads":16,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-05-31T16:02:33.2911877Z","cpu":1.855,"memory":34201192.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.06,"availableThreads":32766,"minThreads":16,"maxThreads":32767},"numberOfOpenTcpConnection":0}]}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TelemetryHandler","duration in milliseconds":65049.3471,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RetryHandler","duration in milliseconds":65047.9304,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RouterHandler","duration in milliseconds":65037.4461,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TransportHandler","duration in milliseconds":65036.4206,"children":[{"name":"Microsoft.Azure.Cosmos.GatewayStoreModel Transport Request","duration in milliseconds":65031.1886,"data":{"Client Side Request Stats":{"Id":"AggregatedClientSideRequestStatistics","ContactedReplicas":[],"RegionsContacted":[],"FailedReplicas":[],"AddressResolutionStatistics":[],"StoreResponseStatistics":[],"HttpResponseStats":[{"StartTimeUTC":"2024-05-31T16:01:33.6396298Z","DurationInMs":65021.2466,"RequestUri":"https://[REDACTED]-westeurope.documents.azure.com/dbs/arnold-db/colls/arnold-container/docs","ResourceType":"Document","HttpMethod":"POST","ActivityId":"13295886-c458-46ac-bf9d-7107b7c07f28","ExceptionType":"System.Threading.Tasks.TaskCanceledException","ExceptionMessage":"A task was canceled."}]}}}]}]}]}]}]}]},{"name":"Get Collection Cache","duration in milliseconds":0.0052}]}
ealsur commented 4 months ago

@AchoArnold - Can you explain what's the scenario for a 1ms timeout? Your test is configuring the client in a way that it will fail initialization.

The client needs to, on the first request (unless you are using CreateAndInitializeAsync), discover the account information and obtain metadata. These are not Data Plane operations and are not governed by the user's RequestTimeout (because this setting is mainly for data plane operations). The way you are configuring things, it will simply fail to initialize. The SDK will attempt to recover retrying for 65 seconds.

AchoArnold commented 4 months ago

Hello @ealsur

Can you explain what's the scenario for a 1ms timeout? Your test is configuring the client in a way that it will fail initialization.

The problem is not the 1ms timeout. I set the 1ms timeout intentionally to expose the problem. The issue is on our application sometimes cosmosDB SDK takes more than 65 seconds to perform a DB operation even though our timeout configured is less than 65 seconds.

The way you are configuring things, it will simply fail to initialize. The SDK will attempt to recover retrying for 65 seconds.

The big question is "Why?". If I do set the timeout as n seconds I expect it to be respected. I don't espect the SDK to overwrite my timeout with 65 seconds when there is a failure.

Also looking through the code, It seems the cosmosDB is intentionally overwriting the timeout here https://github.com/Azure/azure-cosmos-dotnet-v3/blob/b9b35bb92d5b0c075259a4d78287bff0f66c9861/Microsoft.Azure.Cosmos/src/HttpClient/CosmosHttpClient.cs#L16

These are not Data Plane operations and are not governed by the user's RequestTimeout

I actually did do a more ehanced test where I separated data plane operations, e.g PUT requests to the cosmosDB datbase from the discoverty GET requests that the nugget does internally and the timeout still wasn't respected.

ealsur commented 4 months ago

That's a fair question, thanks for clarifying.

There are two aspects to be considered:

  1. RequestTimeout is not End ot End Timeout. You can have RequestTimeout 10 seconds and the operation can take 2 minutes. The reason: Retries. There might be conditions on the service that signal the SDK to retry or do something different (for example, handling of a region failing over) or connectivity issues. If you want control over End to End, please use a CancellationToken.
    1. Please see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk-request-timeout?tabs=cpu-new#customize-the-timeout-on-the-azure-cosmos-db-net-sdk regarding the above.
    2. https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/docs/SdkDesign.md#cross-region-retries this document contains retry semantics.
  2. There are two groups of operations on the SDK, metadata and data plane. Metadata operations are not related to user data, but information that is required by the SDK to route or execute data plane operations (obtaining the partition lists, discovering the account details, etc). These are not governed by the RequestTimeout you set, They can have higher latencies and a user configuration should not make them fail.
AchoArnold commented 4 months ago

There are two groups of operations on the SDK, metadata and data plane. Metadata operations are not related to user data, but information that is required by the SDK to route or execute data plane operations (obtaining the partition lists, discovering the account details, etc). These are not governed by the RequestTimeout you set, They can have higher latencies and a user configuration should not make them fail.

The request timeout is not respected for BOTH metadata and data plane operations. Both operations should fail if the request timeout is set to 1millisecond because we cannot do an HTTP request in 1 millisecond but when I run the code, it actually carries out the database operation successfully. I had to implement an HTTP server which waits for ever and then the cosmosDB SDK will itself timeout after 65 seconds.

RequestTimeout is not End ot End Timeout. You can have RequestTimeout 10 seconds and the operation can take 2 minutes. The reason: Retries.

We're aware of using cancellation tokens. The issue here is about request timeout and for even 1 http operation. the request timeout is not respected. This doesn't include retires I mean for 1 single http request. the request timeout is not respected both for metadata and data plane operations.

ealsur commented 4 months ago

Investigating further, turns out that for SDKs on Gateway mode, the RequestTimeout is only applied if the value is > 60 seconds:

https://github.com/Azure/azure-cosmos-dotnet-v3/blob/68f9725e43f0f8e3a60479d92007fa525f112d16/Microsoft.Azure.Cosmos/src/HttpClient/CosmosHttpClientCore.cs#L235-L237

Looking back at the history, this behavior is coming from the initial code of V3 which is also coming from V2 SDK, so the reasoning might be to maintain consistency with the previous SDK.

The SDK however has a set of HttpRetryPolicies: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/docs/SdkDesign.md#http-retry-layer

They work over this RequestTimeout enforcing other latency guarantees. But the last retry is always 60 seconds to accommodate the behavior of Gateway, which might materialize failover failures with higher latency. If we allowed the user RequestTimeout to exit earlier, the client would never receive these signals that mark regions unavailable.