Open AchoArnold opened 5 months ago
@AchoArnold - Can you explain what's the scenario for a 1ms timeout? Your test is configuring the client in a way that it will fail initialization.
The client needs to, on the first request (unless you are using CreateAndInitializeAsync), discover the account information and obtain metadata. These are not Data Plane operations and are not governed by the user's RequestTimeout (because this setting is mainly for data plane operations). The way you are configuring things, it will simply fail to initialize. The SDK will attempt to recover retrying for 65 seconds.
Hello @ealsur
Can you explain what's the scenario for a 1ms timeout? Your test is configuring the client in a way that it will fail initialization.
The problem is not the 1ms timeout. I set the 1ms timeout intentionally to expose the problem. The issue is that in our application the cosmosDB SDK sometimes takes more than 65 seconds to perform a DB operation, even though our configured timeout is less than 65 seconds.
The way you are configuring things, it will simply fail to initialize. The SDK will attempt to recover retrying for 65 seconds.
The big question is "Why?". If I set the timeout to n seconds, I expect it to be respected. I don't expect the SDK to overwrite my timeout with 65 seconds when there is a failure.
Also, looking through the code, it seems the cosmosDB SDK is intentionally overwriting the timeout here: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/b9b35bb92d5b0c075259a4d78287bff0f66c9861/Microsoft.Azure.Cosmos/src/HttpClient/CosmosHttpClient.cs#L16
These are not Data Plane operations and are not governed by the user's RequestTimeout
I actually did a more enhanced test where I separated data plane operations, e.g. PUT requests to the cosmosDB database, from the discovery GET requests that the NuGet package does internally, and the timeout still wasn't respected.
That's a fair question, thanks for clarifying.
There are two aspects to be considered:
There are two groups of operations on the SDK, metadata and data plane. Metadata operations are not related to user data, but to information that the SDK requires to route or execute data plane operations (obtaining the partition lists, discovering the account details, etc.). These are not governed by the RequestTimeout you set. They can have higher latencies, and a user configuration should not make them fail.
The request timeout is not respected for BOTH metadata and data plane operations. Both should fail if the request timeout is set to 1 millisecond, because an HTTP request cannot complete in 1 millisecond, but when I run the code it actually carries out the database operation successfully. I had to implement an HTTP server which waits forever, and only then does the cosmosDB SDK time out on its own, after 65 seconds.
RequestTimeout is not an End to End Timeout. You can have a RequestTimeout of 10 seconds and the operation can still take 2 minutes. The reason: Retries.
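The difference between a per-request timeout and an end-to-end timeout can be sketched without the SDK. The following is a minimal simulation under stated assumptions, not SDK code: `SendOnceAsync`, `SendWithRetriesAsync`, the attempt count, and the durations are all hypothetical. Each attempt is bounded by its own timeout, the retries add up, and only a token supplied by the caller bounds the whole operation.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutSketch
{
    // Hypothetical stand-in for one HTTP attempt against a server that never answers:
    // it only ends when the supplied token is cancelled.
    private static Task SendOnceAsync(CancellationToken token) =>
        Task.Delay(Timeout.InfiniteTimeSpan, token);

    // Tries up to maxAttempts times; each attempt gets its own perAttemptTimeout.
    // The caller's token is the only thing that bounds the whole operation.
    public static async Task<int> SendWithRetriesAsync(
        TimeSpan perAttemptTimeout, int maxAttempts, CancellationToken endToEnd)
    {
        int attempts = 0;
        while (attempts < maxAttempts)
        {
            attempts++;
            using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(endToEnd);
            attemptCts.CancelAfter(perAttemptTimeout);
            try
            {
                await SendOnceAsync(attemptCts.Token);
                return attempts; // success (never reached with this stand-in)
            }
            catch (OperationCanceledException) when (!endToEnd.IsCancellationRequested)
            {
                // Only this attempt timed out, so try again.
            }
        }
        return attempts;
    }

    public static async Task Main()
    {
        var perAttempt = TimeSpan.FromMilliseconds(100);

        // No end-to-end bound: 5 attempts of 100 ms each, so roughly 500 ms in total.
        var watch = Stopwatch.StartNew();
        int attempts = await SendWithRetriesAsync(perAttempt, 5, CancellationToken.None);
        Console.WriteLine($"{attempts} attempts took {watch.ElapsedMilliseconds} ms");

        // End-to-end bound of 250 ms: the caller's token stops the retry loop early.
        using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(250));
        try
        {
            await SendWithRetriesAsync(perAttempt, 5, cts.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Cancelled by the end-to-end token");
        }
    }
}
```

In this sketch the total duration grows with the number of attempts even though no single attempt exceeds its timeout, which is the distinction being drawn between RequestTimeout and an end-to-end bound.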
We're aware of using cancellation tokens. The issue here is about the request timeout: even for a single HTTP operation, the request timeout is not respected. This doesn't include retries; I mean for one single HTTP request. The request timeout is not respected for both metadata and data plane operations.
Investigating further, it turns out that for SDKs in Gateway mode, the RequestTimeout is only applied if the value is > 60 seconds.
Looking back at the history, this behavior comes from the initial V3 code, which in turn comes from the V2 SDK, so the reasoning might be to maintain consistency with the previous SDK.
The SDK however has a set of HttpRetryPolicies: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/docs/SdkDesign.md#http-retry-layer
They work over this RequestTimeout enforcing other latency guarantees. But the last retry is always 60 seconds to accommodate the behavior of Gateway, which might materialize failover failures with higher latency. If we allowed the user RequestTimeout to exit earlier, the client would never receive these signals that mark regions unavailable.
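If the Gateway mode behavior is as described in this thread, the timeout the HTTP client ends up with could be modeled as the larger of the user's value and a fixed floor. The snippet below is only an illustration of that reading, not the SDK's code: `EffectiveTimeout` and `gatewayFloor` are hypothetical names, and the 65 second floor is the value observed in this thread rather than a documented constant.

```csharp
using System;

public static class GatewayTimeoutSketch
{
    // Hypothetical helper: returns the timeout that would actually be used
    // if a user value at or below the floor is ignored.
    public static TimeSpan EffectiveTimeout(TimeSpan requestTimeout, TimeSpan gatewayFloor) =>
        requestTimeout > gatewayFloor ? requestTimeout : gatewayFloor;

    public static void Main()
    {
        // Assumption: 65 seconds, as observed in this thread.
        var floor = TimeSpan.FromSeconds(65);

        Console.WriteLine(EffectiveTimeout(TimeSpan.FromMilliseconds(1), floor)); // 00:01:05
        Console.WriteLine(EffectiveTimeout(TimeSpan.FromSeconds(90), floor));     // 00:01:30
    }
}
```

Under this model a 1 millisecond RequestTimeout has no visible effect, while a value above the floor is honored.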
Describe the bug
The CosmosClientOptions.RequestTimeout property of the cosmosDB SDK is not respected. Even when I set the timeout to something really small like 1 millisecond, my cosmosDB operations still succeed, but I expect them to fail.
To Reproduce
I created code here which you can use to reproduce the issue. I expected this code to throw an exception since the timeout is set to 1 millisecond, but it doesn't throw an exception when I do operations like `cosmosClient.GetDatabase("arnold-db");`. It also doesn't throw an exception when I do an insert query: `await container.CreateItemAsync(item);`
Expected behavior
I expected the cosmosDB SDK to throw an exception after the timeout which is set to 1 millisecond both on the
Actual behavior
The cosmosDB SDK performs all my queries without throwing any exception even though the queries go above the 1 millisecond timeout which I set.
Environment summary
SDK Version: 3.40.0 (latest)
OS Version: Windows, Linux
Additional context
To debug the issue, I pointed the cosmosDB SDK at a server with an infinite wait period, and I could see that the SDK is trying to use a 65000 millisecond timeout in the HTTP client.
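A debugging setup of this kind can be approximated with the .NET base class library alone. The sketch below is an assumption about how such a test might look, not the code used for this report: `HangingServerSketch` and `MeasureTimeoutAsync` are hypothetical names, and a plain `HttpClient` stands in for the SDK's HTTP client. A listener accepts TCP connections on a loopback port but never answers, so the only thing that ends the request is the client's own timeout.

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class HangingServerSketch
{
    // Sends one GET to a server that never responds and returns how long
    // the client waited before giving up.
    public static async Task<TimeSpan> MeasureTimeoutAsync(TimeSpan httpTimeout)
    {
        // Listen on a free loopback port; the OS queues incoming connections,
        // but nothing ever reads the request or writes a response.
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        int port = ((IPEndPoint)listener.LocalEndpoint).Port;

        using var client = new HttpClient { Timeout = httpTimeout };
        var watch = Stopwatch.StartNew();
        try
        {
            await client.GetAsync($"http://127.0.0.1:{port}/");
        }
        catch (TaskCanceledException)
        {
            // HttpClient reports its own timeout as a TaskCanceledException.
        }
        finally
        {
            listener.Stop();
        }
        return watch.Elapsed;
    }

    public static async Task Main()
    {
        TimeSpan elapsed = await MeasureTimeoutAsync(TimeSpan.FromSeconds(1));
        Console.WriteLine($"Request gave up after {elapsed.TotalMilliseconds:F0} ms");
    }
}
```

Measuring the elapsed time this way shows which timeout value the HTTP layer actually applies, independent of what was configured higher up.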