Feature request: enable cross-region retry for read requests while routing reads to the write region first (dynamic, not a fixed value)

Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API

MIT License


Feature request: enable cross-region retry for read requests while routing reads to the write region first (dynamic, not a fixed value) #4367

Open blankor1 opened 7 months ago

blankor1 commented 7 months ago

Is your feature request related to a problem? Please describe.

Currently, to enable cross-region retry during an outage, we need to set ApplicationRegion (or ApplicationPreferredRegions) in CosmosClientOptions. But once this is set and a failover happens, we have to update the property and restart the service so that read requests go by default to the same region as write requests (for consistency reasons). With a service-managed failover, read requests will keep going to the read region first, since we may not even be aware that the failover happened. This increases the chance that a write has not yet been synced to the read region and the read returns 404. There is no way to change this except by restarting the service.

Describe the solution you'd like

The SDK should, whether or not a setting (ApplicationRegion or ApplicationPreferredRegions) is configured, use other available regions where possible for the cases in which it relies on other regions for availability attempts. [ealsur - The scope is regarding 503/timeout cross-region retry, which today depends on Preferred Regions being set]

By default, even without ApplicationRegion or ApplicationPreferredRegions being configured, read requests should first be executed in the write region (dynamic, not a fixed value). If an outage happens, read requests should be able to retry on other available read regions instead of continuing to route to the write region.

Describe alternatives you've considered

Give users a configuration option that makes read requests always go to the write region first (dynamic, not a fixed value), even when the write region changes (due to failover or another reason), while still allowing read requests to retry across regions on 503 (or other applicable conditions) if an outage happens.
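The behavior requested above can be sketched as a small routing model. This is a hypothetical illustration in Python, not SDK code: `AccountInfo`, `read_order`, `read_with_fallback`, and the `send` callback are made-up names, and the region strings are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AccountInfo:
    """Hypothetical snapshot of the account's regions as reported by the service."""
    write_region: str
    read_regions: List[str]


def read_order(account: AccountInfo) -> List[str]:
    """Current write region first, then every other readable region."""
    others = [r for r in account.read_regions if r != account.write_region]
    return [account.write_region] + others


def read_with_fallback(account: AccountInfo, send: Callable[[str], int]) -> str:
    """Try regions in order and move on whenever a region answers 503.

    `send` is a stand-in for issuing the read; it returns an HTTP status code.
    The return value is the region that ended up serving the read.
    """
    for region in read_order(account):
        if send(region) != 503:
            return region
    raise RuntimeError("all regions returned 503")
```

Because the order is derived from the account snapshot on every call, a failover that changes `write_region` changes the first region tried without any client restart, which is the "dynamic, not a fixed value" part of the request.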

Additional context

ealsur commented 7 months ago

If an outage happens, the read request should be able to by default retry on the read region instead of keeping routing to the write region.

The SDK does this on outages. But "outage" is a generic word that can describe many scenarios that materialize in different ways.

Could you please share what is the definition in terms of errors/exceptions you are seeing that you are describing as outage?

blankor1 commented 7 months ago

@ealsur The SDK only does this if ApplicationRegion or ApplicationPreferredRegions is configured. But if this property is configured and a manual failover occurs, read requests will be executed in the read region, not the write region. That amplifies the chance that we cannot read our own writes.

Previously we did not configure this value, so during the write region outage our read availability was affected. We then added this configuration, but found that after a manual failover there are many more cases where we cannot read our own writes. We would really like an option for "first-read always be performed from the write region only, and at the same time the cross region read request retry is enabled".

ealsur commented 7 months ago

I'm still confused. Yes, there are some scenarios where the PreferredRegions are required and that is publicly documented: https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-sdk-availability#transient-connectivity-issues-on-tcp-protocol, but not all of them.

The SDK only does this if the ApplicationRegion or ApplicationPerferredRegion is configured.

That is not entirely correct. There are scenarios where the conditions will make the SDK mark regions on the local map as unavailable and skip them, like HttpRequestException handling, or a 403/1008 or 403/3 from the service and refresh the account routing map. If the region is indeed marked as unavailable during the outage, then it might not be returned on the account information and then it's not visible to the client.

if a manual failover occurs, the read requests will be executed in the read region not write region. And in this case, it amplifies the possibility of can't read your write.

In that case, assuming Session consistency, the service will return 404/1002 if the read region is not up to speed with the session token, and would make the SDK retry the read on the write region.
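The session fallback described here can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the SDK's implementation: the session token is reduced to a single integer LSN, and `read_in_region`, `session_read`, and the region names are hypothetical.

```python
from typing import Dict, List, Tuple

# Status/substatus pair mentioned in the thread for a lagging read region.
READ_SESSION_NOT_AVAILABLE = (404, 1002)


def read_in_region(region_lsn: int, session_lsn: int) -> Tuple[int, int]:
    """Return (status, substatus) for a session read against one region.

    Assumption: a region can serve the read only if it has replicated
    at least up to the LSN carried by the client's session token.
    """
    if region_lsn < session_lsn:
        return READ_SESSION_NOT_AVAILABLE
    return (200, 0)


def session_read(first_region: str, write_region: str,
                 replicated_lsn: Dict[str, int], session_lsn: int) -> List[str]:
    """Return the regions tried, in order, for one session read."""
    tried = [first_region]
    result = read_in_region(replicated_lsn[first_region], session_lsn)
    if result == READ_SESSION_NOT_AVAILABLE and first_region != write_region:
        # The read region is behind the session token: retry on the write region.
        tried.append(write_region)
    return tried
```

In this model a read that lands on a lagging read region costs one extra round trip to the write region rather than surfacing a 404 to the application.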

Previously we didn't configure this value so in the outage our read availablity is affected in the write region outage

Can you please share some error logs/exceptions so we can understand what happened during the outage? As I said previously, in a distributed system an outage can materialize in different ways and with different error conditions. Understanding which one affected you would help guide this feature request.

blankor1 commented 7 months ago

During the Australia East zone outage (Aug 30, 2023), before the service-managed failover took effect, we received many 408 and 503 status codes for a long time. There were also many HttpRequestException (SocketException (10060)) errors that kept retrying in the same region (the write region in outage). Please see this link for the metrics, where we hit very long response times for every single request: https://github.com/Azure/azure-cosmos-dotnet-v3/issues/4006#issuecomment-1718996070. I tried to search for the detailed logs, but they can no longer be found since too much time has passed.

At that time, we had not configured our applications with ApplicationRegion or ApplicationPreferredRegions. Judging from our applications' metrics, our read API availability was also affected.

That is not entirely correct. There are scenarios where the conditions will make the SDK mark regions on the local map as unavailable and skip them, like HttpRequestException handling, or a 403/1008 or 403/3 from the service and refresh the account routing map. If the region is indeed marked as unavailable during the outage, then it might not be returned on the account information and then it's not visible to the client.

Not sure I follow. Do you mean that even if we don't configure ApplicationRegion or ApplicationPreferredRegions, we will still have cross-region retry for read requests during the outage? I know that for 403/3 the SDK will retry on another available write region, since write requests can only be executed in the write region, but for read requests that is not what we observed in the outage. They seemed stuck in the write region without any cross-region retry.

What we basically want is a configurable option so that "first-read always be performed from the write region only, and at the same time the cross region read request retry is enabled", which is similar to a mix of these two: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-sdk-availability#:~:text=Expand%20table-,Account%20type,Primary%20region,-Note

ealsur commented 7 months ago

Let me try to take some of the things apart:

408s - If you are getting a 408 out of the SDK, it means a Write receive timeout (as far as I know), and these are not retried: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-sdk-availability#transient-connectivity-issues-on-tcp-protocol
503s - These are the ones our public documentation calls out as only being retried if Preferred Regions are specified. If the feature request is for these, then we can scope the request to them for analysis.
HttpRequestException - The SDK in this case refreshes the account information. In the outage you mention, the problem was that the account information still listed the region in outage as a valid region to use. This is what I mean when I say that not all outages are the same. The SDK is driven by the account information; ideally, once the outage is identified, the service reacts by no longer including that region. This is being constantly improved by the service team.
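The three cases in this comment can be restated as a lookup. This is only a summary of the comment's wording in Python, not SDK code; the function name `handling_per_comment` and the returned strings are hypothetical labels.

```python
def handling_per_comment(condition: str, preferred_regions_set: bool) -> str:
    """Summarize how each condition is handled, as described in the comment above."""
    if condition == "408":
        # Write receive timeout surfaced to the application.
        return "not retried"
    if condition == "503":
        # Cross-region retry is documented as depending on Preferred Regions.
        if preferred_regions_set:
            return "cross-region retry"
        return "no cross-region retry"
    if condition == "HttpRequestException":
        # Routing then depends on which regions the refreshed information lists.
        return "refresh account information"
    raise ValueError(f"condition not covered in the comment: {condition}")
```

Only the 503 case depends on the Preferred Regions setting, which is why the comment proposes scoping the feature request to it.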

Do you mean that even we don't configure ApplicationRegion or ApplicationPreferrdRegion, we will still have cross-region retry for read request in the outage?

As I mentioned, this varies from outage to outage, which is why I wanted to understand it a bit better. What I am saying is that there are scenarios where the SDK does not require Preferred Regions. 403/3 can happen during a write region transition; during that time, the service also returns, as part of the account information, the second region that the write endpoint is transitioning to, and the SDK will use it without you having to specify any preference. 403/1008 is similar.

What we basically want is a configurable option that "first-read always be performed from the write region only, and at the same time the cross region read request retry is enabled" which is similar to a mix of this two

If the ask is a configurable option, wouldn't that option be the Preferred Regions? Or is the ask that without any setting, the retry happens on 503s on the next available region for read operations?

blankor1 commented 7 months ago

ApplicationPreferredRegions can't meet the requirement that "first-read always be performed from the write region only, and at the same time the cross region read request retry is enabled". Take what we hit as an example: after a manual failover, the previous write region becomes a read region, but read requests are still executed first in that (now read) region, not in the current write region.

For 408s and HttpRequestException, can you explain a bit more? When ApplicationRegion or ApplicationPreferredRegions is not set:

The read request 408s will be retried, but will that include cross-region retry if the retries on the primary region keep failing?
HttpRequestException: if the SDK identifies that the region has an outage, the service reacts by not including that region. Will the service then retry it on a region other than the primary region? But this seems to conflict with what you previously mentioned: https://github.com/Azure/azure-cosmos-dotnet-v3/issues/4006#:~:text=For%20initialization%2C%20there,CosmosClientOptions.ApplicationRegion.

ealsur commented 7 months ago

first-read always be performed from the write region only, and at the same time the cross region read request retry is enabled

Basically what you are asking is what I mentioned that without any setting, the retry happens on 503s on the next available region for read operations. Adding a setting for this behavior sounds confusing for users? But if that is the ask, the team can certainly review it (I'm mainly trying to get clarity from this thread).

The read requests 408 will be retried, but will it contains cross region retry if the retry on the primary region keeps failing?

408s are not retried. They refer to Write Receive timeouts. I mentioned it on my previous comment.

HttpRequestException: if the SDK identified the region has outage, the service react by not including that region. Will service then retry it on other region than the primary region? But this seems conflict with what you previously mentioned

The SDK does not identify outages. It can identify conditions that might mean that either there is an outage or the client machine is having connectivity issues. There is no way for the SDK to know about outages. There are conditions (server responses like 403/3 or 403/1008) and network conditions (timeouts, DNS resolution failures). HttpRequestException means the client cannot resolve or cannot connect to the Cosmos DB Gateway endpoint. In some scenarios, when there really is an outage and it is identified by the service, the Gateway will remove the region in outage from the account information. When the SDK refreshes the account information, it will see that the region is no longer part of the account and will not use it; the SDK can only route requests to regions that are part of the account information.
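The last point, that routing is limited to regions listed in the account information, can be illustrated with a small filter. This is a hypothetical sketch, not the SDK's routing code; `effective_read_regions` and the region names are made up for the example.

```python
from typing import List


def effective_read_regions(preferred: List[str],
                           account_read_regions: List[str]) -> List[str]:
    """Keep the user's preferred order, dropping regions the account no longer lists."""
    available = set(account_read_regions)
    return [region for region in preferred if region in available]


# Before the service reacts, the region in outage is still listed and still used.
before_refresh = effective_read_regions(
    ["Australia East", "Australia Southeast"],
    ["Australia East", "Australia Southeast"],
)

# After a refresh in which the service has removed the region, it is skipped.
after_refresh = effective_read_regions(
    ["Australia East", "Australia Southeast"],
    ["Australia Southeast"],
)
```

The client-side list only changes when the refreshed account information changes, so in this model an outage that the service has not yet reflected in the account information leaves the routing untouched.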

I want to reiterate the point here: the SDK does not know about outages. In fact, what you described could also mean that HTTP and TCP connectivity on a machine is fully cut off, and that would not be an outage. The SDK drives routing decisions based on what the service tells it (account information) and the user preference.

What I want to try and take out of this thread and all the discussion is that your ask is a way for the SDK to, regardless of being a setting or not, use other regions if possible even if the preferred regions are not set, for cases when it leverages the preferred regions for availability attempts. Would that be the case?

ealsur commented 7 months ago

This Issue might also be related https://github.com/Azure/azure-cosmos-dotnet-v3/issues/3906

blankor1 commented 7 months ago

Thank you very much for the detailed explanation!

What I want to try and take out of this thread and all the discussion is that your ask is a way for the SDK to, regardless of being a setting or not, use other regions if possible even if the preferred regions are not set, for cases when it leverages the preferred regions for availability attempts. Would that be the case?

Yes. This would be great! What we want is a way to achieve "reads always go to the write region first; even if a manual failover happens, read requests still go to the new write region first; and at the same time reads can leverage the preferred regions (or any available regions) for availability retry attempts". As the client already knows every available region, it would be really great (and maybe more intuitive?) if the SDK could automatically make availability retry attempts on those regions for higher availability.

408s are not retried. They refer to Write Receive timeouts. I mentioned it on my previous comment.

408s are not retried even for read requests? But according to this doc, the SDK will retry read 408s and 503s, as they are idempotent? https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/conceptual-resilient-sdk-applications#:~:text=For%20read%20operations%2C%20the%20SDKs%20will%20retry%20any%20timeout%20or%20connectivity%20related%20error.

ealsur commented 7 months ago

But I see this doc

The doc there is generic and not specific to any language. 408s are related to timeouts, and the .NET SDK retries timeouts; in .NET they materialize as TaskCanceledException for HTTP, or TCP timeouts (IO exceptions) on Direct mode requests. These are retried for read requests.

The difference in .NET is that the 408 is thrown to the application on a receive timeout for a Write operation. We cannot throw the IO exception, so it gets converted to a 408.

I don't want to edit your comment at the top, but could you add the summarization there of the discussion?

blankor1 commented 7 months ago

Sorry for the late response. I have updated the first comment. Please review it and let me know if more context is needed.

blankor1 commented 5 months ago

Gentle ping. Is there any update on this feature request?

ealsur commented 5 months ago

@blankor1 Team planning takes feature requests into account. At the moment, the team is already engaged in deliverables, and we can account for new feature requests in the next semester. This issue was created in April; the team should take it into consideration for the next planning cycle.

blankor1 commented 2 months ago

Hi, may I ask about the progress of this feature request? Will it be planned and implemented this semester?

For now, there is no way for us to dynamically detect a region failover at runtime and change the ApplicationRegion without downtime, which makes "Change write region" a risky operation for us.

blankor1 commented 2 months ago

@ealsur Any update on this? Thanks!