Azure / azure-cosmos-dotnet-v2

Contains samples and utilities relating to the Azure Cosmos DB .NET SDK
MIT License

Multiple GoneException with "Service is currently unavailable", ForbiddenException with "Authorization is not valid" #623

Open · zuckerthoben opened this issue 6 years ago

zuckerthoben commented 6 years ago

Describe the bug In our production environment in Azure we face multiple occurrences of "GoneException: Service is currently unavailable". Sometimes another exception (ForbiddenException: Authorization token is not valid at the current time) occurs at roughly the same time. See the stack traces below for some occurrences.

I have read the other GitHub issues about the GoneException; they are quite old and their comments suggest it "should be fixed by now" (#194).

When this exception occurs, our customers are not able to use our product, so this is of high importance to us.

To Reproduce There are no known steps to consistently reproduce this issue. It just occurs irregularly in our Azure production environment for short periods.

Expected behavior This behavior could have a lot of causes. It could be the Cosmos DB infrastructure, this SDK, the Azure infrastructure, .NET Core or something else I haven't thought of. Therefore it's hard to tell what should have happened. The first expected behavior would of course be that this exception never occurs. The second would be that if the exception does occur, it contains enough information for a user to identify the root of the problem.

Actual behavior The mentioned exception occurs, and its message together with the information available in the Azure portal is not sufficient to track down the cause (at least as a consumer).

Environment summary
SDK Version: 2.1.0 (.NET Core 2.1.3), TCP & Direct mode
OS Version (e.g. Windows, Linux, MacOSX): Azure App Service
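For reference, a minimal sketch of the connection setup described above (Direct mode over TCP) with the v2 SDK; the endpoint and key are placeholders, not values from this report:

```csharp
using System;
using Microsoft.Azure.Documents.Client;

public static class CosmosClientFactory
{
    // Connection setup matching the report: Direct connectivity mode over TCP.
    public static DocumentClient CreateDirectTcpClient(Uri accountEndpoint, string authKey)
    {
        var policy = new ConnectionPolicy
        {
            ConnectionMode = ConnectionMode.Direct, // data operations go straight to backend replicas
            ConnectionProtocol = Protocol.Tcp       // TCP transport instead of HTTPS
        };

        return new DocumentClient(accountEndpoint, authKey, policy);
    }
}
```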

Additional context This week we have seen 21 GoneExceptions and 18 ForbiddenExceptions in Application Insights. We haven't changed our implementation recently. We also hadn't seen the ForbiddenExceptions before, and they always occur close in time to a GoneException (but not always the other way round).

Here are the stacktraces: https://gist.github.com/zuckerthoben/95c508db09e9b2ff5fad88afff3849b8
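In case it helps with correlating these against the portal or a support case, here is a minimal, hypothetical sketch of wrapping an SDK call so that the status code, activity ID and retry-after hint of a DocumentClientException are captured alongside the stack trace; the method name and logging target are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class CosmosDiagnostics
{
    // Wraps a single SDK call and surfaces the fields that are useful when
    // filing a support case: status code, activity ID, retry-after hint.
    public static async Task<ResourceResponse<Document>> ReadWithDiagnosticsAsync(
        DocumentClient client, Uri documentUri, RequestOptions options = null)
    {
        try
        {
            return await client.ReadDocumentAsync(documentUri, options);
        }
        catch (DocumentClientException ex)
        {
            // Replace Console with the telemetry sink of your choice (e.g. Application Insights).
            Console.WriteLine(
                $"Cosmos DB call failed: Status={ex.StatusCode}, " +
                $"ActivityId={ex.ActivityId}, RetryAfter={ex.RetryAfter}, " +
                $"Message={ex.Message}");
            throw;
        }
    }
}
```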

kirillg commented 6 years ago

@zuckerthoben, for server-side issues like this affecting your service, please file an Azure support case. To investigate this, we will need your account, database, collection, the exact timeline of when this happened, etc. Support is a better channel for this than GitHub.

zuckerthoben commented 6 years ago

@kirillg Yes, we will do that as well. But the thing is that the server side does not show us any errors. All performance and stability indicators for the Cosmos DB are at their best values. My expectation would be that if the service throws these errors, the indicators would reflect that. Because they do not reflect our outages at all, we are considering that it may not be the server side but some other source creating this problem.

jkonecki commented 6 years ago

I'm experiencing a similar issue with the local emulator. My current suspicion is that my issue is related to the Direct / TCP connection policy. I've noticed that you're using the same policy. Please consider changing it to the default Gateway / HTTP. Hope it helps you in any way.

IGx89 commented 5 years ago

Was the root cause ever figured out here? We started seeing intermittent errors ourselves in production in early November, and now consistent errors in QA (a completely separate environment from prod). We've reproduced it with both the 1.22 client and the 2.2.1 client. We're also using Direct / TCP, plus a partitioned collection.

Example activity IDs:

Our current plan, assuming no further information, is to switch to Gateway / HTTP.

jkonecki commented 5 years ago

@IGx89 The problem with the local emulator was related to the fact that the Cosmos DB emulator generates a new SSL cert on start. The old cert had accidentally been added to the machine store and was being picked up. This unfortunately won't help you with production.

Switching to Gateway / HTTPS should help without adding too much latency.
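For completeness, a minimal sketch of the suggested switch (same placeholders as above); in Gateway mode all requests go through the account's HTTPS gateway endpoint instead of connecting directly to backend replicas:

```csharp
using System;
using Microsoft.Azure.Documents.Client;

public static class CosmosClientFactory
{
    // The workaround suggested above: route requests through the HTTPS gateway.
    public static DocumentClient CreateGatewayClient(Uri accountEndpoint, string authKey)
    {
        var policy = new ConnectionPolicy
        {
            ConnectionMode = ConnectionMode.Gateway,
            ConnectionProtocol = Protocol.Https
        };

        return new DocumentClient(accountEndpoint, authKey, policy);
    }
}
```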

IGx89 commented 5 years ago

FYI, we tracked down the root cause of our issue: our VNet was configured to use DNS servers that weren't accessible, and while Azure automatically falls back to Azure DNS servers in that situation, the fallback takes 7 seconds and thus trips up certain time-sensitive networking scenarios (like Direct/TCP DocumentDB connections).
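A quick way to check for this symptom (a generic diagnostic sketch, not something from the thread) is to time a DNS resolution of the account hostname from inside the affected network; resolutions in the multi-second range point at unreachable custom DNS servers on the VNet:

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Threading.Tasks;

public static class DnsCheck
{
    // Times a DNS lookup for the Cosmos DB account hostname. A result in the
    // multi-second range suggests the VNet is falling back from unreachable
    // custom DNS servers, which can break time-sensitive Direct/TCP connections.
    public static async Task MeasureResolutionAsync(string hostName)
    {
        var stopwatch = Stopwatch.StartNew();
        IPAddress[] addresses = await Dns.GetHostAddressesAsync(hostName);
        stopwatch.Stop();

        Console.WriteLine(
            $"Resolved {hostName} to {addresses.Length} address(es) " +
            $"in {stopwatch.ElapsedMilliseconds} ms");
    }
}
```

For example, run `await DnsCheck.MeasureResolutionAsync("<your-account>.documents.azure.com");` from the affected App Service or VM (the hostname here is a placeholder).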