Xabaril / AspNetCore.Diagnostics.HealthChecks

Enterprise HealthChecks for ASP.NET Core Diagnostics Package
Apache License 2.0

CosmosDB health check is healthy whilst the write calls timeout #333

DevZenFlow closed this issue 1 year ago

DevZenFlow commented 4 years ago

Hi,

We've been using https://github.com/Xabaril/AspNetCore.Diagnostics.HealthChecks/blob/master/src/HealthChecks.CosmosDb/CosmosDbHealthCheck.cs for a while and learnt yesterday that the health check remains green/healthy when write calls to the collection time out.

I don't know how Cosmos works behind the scenes, but a successful cosmosDbClient.ReadAccountAsync() call doesn't imply that we can successfully write into the collection. In our case the problem was that our pod could not write from a particular cluster, rather than a problem in Cosmos DB itself. Regardless, the health check should pick this up.

I wonder whether it would be safer to create a health check collection within Cosmos and then periodically write to, read from, and delete from that collection?

unaizorrilla commented 4 years ago

Hi @vikpck, thanks for filing this issue!

I'll try to review it ASAP!

DevZenFlow commented 4 years ago

I spent a few more days looking into this. It is an intermittent connectivity issue in which the client issues TCP retransmits when communicating with Cosmos DB. The retransmits are issued by the Linux kernel rather than the client SDK, and they really are intermittent. I could not reproduce them with the Cosmos health check, but I can reproduce them when writing data into collections, though only in a specific Azure network setup (North Europe) where the application is hosted in AKS and is on the same backbone as Cosmos DB.

Maybe it's worth leaving the existing health check as it is and developing an additional one that writes to, reads from, and deletes from a custom collection. That way consumers can choose which health check they wish to use. Of course, the underlying assumption is that if I can write into one collection, I can write into the other collections in the same DB instance, and therefore the probe is healthy. I can try putting together a pull request if you want.
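To make the proposal concrete, here is a minimal sketch of what such a write/read/delete probe might look like. It is not the library's implementation. It assumes the v3 Microsoft.Azure.Cosmos SDK, a dedicated probe container whose partition key path is /id, and the hypothetical names ProbeDocument and CosmosDbWriteHealthCheck.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical probe document. The lowercase "id" property matches the
// JSON field Cosmos DB requires, so no serializer attributes are needed.
public sealed class ProbeDocument
{
    public string id { get; set; }
}

// Hypothetical health check: writes, reads back, and deletes one small
// document in a dedicated container (assumed to be partitioned on /id).
public sealed class CosmosDbWriteHealthCheck : IHealthCheck
{
    private readonly Container _container;

    public CosmosDbWriteHealthCheck(Container container)
    {
        _container = container ?? throw new ArgumentNullException(nameof(container));
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var doc = new ProbeDocument { id = Guid.NewGuid().ToString("N") };
        var key = new PartitionKey(doc.id);

        try
        {
            // Each call throws on failure or cancellation, so a write
            // timeout surfaces as an unhealthy result below.
            await _container.CreateItemAsync(doc, key, cancellationToken: cancellationToken);
            await _container.ReadItemAsync<ProbeDocument>(doc.id, key, cancellationToken: cancellationToken);
            await _container.DeleteItemAsync<ProbeDocument>(doc.id, key, cancellationToken: cancellationToken);

            return HealthCheckResult.Healthy();
        }
        catch (Exception ex)
        {
            // Report the status configured for this registration.
            return new HealthCheckResult(context.Registration.FailureStatus, exception: ex);
        }
    }
}
```

One caveat with this sketch: if the probe fails between the write and the delete, the document is left behind, so a real implementation would probably want a time-to-live on the probe container or some other cleanup.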

unaizorrilla commented 4 years ago

Hi @vikpck, we'd accept a PR to improve the CosmosDb health check for sure. Reading the docs on this, Microsoft suggests:

To use the .NET SDK, use the DocumentClient.ReadDocumentCollectionAsync method, which returns a ResourceResponse that contains a number of usage properties such as CollectionSizeUsage, DatabaseUsage, DocumentUsage, and more.

https://docs.microsoft.com/en-us/azure/cosmos-db/monitor-cosmos-db
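For illustration, the call described in that quote might be used roughly as follows. This is only a sketch, assuming the older Microsoft.Azure.DocumentDB SDK (DocumentClient); the CollectionUsageProbe class and its parameters are hypothetical. Note that this is still a read, so on its own it would not necessarily catch the write timeouts described above.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

// Hypothetical helper around the call quoted from the Microsoft docs.
public static class CollectionUsageProbe
{
    public static async Task<long> ReadCollectionSizeUsageAsync(
        DocumentClient client, string databaseId, string collectionId)
    {
        Uri collectionUri = UriFactory.CreateDocumentCollectionUri(databaseId, collectionId);

        // PopulateQuotaInfo asks the service to include usage and quota
        // figures in the response.
        ResourceResponse<DocumentCollection> response =
            await client.ReadDocumentCollectionAsync(
                collectionUri,
                new RequestOptions { PopulateQuotaInfo = true });

        // One of the usage properties mentioned in the quote.
        return response.CollectionSizeUsage;
    }
}
```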

If you are able to send a PR on this, we'll be very happy!