Xabaril / AspNetCore.Diagnostics.HealthChecks

Enterprise HealthChecks for ASP.NET Core Diagnostics Package
Apache License 2.0

Too frequent requests to the health check endpoints "/health/live" and "/health/ready" can intermittently cause unhealthy error logs #829

Closed: terryhuangbo closed this issue 1 year ago

terryhuangbo commented 3 years ago

Please, fill the following sections to help us fix the issue

What happened: When I use AspNetCore.HealthChecks.AzureServiceBus 5.1.1, AspNetCore.HealthChecks.AzureStorage 5.0.1, and AspNetCore.HealthChecks.AzureKeyVault 5.0.1 for health checking, requests to the health check endpoints "/health/live" and "/health/ready" return healthy results, but too frequent requests can intermittently cause unhealthy error logs like the following:

Health check AzureServiceBus-High completed after 5001.3016ms with status Unhealthy and description '(null)'
System.Threading.Tasks.TaskCanceledException: The operation was canceled.
   at Azure.Core.CancellationHelper.ThrowOperationCanceledException(Exception innerException, CancellationToken cancellationToken)
   at Azure.Core.CancellationHelper.ThrowIfCancellationRequested(CancellationToken cancellationToken)
   at Azure.Core.Pipeline.ResponseBodyPolicy.ThrowIfCancellationRequestedOrTimeout(CancellationToken originalToken, CancellationToken timeoutToken, Exception inner, TimeSpan timeout)
   at Azure.Core.Pipeline.ResponseBodyPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.RedirectPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.HttpPipeline.SendRequestAsync(Request request, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Administration.HttpRequestAndResponse.SendHttpRequestAsync(Request request, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Administration.HttpRequestAndResponse.GetEntityAsync(String entityPath, String query, Boolean enrich, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Administration.ServiceBusAdministrationClient.GetQueueRuntimePropertiesAsync(String name, CancellationToken cancellationToken)
   at HealthChecks.AzureServiceBus.AzureServiceBusQueueHealthCheck.CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken)

What you expected to happen: No error logs occur.

How to reproduce it (as minimally and precisely as possible): As described above. Source code sample:

services.AddHealthChecks()
    .AddAzureBlobStorage(
        configuration["EmailTemplateBlobConnectionStr"],
        name: "AzureStorage",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag })
    .AddAzureServiceBusQueue(
        name: "AzureServiceBus-High",
        connectionString: configuration["EmailDeliveryServiceBusConnection"],
        queueName: "emaildelivery-high",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag })
    .AddAzureServiceBusQueue(
        name: "AzureServiceBus-Normal",
        connectionString: configuration["EmailDeliveryServiceBusConnection"],
        queueName: "emaildelivery-normal",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag })
    .AddAzureServiceBusQueue(
        name: "AzureServiceBus-Low",
        connectionString: configuration["EmailDeliveryServiceBusConnection"],
        queueName: "emaildelivery-low",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag })
    .AddAzureKeyVault(
        name: "AzureKeyVault",
        keyVaultServiceUri: new Uri($"https://{configuration["KeyVaultName"]}.vault.azure.net/"),
        credential: new DefaultAzureCredential(),
        setup: options => { options.AddSecret(configuration["KeyVaultSecrectName"]); },
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { HealthCheckConstants.ReadinessTag });

Anything else we need to know?: Nothing

Environment:

(Screenshot: WeChat Screenshot_20210903155428)

thisispaulsmith commented 2 years ago

I'm also seeing this. Any solution?

unaizorrilla commented 2 years ago

Hi

I can't reproduce the issue! Looking at the call stack, this is an error on the retry in the Service Bus connection, i.e. the check is working and reports a problem after 5 seconds of retries.

thisispaulsmith commented 2 years ago

We don't see any of these errors under normal use, only on the health check endpoints. It's now happening so often that we're going to disable this specific health check.

Meandron commented 2 years ago

We're seeing this issue as well, since we run regular liveness and readiness checks against the Service Bus, using AddAzureServiceBusQueue() to register the actual check. If you need more information, please let me know.

Meandron commented 2 years ago

When I run the following code from a local console app, no exception is raised. The code more or less corresponds to the one in the AzureServiceBusQueueHealthCheck:

static async System.Threading.Tasks.Task Main(string[] args)
{
    try
    {
        while (true)
        {
            ServiceBusAdministrationClient managementClient = new ServiceBusAdministrationClient(MYENDPOINT);
            var x = await managementClient.GetQueueRuntimePropertiesAsync(MYQUEUE, CancellationToken.None);
            Thread.Sleep(5000);
        }
    }
    catch (Exception ex)
    {
        throw ex;
    }
}

I took this code as a custom implementation, added it to my .NET Core application, and deployed it to k8s, where a liveness probe with a period of 30s is set. It raised exceptions like @terryhuangbo described.

The infrastructure being checked was the same, i.e. the connection string pointed to the same Service Bus queue in the cloud.

@unaizorrilla: Could it be that there is a CancellationToken passed / present on the live system / .NET Core when performing the check that somehow causes the issue? We're stuck... we have no option other than disabling the liveness probe, as the pods get into a recreation loop while the probe keeps failing.

Meandron commented 2 years ago

It turned out that increasing the liveness probe's timeout (1 s -> 20 s) makes it better, but the error still shows up from time to time...
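A possible explanation that fits this observation: the health check middleware cancels the running checks when the probe's HTTP request is aborted, so an aggressive probe timeout can cancel the Azure management call mid-flight and surface exactly this TaskCanceledException. Below is a minimal local sketch of that effect - hypothetical code, not the library's implementation; the connection string and queue name are placeholders, and the 1-second budget mimics a short probe timeout:

// Hypothetical repro: cancel the management call after 1 second, similar to
// what happens when a probe with a short timeout aborts the HTTP request and
// the health check's cancellation token is cancelled.
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus.Administration;

class CancelRepro
{
    const string ConnectionString = "<service bus connection string>";
    const string QueueName = "<queue name>";

    static async Task Main()
    {
        var client = new ServiceBusAdministrationClient(ConnectionString);
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1));
        try
        {
            // A slow response (cold connection, throttling) now surfaces as a
            // TaskCanceledException instead of a healthy result.
            var props = await client.GetQueueRuntimePropertiesAsync(QueueName, cts.Token);
            Console.WriteLine($"Healthy, active messages: {props.Value.ActiveMessageCount}");
        }
        catch (OperationCanceledException ex)
        {
            Console.WriteLine($"Cancelled after ~1s: {ex.Message}");
        }
    }
}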

Meandron commented 2 years ago

while (true)
{
    ServiceBusAdministrationClient managementClient = new ServiceBusAdministrationClient(MYCON);
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var x = await managementClient.GetQueueRuntimePropertiesAsync(MYQUEUE, CancellationToken.None);
    stopwatch.Stop();
    Console.WriteLine("Elapsed Time is {0} ms", stopwatch.ElapsedMilliseconds);
    Thread.Sleep(1000);
}

Executed locally against the cloud infrastructure. The results are quite surprising:

Elapsed Time is 90 ms
Elapsed Time is 54 ms
Elapsed Time is 39 ms
Elapsed Time is 147 ms
Elapsed Time is 289 ms
Elapsed Time is 46 ms
Elapsed Time is 541 ms
Elapsed Time is 42 ms
Elapsed Time is 234 ms
Elapsed Time is 1059 ms
Elapsed Time is 47 ms
Elapsed Time is 215 ms
Elapsed Time is 1514 ms
Elapsed Time is 5314 ms
Elapsed Time is 1442 ms

thisispaulsmith commented 2 years ago

We've had to remove this health check. Will watch this thread for updates. Even increasing the timeout didn't help.

molinch commented 2 years ago

Is that issue still a thing @thisispaulsmith @Meandron ?

Meandron commented 2 years ago

Yes, it is still there. We worked around it by creating our own HealthCheck class, which does pretty much the same but only logs warnings, as otherwise our logs get spammed with "false" error logs.
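For anyone looking for the same workaround, here is a minimal sketch of such a wrapper. This is not the package's implementation; the class name and the mapping of Unhealthy to Degraded are assumptions:

// Hypothetical wrapper: runs an existing check and downgrades Unhealthy to
// Degraded so transient failures are reported as warnings rather than errors.
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class WarnOnlyHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _inner;

    public WarnOnlyHealthCheck(IHealthCheck inner) => _inner = inner;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var result = await _inner.CheckHealthAsync(context, cancellationToken);
        return result.Status == HealthStatus.Unhealthy
            ? HealthCheckResult.Degraded(result.Description, result.Exception, result.Data)
            : result;
    }
}

It can be registered with services.AddHealthChecks().AddCheck("AzureServiceBus-High", new WarnOnlyHealthCheck(innerCheck), tags: ...), where innerCheck is an instance of the original check that you construct yourself.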

We receive this kind of error log for other health check types as well, such as BlobStorage: (screenshot: BlobStorage)

(Notice the completion time)

ferryferry commented 2 years ago

I'm experiencing the same thing... It almost seems like we are getting throttled on the management client calls into Azure when multiple replicas check the same Service Bus.

By default you have 1,000 credits; every management call (which is what this health check does) costs 10 credits. If you have a bunch of replicas, this can add up to throttling. See: https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-throttling#what-are-the-credit-limits

When using the premium Service Bus tier, there are no issues...
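If throttling really is the cause, one way to cut down the number of management calls is to keep the dependency checks out of the frequently polled liveness endpoint and expose them only on the readiness endpoint. A minimal sketch, assuming minimal hosting (top-level `app`), the paths from this issue, and a "ready" tag that matches whatever tag the checks were registered with:

// Sketch: /health/live answers without hitting external dependencies,
// while /health/ready runs only the checks tagged "ready".
using Microsoft.AspNetCore.Diagnostics.HealthChecks;

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false                      // no registered checks, just "is the process up"
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = r => r.Tags.Contains("ready")   // tag-filtered dependency checks
});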

ddegasperi commented 1 year ago

I have the same issue with the NpgSqlHealthCheck:

Health check Database with status Unhealthy completed after 18.4543ms with message '(null)'
System.Threading.Tasks.TaskCanceledException:
   at Npgsql.TaskExtensions+<WithCancellation>d__1`1.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.TaskExtensions+<WithTimeout>d__0`1.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.Internal.NpgsqlConnector+<ConnectAsync>d__198.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.Internal.NpgsqlConnector+<RawOpen>d__196.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.Internal.NpgsqlConnector+<<Open>g__OpenCore|191_1>d.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.Internal.NpgsqlConnector+<Open>d__191.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.UnpooledConnectorSource+<Get>d__6.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Npgsql.NpgsqlConnection+<<Open>g__OpenAsync|45_0>d.MoveNext (Npgsql, Version=6.0.6.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at HealthChecks.NpgSql.NpgSqlHealthCheck+<CheckHealthAsync>d__4.MoveNext (HealthChecks.NpgSql, Version=6.0.2.0, Culture=neutral, PublicKeyToken=null: /_/src/HealthChecks.NpgSql/NpgSqlHealthCheck.cs:42)

sungam3r commented 1 year ago

As @ferryferry said, the core issue is (likely) rate limiting on the server side, so there is nothing to fix on the caller side - you can only choose appropriate intervals for the checks or use a premium Service Bus.
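If the probe interval itself cannot be lowered, another option is a wrapper that reuses the last result for a minimum interval, so frequent probe requests don't each trigger a management call. A sketch only, not part of the package; the class name and interval handling are assumptions:

// Hypothetical rate-limiting wrapper: the inner check only runs again once the
// minimum interval has elapsed; otherwise the cached result is returned.
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class CachedHealthCheck : IHealthCheck
{
    private readonly IHealthCheck _inner;
    private readonly TimeSpan _minInterval;
    private readonly SemaphoreSlim _gate = new(1, 1);
    private HealthCheckResult _lastResult = HealthCheckResult.Healthy("Not checked yet");
    private DateTimeOffset _lastRun = DateTimeOffset.MinValue;

    public CachedHealthCheck(IHealthCheck inner, TimeSpan minInterval)
    {
        _inner = inner;
        _minInterval = minInterval;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        await _gate.WaitAsync(cancellationToken);
        try
        {
            if (DateTimeOffset.UtcNow - _lastRun >= _minInterval)
            {
                _lastResult = await _inner.CheckHealthAsync(context, cancellationToken);
                _lastRun = DateTimeOffset.UtcNow;
            }
            return _lastResult;
        }
        finally
        {
            _gate.Release();
        }
    }
}

Registered via AddCheck("AzureServiceBus-High", new CachedHealthCheck(innerCheck, TimeSpan.FromSeconds(30))), frequent probes would then hit the cached result instead of the Service Bus management API.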

@ddegasperi Let's track npgsql in another issue.

ddegasperi commented 1 year ago

@ddegasperi Let's track npgsql in another issue.

@sungam3r ok, I've created a new issue #1611

isaevdan commented 1 year ago

@sungam3r I have the same issue with the health check and checked the logs for throttled requests - there are none. Also, the limit is 1,000 requests per second (where management requests count as 10 and regular reads and writes as 1), so I highly doubt the issue is related to throttling.

robinmanuelthiel commented 8 months ago

Anyone was ever able to find a solution for this?

rfink commented 6 months ago

I'm having this issue with Npgsql as well. .NET 8.0.2.

TAC911 commented 2 months ago

Same issue with version 8.0.1 for Azure Service Bus, Npgsql as well as Redis (timeout)