Xabaril / AspNetCore.Diagnostics.HealthChecks

Enterprise HealthChecks for ASP.NET Core Diagnostics Package
Apache License 2.0
4.05k stars 794 forks source link

Odd extreme health check timeouts with SQL Server provider #1844

Open nullpainter opened 1 year ago

nullpainter commented 1 year ago

I am experiencing bizarre timeouts with the SQL Server provider, and cannot work out why. For example:

Health check "sqlserver" with status Unhealthy completed after 47.8154ms with message 'null'
Health check "sqlserver" with status Unhealthy completed after 711044.8055ms with message 'null'
Health check "sqlserver" with status Unhealthy completed after 34.8508ms with message 'null'
Health check "sqlserver" with status Unhealthy completed after 659491.1559ms with message 'null'
Health check "sqlserver" with status Unhealthy completed after 44.4924ms with message 'null'
Health check "sqlserver" with status Unhealthy completed after 34.4111ms with message 'null'

I am using AspNetCore.HealthChecks.SqlServer version 6.0.2 and am calling AddSqlServer() with a 10 second timeout.

I have a BackgroundService which is explicitly calling HealthCheckService.CheckHealthAsync(), passing in a cancellation token which is a combination of another timeout of 20s, and the BackgroundService's stoppingToken.

Simplified, the code is:

protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    using var maxTimeToken = new CancellationTokenSource(TimeSpan.FromSeconds(20));
    using var processingToken = CancellationTokenSource.CreateLinkedTokenSource(stoppingToken, maxTimeToken.Token);

    ...

    var healthReport = await _healthCheckService.CheckHealthAsync(processingToken.Token);

    ...
}    

(the actual code utilises a PeriodicTimer to poll for health repeatedly)

I can see that my cancellation token is being passed down to SqlServerHealthCheck, and I can also see that the 10s timeout is being used in DefaultHealthCheckService when constructing its own cancellation token.

My questions are two-fold:

  1. How can the health check fail after ~30ms unless ASP.NET Core was cancelling stoppingToken? My understanding is that this should only happen if my application was shutting down.
  2. How could the health job time out after almost 12 minutes (one of the log entries above)? Quite apart from the various timeouts in the cancellation tokens in both my code, your code and ASP.NET Core's code, SqlCommand.ExecuteScalarAsync() has a default timeout of 30s. The code which generates this log message looks straightforward.

The more I look into this and the more I debug it, the more confused I'm becoming.

sungam3r commented 1 year ago

Please try https://www.nuget.org/packages/AspNetCore.HealthChecks.SqlServer/7.0.0-rc2.11 . v7 release is coming soon.

william-keller commented 11 months ago

Dear friends @nullpainter and @sungam3r

I am glad to announce that after hours of investigation, i could identify and fix the root cause of this behavior.

In my scenario, it was happening because of a lack of CPU. (My container was running with only 0.5 CPU units of AWS ECS Fargate).

Increasing the CPU solved the problem.

Look: every period of 5 min, when my app perform a healthcheck, the CPU usage spikes to 100%.

image

sungam3r commented 9 months ago

Glad to hear. Closing?

ghost commented 4 weeks ago

Time profile shows that usually after 1 second request ends with

System.Threading.Tasks.TaskCanceledException: A task was canceled. at HealthChecks.SqlServer.SqlServerHealthCheck.CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken) in /_/src/HealthChecks.SqlServer/SqlServerHealthCheck.cs:line 28

which provides exactly none clue.

Profile