dotnet / extensions

This repository contains a suite of libraries that provide facilities commonly needed when creating production-ready applications.
MIT License
2.63k stars 752 forks source link

ResourceUtilizationHealthCheck only reports CPU _or_ memory issues, not both #4677

Closed Tratcher closed 1 month ago

Tratcher commented 11 months ago

Description

When checking health based on resource utilization, the conditions are checked in series and only the first failure is reported. https://github.com/dotnet/extensions/blob/68810ca7399be38b8490a114ef52fdae4ce00058/src/Libraries/Microsoft.Extensions.Diagnostics.HealthChecks.ResourceUtilization/ResourceUtilizationHealthCheck.cs#L39-L60

Reproduction Steps

Test on a system with both high CPU and memory.

Expected behavior

Both CPU and memory issues should be reported.

Actual behavior

Only the first issue found is reported.

Regression?

No response

Known Workarounds

No response

Configuration

8.0

Other information

Recommendation: ResourceUtilizationHealthCheck should be split into two IHealthCheck implementations, one for CPU and one for memory. That way both states can be reported independently.

willibrandon commented 3 months ago

I just bumped into this myself and agree with @Tratcher’s recommendation.

Recommendation: ResourceUtilizationHealthCheck should be split into two IHealthCheck implementations, one for CPU and one for memory. That way both states can be reported independently.

Anything I can do to help?

RussKie commented 3 months ago

Thank you for bumping this issue, @willibrandon. To me it totally makes sense.

@geeknoid @rsreepathi @mwierzchowski @mobratil what do you think of this?

RussKie commented 3 months ago

To start the ball rolling, we'll need an API proposal which we'll then discuss and take to the ASP.NET API Review Board for further approval.

RussKie commented 3 months ago

/cc: @evgenyfedorov2

evgenyfedorov2 commented 3 months ago

Hi @willibrandon , thanks for stepping in.

It does not look like we have to make any changes to public API surface, we probably can just create those two proposed IHealthCheck implementations and register both via the existing extension methods.

What do you think?

willibrandon commented 2 months ago

@evgenyfedorov2 - I apologize I am usually more prompt than this.

Yes I agree that we can just create those two proposed IHealthCheck implementations, but I also think that the health of each component should be included in the HealthCheckResult using the optional IReadOnlyDictionary<string, object> data parameter.

willibrandon commented 2 months ago

Or perhaps it would be more appropriate to return a HealthReport?

Represents the result of executing a group of IHealthCheck instances.

RussKie commented 2 months ago

Or perhaps it would be more appropriate to return a HealthReport?

Represents the result of executing a group of IHealthCheck instances.

From my observations, dotnet/aspnetcore has two implementations:

Task<HealthCheckResult> CheckHealthAsync(...)
Task<HealthReport> CheckHealthAsync(...)

The first one is used whenever there is a single component to be checked for health (this is the part of the IHealthCheck contract). The last one is used whenever there are multiple components involved (this is part of HealthCheckService contract).

In our case, we use the first signature, however, we check multiple components (i.e., CPU and memory), and this doesn't feel correct. I believe, we should have defined multiple health checks - one for the CPU utilisation, one for memory consumption, etc. Then those health checks should get registered via the standard mechanisms and aggregated via HealthCheckService implementations.

sebastienros commented 2 months ago

IMO the component should keep doing what it is meant for (resource utilization health checks) and the simplest fix would be to just send a compound message per health level:

Creating two separate healthchecks is a different issue, though could make sense since healthchecks reports is a thing that would solve the aggregation issue (a breaking change if this one is removed instead). Also someone can already track CPU only or memory only with this component (by setting custom limits for each).

RussKie commented 2 months ago

IMO the component should keep doing what it is meant for (resource utilization health checks) and the simplest fix would be to just send a compound message per health level:

  • if CPU and memory are unhealthy send a message like "CPU and Memory usages are above their limits"
  • if only CPU or memory is unhealthy keep the current message (independently of the other being in degraded status)
  • if CPU and memory are degraded send a message like "CPU and Memory usages are close to their limits"
  • if only CPU or memory is degraded keep the current message

Creating two separate healthchecks is a different issue, though could make sense since healthchecks reports is a thing that would solve the aggregation issue (a breaking change if this one is removed instead). Also someone can already track CPU only or memory only with this component (by setting custom limits for each).

@sebastienros, to clarify I understand your suggestion correctly, do you mean the following?

message = "";
if (cpu != healthy) { message += "CPU unhealthy" }
if (memory != healthy) { message += "memory unhealthy" }
return HealthCheckResult.Unhealthy(message);
sebastienros commented 2 months ago

@RussKie something like this, yes.

An alternative could be to have the degraded message also added, but still return the highest of the levels. Example with CPU unhealthy and memory degraded:

CPU usage is above the limit. Memory is close to the limit, and return Unhealthy because one them at least is unhealthy. If both were degraded then Degraded would be returned.

That would also make the code easier. Just keep the current checks, but don't return, just append to the message. And in the end return the highest detected level.

RussKie commented 2 months ago

As a short-term solution/workaround this is probably ok, but to me this still feels as conflating two (or more) checks into one, which feels like conflicting with the purpose of the IHealthCheck contract.

sebastienros commented 2 months ago

@RussKie definitely.

I assume that the goal here was performance. Two healthchecks would mean two calls to _dataTracker.GetUtilization(_options.SamplingWindow); which might be expensive.

willibrandon commented 2 months ago

A compound message in the description works for me and even better if additional key-value pairs describing the health of each component could be included in the data.