Azure / data-api-builder

Data API builder provides modern REST and GraphQL endpoints to your Azure Databases and on-prem stores.
https://aka.ms/dab/docs
MIT License

⭐ [Enhancement]: Improve Health Endpoint #2366

Open JerryNixon opened 3 weeks ago

JerryNixon commented 3 weeks ago

What is it?

Health as a standard

There is no official industry standard for the health endpoint. /health or variations like /_health are common by convention. ASP.NET Core uses Microsoft.Extensions.Diagnostics.HealthChecks.

Useful for automation

For example, Azure App Service & Azure Kubernetes Service (AKS) support health probes to monitor the health of your application. If a service fails health checks, Azure can automatically restart it or redirect traffic to healthy instances.

Similarly, if Data API builder fails health checks beyond a threshold the customer defines, they can recycle the container or alert the engineers responsible.

| Term | Description |
| --- | --- |
| Health Endpoint | The URL (e.g., /health) exposed as JSON. |
| Check | A specific diagnostic test (e.g., database, API). |
| Status | The result of a check. |
| Status.Healthy | The system is functioning correctly. |
| Status.Unhealthy | The system has a critical failure or issue. |
| Status.Degraded | The system is functioning, but with issues. |

More on degraded

We might opt not to include a Degraded status. If we do, "Degraded" means the system is operational but not performing optimally. For a database, for example, the query duration might exceed a defined threshold.

if (QueryDuration > DurationThreshold) {
    Check.Status = "Degraded"; // Query taking too long, degrading performance
}
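
As a slightly fuller sketch of the same idea, the helper below times a probe and maps the elapsed time to a status string. `DurationCheck`, `Classify`, and `RunAsync` are hypothetical names used only for illustration, and treating a probe that throws as "Unhealthy" is an assumption rather than something decided in this proposal.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of Data API builder or ASP.NET Core.
public static class DurationCheck
{
    // At or under the threshold is "Healthy"; anything slower is "Degraded".
    public static string Classify(TimeSpan elapsed, TimeSpan threshold)
        => elapsed > threshold ? "Degraded" : "Healthy";

    // Times the probe and classifies the elapsed time.
    // Assumption: a probe that throws is reported as "Unhealthy".
    public static async Task<string> RunAsync(Func<Task> probe, TimeSpan threshold)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await probe();
        }
        catch (Exception)
        {
            return "Unhealthy";
        }
        return Classify(stopwatch.Elapsed, threshold);
    }
}
```

A caller would pass the database query as the probe, for example `await DurationCheck.RunAsync(RunQueryAsync, TimeSpan.FromMilliseconds(100))`, where `RunQueryAsync` stands in for whatever executes the query.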

Overall health calculation

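| Healthy | Unhealthy | Degraded | Global Status |
| --- | --- | --- | --- |
| - | 0 | 0 | Healthy |
| - | ≥ 1 | - | Unhealthy |
| - | 0 | ≥ 1 | Degraded |

This table shows how the global health status is determined: any Unhealthy check makes the global status Unhealthy; otherwise, any Degraded check makes it Degraded; with neither, it is Healthy. A dash means the count does not matter.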
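
The roll-up rule can be sketched in a few lines. `CheckStatus` and `HealthRollup` are hypothetical names for this example (ASP.NET Core ships its own `HealthStatus` enum), and returning Healthy when there are no checks at all is an assumption.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical enum for this sketch.
public enum CheckStatus { Healthy, Degraded, Unhealthy }

public static class HealthRollup
{
    // Unhealthy wins over Degraded, and Degraded wins over Healthy.
    // Assumption: an empty set of checks rolls up to Healthy.
    public static CheckStatus Global(IEnumerable<CheckStatus> checks)
    {
        var statuses = checks.ToList();
        if (statuses.Contains(CheckStatus.Unhealthy)) return CheckStatus.Unhealthy;
        if (statuses.Contains(CheckStatus.Degraded)) return CheckStatus.Degraded;
        return CheckStatus.Healthy;
    }
}
```

For example, `HealthRollup.Global(new[] { CheckStatus.Healthy, CheckStatus.Degraded })` yields `CheckStatus.Degraded`.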

Output standard schema

Health check responses follow a common convention rather than a strict standard. The typical pattern involves a "checks" property for individual components' statuses (e.g., database, memory), with each status rolling up to an overall "status" at the top level.

Basic format

{
  "status": "Healthy",
  "checks": {
    "check-name": { "status": "Healthy" },
    "check-name": { "status": "Healthy" }
  }
}

Example

{
  "status": "Healthy",
  "checks": {
    "database": { "status": "Healthy" },
    "memory": { "status": "Healthy" }
  }
}

Other common fields

Fields like description, tags, data, and exception provide additional metadata.

1. Description:

A textual explanation of what the health check is doing or testing.

{
 "status": "Healthy",
 "description": "Checks database connection and query speed."
}

2. Tags:

Labels or categories that group or identify related health checks.

{
 "status": "Healthy",
 "tags": ["database", "critical"]
}

3. Data:

Any additional information collected during the health check, often technical metrics or diagnostics.

{
 "status": "Degraded",
 "data": {
   "responseTime": "250ms",
   "maxAllowedResponseTime": "100ms"
 }
}

4. Exception:

Information about any error or failure encountered during the health check.

{
 "status": "Unhealthy",
 "exception": "TimeoutException: Database query timed out."
}

Overall example

{
  "status": "Unhealthy",
  "created": "12/12/2000 12:00:00 UTC",
  "cache-ttl": 5,
  "checks": {
    "database": {
      "status": "Unhealthy",
      "description": "Checks if the database is responding within an acceptable timeframe.",
      "tags": ["database", "critical"],
      "data": {
        "responseTime": "500ms",
        "maxAllowedResponseTime": "100ms"
      },
      "exception": "TimeoutException: Database query timed out."
    }
  }
}

These fields help provide a more granular view of the health status, making it easier to understand why a particular check is failing or succeeding.
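
One way to model this shape in C# is a pair of records serialized with System.Text.Json, omitting optional fields that are not set. This is only a sketch: `CheckEntry`, `HealthDocument`, and `HealthJson` are hypothetical names, and the top-level "created" and "cache-ttl" fields are left out for brevity.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical types for illustration; only Status is required on a check.
public record CheckEntry(
    string Status,
    string? Description = null,
    string[]? Tags = null,
    Dictionary<string, string>? Data = null,
    string? Exception = null);

public record HealthDocument(string Status, Dictionary<string, CheckEntry> Checks);

public static class HealthJson
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // Emit "status", "checks", "tags", ... in lower camel case.
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        // Leave out description, tags, data, and exception when they are null.
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
    };

    public static string Serialize(HealthDocument document)
        => JsonSerializer.Serialize(document, Options);
}
```

Dictionary keys such as "database" are written as given, because the naming policy applies to property names only.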

(Additive) Data API builder config

The standard allows for additive data, like DAB config data we could add.

{
    "status": "Healthy",
    "version": "1.2.10",
    "dab-configuration": {
      "http": true,
      "https": true,
      "rest": true,
      "graphql": true,
      "telemetry": true,
      "caching": true,
      "dab-configs": [
        "/App/dab-config.json (mssql)"
      ],
      "dab-schemas": [
        "/App/schema.json (mssql)"
      ]
    },
    ...
}

Simple implementation

There is no formal guidance on check complexity; however, checks should not make the health endpoint unusable, and checks should implement a cancellation token to support timeouts.

using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

var healthChecks = builder.Services.AddHealthChecks();
healthChecks.AddCheck<CustomHealthCheck>("CustomCheck");

var app = builder.Build();
app.UseHttpsRedirection();
app.MapGet("/date", () => DateTime.Now.ToString());

app.UseHealthChecks("/health");

app.Run();

// A trivial check that always reports Healthy; a real check would probe a dependency.
public class CustomHealthCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
        => Task.FromResult(HealthCheckResult.Healthy());
}
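
To illustrate the cancellation-token point, here is a minimal sketch of a probe wrapper that enforces its own deadline while still honoring the caller's token. `CheckTimeout` and `RunAsync` are hypothetical names, and reporting a timed-out probe as "Unhealthy" is an assumption.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration.
public static class CheckTimeout
{
    // Returns "Healthy" if the probe completes in time, "Unhealthy" if the
    // deadline passes. Cancellation requested by the caller is rethrown.
    public static async Task<string> RunAsync(
        Func<CancellationToken, Task> probe,
        TimeSpan timeout,
        CancellationToken cancellationToken = default)
    {
        // The linked source cancels when either the caller cancels or the timer fires.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(timeout);
        try
        {
            await probe(cts.Token);
            return "Healthy";
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The deadline fired, not the caller, so report a failed check.
            return "Unhealthy";
        }
    }
}
```

The probe must pass the token on to whatever it awaits (for example `HttpClient.GetAsync(url, token)`), otherwise the deadline has no effect.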

In addition to one class for each check, we can reuse checks by leveraging the factory syntax:

string[] endpoints = ["/api1", "/api2"];
foreach (var endpoint in endpoints)
{
    healthChecks.Add(new HealthCheckRegistration(
        name: endpoint,
        factory: _ => new EndpointHealthCheck(endpoint),
        failureStatus: HealthStatus.Unhealthy,
        tags: ["endpoint"],
        timeout: TimeSpan.FromSeconds(10)));
}

Example API check

public class EndpointHealthCheck(string endpoint) : IHealthCheck
{
    // Shared client: creating an HttpClient per check can exhaust sockets.
    private static readonly HttpClient Client = new();

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Ask for a single row so the probe stays cheap.
        var url = $"https://localhost:7128/{endpoint.Trim('/')}?$top=1";
        using var response = await Client.GetAsync(url, cancellationToken);

        return response.IsSuccessStatusCode
            ? HealthCheckResult.Healthy()
            : HealthCheckResult.Unhealthy($"Unexpected HTTP status {(int)response.StatusCode}.");
    }
}

Configuration changes

Because we have the configuration, we know if this is a stored procedure or table/view endpoint. We might want to allow the developer to influence how the checks work against the endpoint/entity.

{
  "runtime": {
    "health": {
      "enabled": true,                  // default: true
      "cache-ttl": 5,                   // optional, default: 5
      "roles": ["sean", "jerry", "*"]   // optional, default: *
    }
  }
}
{
  "data-source": {
    "health": {
      "moniker": "sqlserver",           // optional, default: GUID
      "enabled": true,                  // default: true
      "query": "SELECT TOP 1 1",        // optional
      "threshold-ms": 100               // optional, default: 10000
    }
  }
}
{
  "<entity-name>": {
    "health": {
      "enabled": true,                  // default: true
      "filter": "Id eq 1",              // optional, default: null
      "first": 1,                       // optional, default: 1
      "threshold-ms": 100               // optional, default: 10000
    },
    ...
  }
}
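
As a rough sketch of how the per-entity "health" block might be read with the defaults listed above, the class below deserializes it with System.Text.Json. `EntityHealthOptions` and `Parse` are hypothetical names, not the actual Data API builder configuration types, and the input is assumed to be plain JSON without the annotations shown in the proposal.

```csharp
#nullable enable
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical options type mirroring the proposed per-entity "health" block.
public class EntityHealthOptions
{
    // Property initializers supply the proposed defaults when a key is absent.
    [JsonPropertyName("enabled")] public bool Enabled { get; set; } = true;
    [JsonPropertyName("filter")] public string? Filter { get; set; }
    [JsonPropertyName("first")] public int First { get; set; } = 1;
    [JsonPropertyName("threshold-ms")] public int ThresholdMs { get; set; } = 10000;

    // Falls back to an all-defaults instance if the JSON is the literal null.
    public static EntityHealthOptions Parse(string json)
        => JsonSerializer.Deserialize<EntityHealthOptions>(json) ?? new EntityHealthOptions();
}
```

Parsing `{}` gives the defaults, while `{"threshold-ms": 100, "filter": "Id eq 1"}` overrides only those two values.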

Output sample

{
  "status": "Healthy",
  "version": "1.2.3.4",
  "created": "12/12/2000 12:00:00 UTC",
  "dab-configuration": {
    "http": true,
    "https": true,
    "rest": true,
    "graphql": true,
    "telemetry": true,
    "caching": true,
    "health-cache-ttl": 5,
    "dab-configs": [
      "/App/dab-config.json (mssql)"
    ],
    "dab-schemas": [
      "/App/schema.json"
    ]
  },
  "checks": {
    "database": {
      "status": "Healthy"
    },
    "<entity-name>": {
      "status": "Healthy"
    },
    "<entity-name>": {
      "status": "Healthy"
    }
  }
}

Questions

  1. What to show in development versus production?
    • Not an issue; use the global enabled setting.
  2. Should we introduce a formatted version, like https://localhost/health/ui?
    • Not in the first effort.
  3. We should create a DAB health JSON schema! Yes!

Related issues to close

seantleonard commented 9 hours ago

Another healthcheck example: DabHealthCheck.cs https://github.com/Azure/data-api-builder/blob/bbe1851df86065245d2bdd342d8b75a9304f2e00/src/Service/HealthCheck/DabHealthCheck.cs#L17