Azure / data-api-builder

Data API builder provides modern REST and GraphQL endpoints to your Azure Databases and on-prem stores.
https://aka.ms/dab/docs
MIT License
947 stars 195 forks

⭐ [Enhancement]: Performance Counters with OpenTelemetry #2397

Open JerryNixon opened 1 month ago

JerryNixon commented 1 month ago

What is it?

Use OpenTelemetry to add tracing events and top-level counters for exporting to monitors and the health endpoint.

Value prop

Besides aligning with industry trends, CorrelationId (TraceId) is crucial for cross-service traceability in OpenTelemetry. It links spans across multiple services, providing a full view of a request’s lifecycle in distributed systems.

Versus logging

This does not mean Serilog is unnecessary. Serilog handles logging, while OpenTelemetry focuses on tracing and metrics. Both are useful and often work together.

OpenTelemetry

OpenTelemetry is the standard for shipping metrics and events in Azure.
Learn more

OpenTelemetry can work alongside ILogger for logging, but metrics and traces are handled separately. OpenTelemetry is designed for distributed tracing and metrics collection, while ILogger focuses on logging.

Concepts

| Concept | Description |
| --- | --- |
| Meter | A component that creates and manages metrics (e.g., counters, histograms) to track real-time performance data. |
| Metric | A single value updated programmatically. |
| Counter (metric type) | A single value. |
| Histogram (metric type) | A distribution of values. |
| Event | A point-in-time log or action recorded within a span. |
| Activity (Span) | A time-bound operation with a start and end, with zero or more events. |
| Trace | A collection of spans representing a full operation lifecycle across services. |
| TraceId | A CorrelationId automatically incorporated by middleware or generated. |
| Propagator | Injects and extracts TraceId, typically via headers. |
| Exporter | Sends telemetry data to systems (e.g., Prometheus, Jaeger, Zipkin). |
| Sampler | Decides which traces to capture and whether to record/export them. |
| AlwaysOnSampler (sampler type) | Records all traces. |
| AlwaysOffSampler (sampler type) | Discards all traces. |
| ParentBasedSampler (sampler type) | Uses the sampling decision of the parent span. |
| TraceIdRatioBasedSampler (sampler type) | Samples a percentage of traces based on a ratio. |
| Resource | Metadata describing the entity producing telemetry data (e.g., service name). |
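
To make the concepts concrete, here is a minimal sketch that uses only the .NET base class library (`System.Diagnostics` and `System.Diagnostics.Metrics`), which is the API surface the OpenTelemetry .NET SDK listens to. The name `Sketch.DataApi` and the instrument names are made up for illustration; the hand-written listeners stand in for what the SDK would normally register.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Linq;

// Hypothetical name, for illustration only.
const string SourceName = "Sketch.DataApi";

// Meter: creates and owns the instruments (metrics).
var meter = new Meter(SourceName, "1.0.0");
var requests = meter.CreateCounter<long>("sketch.requests");                   // Counter
var durationMs = meter.CreateHistogram<double>("sketch.duration", unit: "ms"); // Histogram

// ActivitySource: creates Activities, the .NET name for spans.
var source = new ActivitySource(SourceName);

// Without a listener, StartActivity returns null. This listener samples
// everything, which is roughly what an always-on sampler does.
using var activityListener = new ActivityListener
{
    ShouldListenTo = s => s.Name == SourceName,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(activityListener);

// MeterListener: collects measurements, standing in for an exporter.
var recorded = new List<(string Instrument, double Value)>();
using var meterListener = new MeterListener();
meterListener.InstrumentPublished = (instrument, listener) =>
{
    if (instrument.Meter.Name == SourceName)
        listener.EnableMeasurementEvents(instrument);
};
meterListener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => recorded.Add((instrument.Name, value)));
meterListener.SetMeasurementEventCallback<double>(
    (instrument, value, tags, state) => recorded.Add((instrument.Name, value)));
meterListener.Start();

int eventCount;
using (Activity? activity = source.StartActivity("HandleRequest"))
{
    activity?.AddEvent(new ActivityEvent("cache.miss")); // Event inside the span
    requests.Add(1);
    durationMs.Record(12.5);
    eventCount = activity?.Events.Count() ?? 0;
    Console.WriteLine($"TraceId: {activity?.TraceId}");
}

Console.WriteLine($"Measurements: {recorded.Count}, span events: {eventCount}");
```

In a real application the two listeners would be replaced by the OpenTelemetry SDK registration, which subscribes to the same `Meter` and `ActivitySource` by name.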

Code Sample

View the code sample here

Relevant NuGet packages

OpenTelemetry integrates with ASP.NET Core through the following NuGet packages:

| Package | Description |
| --- | --- |
| OpenTelemetry.Extensions.Hosting | Provides extensions for integrating OpenTelemetry into ASP.NET Core hosting services. |
| OpenTelemetry.Instrumentation.AspNetCore | Automatically instruments incoming HTTP requests in ASP.NET Core applications. |
| OpenTelemetry.Instrumentation.Runtime | Captures metrics about .NET runtime performance (e.g., GC, exceptions). |
| OpenTelemetry.Instrumentation.Http | Instruments outgoing HTTP requests to track their performance and errors. |
| OpenTelemetry.Exporter.Console | Exports telemetry data (metrics, traces, logs) to the console for development and debugging purposes. |
| OpenTelemetry.Exporter.Prometheus.AspNetCore | Exposes metrics in a format Prometheus can scrape, integrating with the Prometheus monitoring system. |
| OpenTelemetry.Instrumentation.SqlClient | Instruments database operations made via SQL client to track performance and errors in SQL queries. |
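
As a rough sketch of how these packages fit together in an ASP.NET Core `Program.cs`, the wiring could look like the following. It assumes all of the packages above are referenced; the service name is a placeholder, and some of the extension methods come from prerelease packages, so exact names may differ between versions.

```csharp
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()                       // OpenTelemetry.Extensions.Hosting
    // "data-api-builder" is an assumed service name, not taken from the project.
    .ConfigureResource(resource => resource.AddService("data-api-builder"))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()                   // incoming HTTP requests
        .AddHttpClientInstrumentation()                   // outgoing HTTP requests
        .AddRuntimeInstrumentation()                      // GC, exceptions, thread pool
        .AddPrometheusExporter())                         // scrape endpoint for Prometheus
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation()                    // SQL client spans
        .AddConsoleExporter());                           // development only

var app = builder.Build();

// Serves the Prometheus scrape endpoint (by default at /metrics).
app.UseOpenTelemetryPrometheusScrapingEndpoint();

app.Run();
```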

Metrics & Traces to Add to Data API builder

| Name | Type | Description | Partition |
| --- | --- | --- | --- |
| Request Count | Metric (Counter) | Tracks the number of API requests processed. | Per endpoint |
| Request Duration | Metric (Histogram) | Measures the time taken to process each API request. | Per endpoint |
| Error Rate | Metric (Counter) | Tracks the number of failed API requests. | Per endpoint |
| DB Query Span | Trace | Captures the duration of database queries per API request. | Per API request |
| Authorization Check | Trace | Tracks the time taken to validate user permissions. | Per API request |
| Cache Hit/Miss Event | Event | Logs when a cache hit or miss occurs during a request. | Per cache event |
| Startup Event | Event | Records the time and status when the API starts. | Global |

  1. Per API request: Captures metrics or traces for each individual API request, giving a detailed view of specific executions (e.g., database query times for each call).

  2. Per endpoint: Aggregates metrics or traces based on the API endpoint (e.g., /api1, /api2), providing overall performance stats for specific API routes.

  3. Per cache event: Tracks when cache hits or misses occur, logging each event as it happens.

  4. Global: Applies to the entire application or service (e.g., API startup events), capturing broad, system-wide metrics or events.
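
One possible shape for the per-endpoint metrics is sketched below, again using only base class library types. `RequestTelemetry`, the `sketch.*` instrument names, and the `endpoint` tag key are hypothetical and not part of Data API builder; the `MeterListener` is only there so the sketch can show what was recorded.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Collect counter measurements so the sketch can print them.
var seen = new List<string>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == RequestTelemetry.Name)
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    string endpoint = "";
    foreach (var tag in tags)
        if (tag.Key == "endpoint") endpoint = tag.Value?.ToString() ?? "";
    seen.Add($"{instrument.Name}={value} endpoint={endpoint}");
});
listener.Start();

// One successful request and one failing request, on different endpoints.
RequestTelemetry.Track("/api/books", () => { });
try
{
    RequestTelemetry.Track("/api/authors", () => throw new InvalidOperationException("boom"));
}
catch (InvalidOperationException) { }

foreach (string line in seen) Console.WriteLine(line);

// Hypothetical helper: every name below is an assumption for illustration.
static class RequestTelemetry
{
    public const string Name = "Sketch.DataApi.Requests";

    private static readonly Meter s_meter = new(Name);
    private static readonly ActivitySource s_source = new(Name);
    private static readonly Counter<long> s_requests = s_meter.CreateCounter<long>("sketch.request.count");
    private static readonly Counter<long> s_errors = s_meter.CreateCounter<long>("sketch.request.errors");
    private static readonly Histogram<double> s_duration =
        s_meter.CreateHistogram<double>("sketch.request.duration", unit: "ms");

    public static void Track(string endpoint, Action handler)
    {
        // The tag is what partitions the metric "per endpoint".
        var tag = new KeyValuePair<string, object?>("endpoint", endpoint);
        var stopwatch = Stopwatch.StartNew();

        // Null unless something is listening to this ActivitySource.
        using Activity? span = s_source.StartActivity("sketch.request");
        span?.SetTag("endpoint", endpoint);
        try
        {
            handler();
        }
        catch
        {
            s_errors.Add(1, tag);   // failed requests only
            throw;
        }
        finally
        {
            s_requests.Add(1, tag); // every request, failed or not
            s_duration.Record(stopwatch.Elapsed.TotalMilliseconds, tag);
        }
    }
}
```

Keeping the endpoint as a tag rather than baking it into the instrument name lets a backend aggregate across endpoints or break the numbers down per route without code changes.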

Discussion

  1. Should we update Application Insights?
  2. Should we support Prometheus /metrics?
  3. Do we need custom metrics considering what we already have?
  4. Fusion Cache DOES have OpenTelemetry support.
  5. Hot Chocolate DOES have OpenTelemetry support?
  6. The /metrics endpoint cannot be part of the REST/GraphQL paths.
  7. How does the user configure this?

JerryNixon commented 1 month ago

https://dateo-software.de/blog/improve-your-applications-observability-with-custom-health-checks

tommasodotNET commented 1 month ago

Hi @JerryNixon, regarding the configuration topics, I think we could try something like:

```json
{
  "runtime": {
    ...
    "telemetry": {
      "otel": {
        "enabled": true,
        "endpoint": "@env('OTEL_EXPORTER_OTLP_ENDPOINT')"
      }
    },
    ...
  }
}
```

Doing it this way, the OTEL config can live alongside the existing Application Insights config, and we could handle both configs at the code level, as .NET Aspire does.
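
To illustrate how such a section might be consumed, here is a small sketch that reads `runtime.telemetry.otel` with `System.Text.Json` and resolves the `@env('...')` placeholder. `OtelOptions` and `ResolveEnv` are hypothetical names, the regex is a simplified stand-in for however the config loader actually handles `@env`, and the environment variable is set in code only so the sketch runs on its own.

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

// The proposed section, without the elided sibling properties.
const string json = """
{
  "runtime": {
    "telemetry": {
      "otel": {
        "enabled": true,
        "endpoint": "@env('OTEL_EXPORTER_OTLP_ENDPOINT')"
      }
    }
  }
}
""";

// Set here only so the sketch is self-contained.
Environment.SetEnvironmentVariable("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317");

using JsonDocument doc = JsonDocument.Parse(json);
JsonElement otel = doc.RootElement
    .GetProperty("runtime")
    .GetProperty("telemetry")
    .GetProperty("otel");

var options = new OtelOptions(
    Enabled: otel.GetProperty("enabled").GetBoolean(),
    Endpoint: ResolveEnv(otel.GetProperty("endpoint").GetString() ?? ""));

Console.WriteLine($"enabled={options.Enabled}, endpoint={options.Endpoint}");

// Replaces each @env('NAME') with the value of the environment variable NAME
// (empty string when the variable is not set).
static string ResolveEnv(string value) =>
    Regex.Replace(value, @"@env\('([^']+)'\)",
        match => Environment.GetEnvironmentVariable(match.Groups[1].Value) ?? "");

// Hypothetical options type mirroring the proposed JSON shape.
record OtelOptions(bool Enabled, string Endpoint);
```

With options in this shape, the startup code could register the OTLP exporter only when `Enabled` is true, independently of whether Application Insights is configured.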