Lilypad-Tech / lilypad

Run AI workloads easily in a decentralized GPU network. https://www.youtube.com/watch?v=yQnB2Yxia4Y
https://lilypad.tech

feat: Add solver metrics #435

Open · bgins opened 5 days ago

bgins commented 5 days ago

Summary

This pull request adds metrics to the solver: system and process metrics, deal matching metrics, and deal state metrics (detailed below). It also includes a new solver store method and two minor refactors to support the work above.

We would like improved metrics observability on the solver to monitor system and process performance, deal matching stats, and deal status stats.

Task/Issue reference

Closes: #434

Test plan

Start the observability server on the bgins/feat-add-solver-dashboards branch. Open localhost:3000 in a browser and select "Dashboards" from the hamburger menu on the left.

Two dashboards have been added: Solver System Metrics and Solver Metrics.

Start the stack. Run the solver with explicit metrics configuration:

```sh
export ENABLE_METRICS=true
export METRICS_URL="http://localhost:8500"
export METRICS_TOKEN="eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJhdXRob3JpemVkIjp0cnVlLCJ1c2VyIjoicmVzb3VyY2UtcHJvdmlkZXIifQ.n36M_ngwC4XPQ_pEkkWAnPiOinnx6-0VO1v_WgCTUEERD7b_p9KHCU6SY5bUdFh5UXRZHAhc1gfyc7rjAnmeDQ"
./stack solver
```

Run a job. Metrics are collected and sent in batches once per minute, so it may take a moment before metrics are reported.
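
The once-per-minute batching matches the behavior of an OpenTelemetry periodic reader. As a minimal sketch of how the exporter might be wired up with the variables above (the exact wiring in this PR may differ, and attaching the token as a bearer header is an assumption):

```go
package metrics

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// setupMetrics wires an OTLP/HTTP exporter to a periodic reader that
// collects and exports metrics in batches once per interval.
func setupMetrics(ctx context.Context, url, token string) (*sdkmetric.MeterProvider, error) {
	exporter, err := otlpmetrichttp.New(ctx,
		otlpmetrichttp.WithEndpointURL(url),
		otlpmetrichttp.WithHeaders(map[string]string{"Authorization": "Bearer " + token}),
	)
	if err != nil {
		return nil, err
	}

	provider := sdkmetric.NewMeterProvider(
		// A one-minute interval means metrics can show up in the
		// dashboards up to a minute after they are recorded.
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(time.Minute))),
	)
	otel.SetMeterProvider(provider)
	return provider, nil
}
```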

Once reported, the Solver System Metrics dashboard should display standard metrics for system and process performance.

[Screenshot: Solver System Metrics dashboard]

The Solver Metrics dashboard displays deal state metrics. For example, in a local run we can see deal states over a couple of jobs:

[Screenshot: deal states over a couple of jobs]

The Solver Metrics dashboard also displays job offer, resource offer, and deal counts.

[Screenshot: job offer, resource offer, and deal counts]

Details

This pull request adds metrics to the solver. We may want to add the system metrics to the resource provider later, but we start with the solver to test the implementation before making it broadly available. Extending it to resource providers should be straightforward, though we may want to revisit some of the configuration decisions made here.

Configuration

This pull request configures metrics with MetricsOptions and MetricsConfig, kept deliberately separate from TelemetryOptions and TelemetryConfig.

We may want to revisit configuration in the future, but this approach is a first iteration while we consider the bigger picture of where and how metrics will be used.

Metrics must be configured explicitly with environment variables or command line options. My thinking here is that these metrics are mostly interesting in production, and it's easy enough to configure them in our deployments.
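
As a rough sketch of what that explicit opt-in looks like (field and function names here are illustrative, not necessarily the ones in this PR):

```go
package metrics

import "os"

// MetricsOptions mirrors the explicit opt-in described above.
// Field names are illustrative.
type MetricsOptions struct {
	Enabled bool   // ENABLE_METRICS
	URL     string // METRICS_URL
	Token   string // METRICS_TOKEN
}

// metricsOptionsFromEnv reads the environment variables from the test
// plan; metrics stay disabled unless ENABLE_METRICS is explicitly set.
func metricsOptionsFromEnv() MetricsOptions {
	return MetricsOptions{
		Enabled: os.Getenv("ENABLE_METRICS") == "true",
		URL:     os.Getenv("METRICS_URL"),
		Token:   os.Getenv("METRICS_TOKEN"),
	}
}
```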

System Metrics

This pull request includes an initial set of standard metrics for system and process performance.
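
Standard system and process metrics in Go are commonly collected with the OpenTelemetry contrib instrumentation packages; whether this PR uses exactly these packages is an assumption, but the idea looks like:

```go
package metrics

import (
	"go.opentelemetry.io/contrib/instrumentation/host"
	"go.opentelemetry.io/contrib/instrumentation/runtime"
)

// startSystemMetrics registers host-level metrics (CPU, memory, network)
// and Go runtime metrics (heap, GC, goroutines) on the global meter provider.
func startSystemMetrics() error {
	if err := host.Start(); err != nil {
		return err
	}
	return runtime.Start()
}
```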

Solver matching metrics

This pull request records matching metrics: job offer, resource offer, and deal counts.

These metrics are recorded once per solver control loop iteration.
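
A hedged sketch of that per-iteration recording with the OTel metrics API (instrument names and the function signature are assumptions):

```go
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
)

var (
	meter = otel.Meter("solver")

	// Instruments are created once and reused on every iteration.
	// Metric names are illustrative, not necessarily those in the PR.
	jobOfferCounter, _      = meter.Int64Counter("solver.job_offers")
	resourceOfferCounter, _ = meter.Int64Counter("solver.resource_offers")
	dealCounter, _          = meter.Int64Counter("solver.deals")
)

// recordMatchingMetrics runs once per solver control loop iteration,
// recording the counts seen by the matcher on this pass.
func recordMatchingMetrics(ctx context.Context, jobOffers, resourceOffers, deals int64) {
	jobOfferCounter.Add(ctx, jobOffers)
	resourceOfferCounter.Add(ctx, resourceOffers)
	dealCounter.Add(ctx, deals)
}
```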

Solver deal state metrics

This pull request records deal state metrics once per solver control loop iteration. This should be considered a form of sampling, because deal states are also updated outside of control loop iterations.
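
A minimal sketch of this sampling approach, assuming a recent otel-go with synchronous gauges (the metric name and the state counts query are hypothetical, not the exact code in this PR):

```go
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var dealStateGauge, _ = otel.Meter("solver").Int64Gauge(
	"solver.deal_states",
	metric.WithDescription("Sampled count of deals in each state"),
)

// sampleDealStates runs once per control loop iteration. The counts come
// from a (hypothetical) solver store query; because states also change
// between iterations, these readings are samples, not exact transitions.
func sampleDealStates(ctx context.Context, countsByState map[string]int64) {
	for state, count := range countsByState {
		dealStateGauge.Record(ctx, count,
			metric.WithAttributes(attribute.String("state", state)))
	}
}
```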

Eventually, we should implement UpDownCounters for each deal state when we have a better handle on the job lifecycle and its state transitions.
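
For comparison, the transition-based approach might look like the sketch below: each state change decrements the old state and increments the new one, so the counts stay exact without sampling (names are again hypothetical):

```go
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var dealsByState, _ = otel.Meter("solver").Int64UpDownCounter(
	"solver.deals.by_state",
	metric.WithDescription("Current number of deals in each state"),
)

// onDealStateChange would be called at every transition rather than once
// per control loop: decrement the old state, increment the new one.
func onDealStateChange(ctx context.Context, from, to string) {
	dealsByState.Add(ctx, -1, metric.WithAttributes(attribute.String("state", from)))
	dealsByState.Add(ctx, 1, metric.WithAttributes(attribute.String("state", to)))
}
```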

Computing deal states once per control loop iteration may have performance implications. On my local runs with a single job, the traces report ~12ms spent computing deal state metrics. This is likely to be higher in production, so we can keep an eye on it.

Related issues or PRs

Epic: https://github.com/Lilypad-Tech/internal/issues/345
Solver metrics dashboards: https://github.com/Lilypad-Tech/observability/pull/22
Future work: https://github.com/Lilypad-Tech/lilypad/issues/439