Summary

This pull request adds metrics to the solver. It also includes a new solver store method and two minor refactors to support that work:
- [x] Add GetDealsAll solver store method
- [x] Initialize tracer provider with noop provider
- [x] Lift resource instantiation to SetupOTelSDK
We would like improved metrics observability on the solver to monitor system and process performance, deal matching stats, and deal status stats.
Task/Issue reference
Closes: #434
Test plan
Start the observability server on the bgins/feat-add-solver-dashboards branch. Open localhost:3000 in a browser and select "Dashboards" from the hamburger menu on the left.
Two dashboards have been added:
- Solver System Metrics (system and process metrics)
- Solver Metrics (deal state metrics, matcher metrics)
Start the stack. Run the solver with explicit metrics configuration:
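The exact command is not shown here; as a hypothetical sketch, a run with explicit metrics configuration might look like the following. The variable names are assumptions based on standard OTLP exporter conventions, not the solver's actual MetricsConfig, and the start command may differ in your setup.

```shell
# Hypothetical sketch only; variable names follow standard OTLP exporter
# conventions and may not match the solver's actual MetricsConfig.
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT="http://localhost:4318/v1/metrics"

# Start the solver (binary and subcommand names may differ in your setup).
go run . solver
```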
Run a job. Metrics are collected and sent in batches once per minute, so it may take a moment before metrics are reported.
Once reported, the Solver System Metrics dashboard should display standard metrics for system and process performance.
The Solver Metrics dashboard displays deal state metrics. For example, in a local run we can see deal states over a couple of jobs.
The Solver Metrics dashboard also displays job offer, resource offer, and deal counts.
Details
This pull request adds metrics to the solver. We may want to add system metrics for the resource provider in the future, but we start with the solver to test the implementation before making it broadly available. It should be easy to add metrics to resource providers later, though we may want to revisit some of the configuration decisions made here.
Configuration
This pull request configures metrics with MetricsOptions and MetricsConfig. We have kept configuration separate from TelemetryOptions and TelemetryConfig for a couple of reasons:
- We only want metrics config for the solver
- We may send metrics and logs to a local collector like FluentBit, while still sending traces directly to the observability instance
We may want to revisit configuration in the future, but this approach is a first iteration while we consider the bigger picture of where and how metrics will be used.
Metrics must be configured explicitly with environment variables or command line options. My thinking here is that these metrics are mostly interesting in production, and it's easy enough to configure them in our deployments.
System Metrics
This pull request records an initial set of system metrics:
- System uptime
- System load average (1, 5, and 15 minutes)
- System CPU used percent and logical CPU count
- System memory used percent, used, available, and total
- System network connections count, bytes transmitted, and bytes received
- System filesystem used percent, used, free, total, and inodes used percent
- Process uptime, memory used percent, CPU used percent, connections count, and thread count
Solver matching metrics
This pull request records the following matcher metrics:
- Resource offers available to match
- Job offers available to match
- Deals made during matching
These metrics are recorded once per solver control loop iteration.
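A minimal sketch of gathering these three values once per iteration is below. The types and the pairing step are illustrative assumptions; the real matcher's logic is more involved.

```go
package main

import "fmt"

// Hypothetical offer types; the real solver types differ.
type JobOffer struct{ ID string }
type ResourceOffer struct{ ID string }

// MatchMetrics holds the three values recorded each control loop iteration.
type MatchMetrics struct {
	JobOffers      int // job offers available to match
	ResourceOffers int // resource offers available to match
	DealsMade      int // deals made during this iteration
}

// iterate runs one control loop iteration and returns the metrics to
// record. The pairing here is a trivial stand-in: one deal per job
// offer while resource offers remain.
func iterate(jobs []JobOffer, resources []ResourceOffer) MatchMetrics {
	m := MatchMetrics{JobOffers: len(jobs), ResourceOffers: len(resources)}
	for i := 0; i < len(jobs) && i < len(resources); i++ {
		m.DealsMade++
	}
	return m
}

func main() {
	m := iterate(
		[]JobOffer{{ID: "job-1"}, {ID: "job-2"}},
		[]ResourceOffer{{ID: "ro-1"}},
	)
	fmt.Println(m.JobOffers, m.ResourceOffers, m.DealsMade) // 2 1 1
}
```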
Solver deal state metrics
This pull request records deal state metrics once per solver control loop iteration. This should be considered a form of sampling because deal states are updated outside of control loop iterations.
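The sampling described above can be sketched as counting states over a point-in-time snapshot of all deals, as a GetDealsAll-style query might return. The Deal type and state names here are illustrative assumptions.

```go
package main

import "fmt"

// Hypothetical Deal type; the real solver type differs.
type Deal struct {
	ID    string
	State string
}

// countDealsByState computes deal-state counts from a point-in-time
// snapshot of all deals. Because deal states change outside the
// control loop, each call is a sample, not an exact running count.
func countDealsByState(deals []Deal) map[string]int {
	counts := make(map[string]int)
	for _, d := range deals {
		counts[d.State]++
	}
	return counts
}

func main() {
	deals := []Deal{
		{ID: "1", State: "DealAgreed"},
		{ID: "2", State: "ResultsSubmitted"},
		{ID: "3", State: "DealAgreed"},
	}
	fmt.Println(countDealsByState(deals)) // map[DealAgreed:2 ResultsSubmitted:1]
}
```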
Eventually, we should implement an UpDownCounter for each deal state once we have a better handle on the job lifecycle and its state transitions.
Computing deal states once per control loop iteration may have performance implications. In my local runs with a single job, the traces report ~12ms spent computing deal state metrics. This is likely to be higher in production, so we should keep an eye on it.
Related issues or PRs
Epic: https://github.com/Lilypad-Tech/internal/issues/345
Solver metrics dashboards: https://github.com/Lilypad-Tech/observability/pull/22
Future work: https://github.com/Lilypad-Tech/lilypad/issues/439