Lilypad-Tech / lilypad

Run AI workloads easily in a decentralized GPU network. https://www.youtube.com/watch?v=yQnB2Yxia4Y
https://lilypad.tech

feat: Add solver metrics #435

Open · bgins opened 5 days ago

bgins commented 5 days ago

Summary

This pull request adds metrics to the solver: system and process metrics, deal matching metrics, and deal state metrics (detailed below). It also includes a new solver store method and two minor refactors to support the work above.

We would like improved metrics observability on the solver to monitor system and process performance, deal matching stats, and deal status stats.

Task/Issue reference

Closes: #434

Test plan

Start the observability server on the bgins/feat-add-solver-dashboards branch. Open localhost:3000 in a browser and select "Dashboards" from the hamburger menu on the left.

Two dashboards have been added: Solver System Metrics and Solver Metrics.

Start the stack. Run the solver with explicit metrics configuration:

```sh
export ENABLE_METRICS=true
export METRICS_URL="http://localhost:8500"
export METRICS_TOKEN="eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJhdXRob3JpemVkIjp0cnVlLCJ1c2VyIjoicmVzb3VyY2UtcHJvdmlkZXIifQ.n36M_ngwC4XPQ_pEkkWAnPiOinnx6-0VO1v_WgCTUEERD7b_p9KHCU6SY5bUdFh5UXRZHAhc1gfyc7rjAnmeDQ"
./stack solver
```

Run a job. Metrics are collected and sent in batches once per minute, so it may take a moment before metrics are reported.
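
The once-per-minute batching matches the behavior of an OpenTelemetry periodic reader. As a minimal sketch of how the exporter might be wired up with the variables above (the exact wiring in this PR may differ, and attaching the token as a bearer header is an assumption):

```go
package metrics

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// setupMetrics wires an OTLP/HTTP exporter to a periodic reader that
// collects and exports metrics in batches once per interval.
func setupMetrics(ctx context.Context, url, token string) (*sdkmetric.MeterProvider, error) {
	exporter, err := otlpmetrichttp.New(ctx,
		otlpmetrichttp.WithEndpointURL(url),
		otlpmetrichttp.WithHeaders(map[string]string{"Authorization": "Bearer " + token}),
	)
	if err != nil {
		return nil, err
	}

	provider := sdkmetric.NewMeterProvider(
		// A one-minute interval means metrics can show up in the
		// dashboards up to a minute after they are recorded.
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(time.Minute))),
	)
	otel.SetMeterProvider(provider)
	return provider, nil
}
```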

Once reported, the Solver System Metrics dashboard should display standard metrics for system and process performance.

[Screenshot: Solver System Metrics dashboard]

The Solver Metrics dashboard displays deal state metrics. For example, in a local run we can see deal states over a couple of jobs:

[Screenshot: deal states over a couple of jobs]

The Solver Metrics dashboard also displays job offer, resource offer, and deal counts.

[Screenshot: job offer, resource offer, and deal counts]

Details

This pull request adds metrics to the solver. We may want to add the system metrics to the resource provider later, but we start with the solver to test the implementation before making it broadly available. Extending it to resource providers should be straightforward, though we may want to revisit some of the configuration decisions made here.

Configuration

This pull request configures metrics with MetricsOptions and MetricsConfig, kept deliberately separate from TelemetryOptions and TelemetryConfig.

We may want to revisit configuration in the future, but this approach is a first iteration while we consider the bigger picture of where and how metrics will be used.

Metrics must be configured explicitly with environment variables or command line options. My thinking here is that these metrics are mostly interesting in production, and it's easy enough to configure them in our deployments.
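
As a rough sketch of what that explicit opt-in looks like (field and function names here are illustrative, not necessarily the ones in this PR):

```go
package metrics

import "os"

// MetricsOptions mirrors the explicit opt-in described above.
// Field names are illustrative.
type MetricsOptions struct {
	Enabled bool   // ENABLE_METRICS
	URL     string // METRICS_URL
	Token   string // METRICS_TOKEN
}

// metricsOptionsFromEnv reads the environment variables from the test
// plan; metrics stay disabled unless ENABLE_METRICS is explicitly set.
func metricsOptionsFromEnv() MetricsOptions {
	return MetricsOptions{
		Enabled: os.Getenv("ENABLE_METRICS") == "true",
		URL:     os.Getenv("METRICS_URL"),
		Token:   os.Getenv("METRICS_TOKEN"),
	}
}
```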

System Metrics

This pull request includes an initial set of standard metrics for system and process performance.
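
Standard system and process metrics in Go are commonly collected with the OpenTelemetry contrib instrumentation packages; whether this PR uses exactly these packages is an assumption, but the idea looks like:

```go
package metrics

import (
	"go.opentelemetry.io/contrib/instrumentation/host"
	"go.opentelemetry.io/contrib/instrumentation/runtime"
)

// startSystemMetrics registers host-level metrics (CPU, memory, network)
// and Go runtime metrics (heap, GC, goroutines) on the global meter provider.
func startSystemMetrics() error {
	if err := host.Start(); err != nil {
		return err
	}
	return runtime.Start()
}
```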

Solver matching metrics

This pull request records matching metrics: job offer, resource offer, and deal counts.

These metrics are recorded once per solver control loop iteration.
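
A hedged sketch of that per-iteration recording with the OTel metrics API (instrument names and the function signature are assumptions):

```go
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
)

var (
	meter = otel.Meter("solver")

	// Instruments are created once and reused on every iteration.
	// Metric names are illustrative, not necessarily those in the PR.
	jobOfferCounter, _      = meter.Int64Counter("solver.job_offers")
	resourceOfferCounter, _ = meter.Int64Counter("solver.resource_offers")
	dealCounter, _          = meter.Int64Counter("solver.deals")
)

// recordMatchingMetrics runs once per solver control loop iteration,
// recording the counts seen by the matcher on this pass.
func recordMatchingMetrics(ctx context.Context, jobOffers, resourceOffers, deals int64) {
	jobOfferCounter.Add(ctx, jobOffers)
	resourceOfferCounter.Add(ctx, resourceOffers)
	dealCounter.Add(ctx, deals)
}
```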

Solver deal state metrics

This pull request records deal state metrics once per solver control loop iteration. This should be considered a form of sampling, because deal states are also updated outside of control loop iterations.
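
A minimal sketch of this sampling approach, assuming a recent otel-go with synchronous gauges (the metric name and the state counts query are hypothetical, not the exact code in this PR):

```go
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var dealStateGauge, _ = otel.Meter("solver").Int64Gauge(
	"solver.deal_states",
	metric.WithDescription("Sampled count of deals in each state"),
)

// sampleDealStates runs once per control loop iteration. The counts come
// from a (hypothetical) solver store query; because states also change
// between iterations, these readings are samples, not exact transitions.
func sampleDealStates(ctx context.Context, countsByState map[string]int64) {
	for state, count := range countsByState {
		dealStateGauge.Record(ctx, count,
			metric.WithAttributes(attribute.String("state", state)))
	}
}
```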

Eventually, we should implement UpDownCounters for each deal state when we have a better handle on the job lifecycle and its state transitions.
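
For comparison, the transition-based approach might look like the sketch below: each state change decrements the old state and increments the new one, so the counts stay exact without sampling (names are again hypothetical):

```go
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var dealsByState, _ = otel.Meter("solver").Int64UpDownCounter(
	"solver.deals.by_state",
	metric.WithDescription("Current number of deals in each state"),
)

// onDealStateChange would be called at every transition rather than once
// per control loop: decrement the old state, increment the new one.
func onDealStateChange(ctx context.Context, from, to string) {
	dealsByState.Add(ctx, -1, metric.WithAttributes(attribute.String("state", from)))
	dealsByState.Add(ctx, 1, metric.WithAttributes(attribute.String("state", to)))
}
```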

Computing deal states once per control loop iteration may have performance implications. On my local runs with a single job, the traces report ~12ms spent computing deal state metrics. This is likely to be higher in production, so we can keep an eye on it.

Related issues or PRs

Epic: https://github.com/Lilypad-Tech/internal/issues/345
Solver metrics dashboards: https://github.com/Lilypad-Tech/observability/pull/22
Future work: https://github.com/Lilypad-Tech/lilypad/issues/439