filecoin-project / retrieval-load-testing

Hit Prometheus URL for CPU & Memory metrics from Boost at end of test #6

Open · hannahhoward opened 1 year ago

hannahhoward commented 1 year ago

What

We need to collect memory and CPU usage from Boost at the end of each test run. Boost can expose these values via Prometheus (when enabled).

Suggested implementation

Add a new env var to the .env file: BOOST_PROMETHEUS_URL
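For example (the value here is a placeholder, assuming the URL points at the Prometheus server that scrapes Boost):

```
# Prometheus server to query for Boost CPU/memory stats
BOOST_PROMETHEUS_URL=http://localhost:9090
```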

Add a teardown function to the script: https://k6.io/docs/using-k6/test-lifecycle/

If BOOST_PROMETHEUS_URL is present, hit it to collect average CPU usage and average memory usage (and perhaps also the median and 90th percentile). You'll need to figure out the right metric names and the right endpoints to query. You may also need to record the test start time in setup() so it's available in teardown() and you can query the right time range.

Once you have the values, I recommend saving them to a Gauge metric (https://k6.io/docs/javascript-api/k6-metrics/gauge/), which should put them into the summary report under "last value" (a bit of a hack-ish way to save a single value into the summary). A sketch of the whole flow follows.
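Here's a minimal sketch of how the pieces might fit together. The metric names, PromQL expressions, and response handling are illustrative and would need to be verified against the actual Boost exporter; it assumes BOOST_PROMETHEUS_URL points at the Prometheus server's API root:

```javascript
import http from 'k6/http';
import { Gauge } from 'k6/metrics';

// Custom metrics must be created in the init context.
const boostCpuAvg = new Gauge('boost_cpu_avg_cores');
const boostMemAvg = new Gauge('boost_mem_avg_bytes');

export function setup() {
  // Record the start time so teardown can compute the query window.
  return { startTime: Date.now() };
}

export default function () {
  // ... the actual retrieval test ...
}

export function teardown(data) {
  const promUrl = __ENV.BOOST_PROMETHEUS_URL;
  if (!promUrl) return;

  const windowSec = Math.ceil((Date.now() - data.startTime) / 1000);

  // rate() over the CPU-seconds counter gives average cores used over the
  // test window; avg_over_time() averages the heap gauge (using the gauge
  // variant go_memstats_alloc_bytes here, since averaging a counter is
  // meaningless -- check which metrics Boost actually exposes).
  const queries = [
    [boostCpuAvg, `rate(process_cpu_seconds_total[${windowSec}s])`],
    [boostMemAvg, `avg_over_time(go_memstats_alloc_bytes[${windowSec}s])`],
  ];

  for (const [gauge, query] of queries) {
    const res = http.get(
      `${promUrl}/api/v1/query?query=${encodeURIComponent(query)}`
    );
    const body = res.json();
    // Prometheus instant-query results: result[0].value is [timestamp, "value"].
    const result = body && body.data && body.data.result;
    if (result && result.length > 0) {
      gauge.add(parseFloat(result[0].value[1]));
    }
  }
}
```

Note that http.get() in teardown still records k6's built-in HTTP metrics, which is the isolation problem discussed below.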

Acceptance criteria

CPU & memory usage are included in the JSON output and the CSV summary report.

kylehuntsman commented 1 year ago

After some clarification from @hannah, the goal of this ticket is to collect CPU and memory usage metrics on a per-test basis, so that we could potentially see results along the lines of "CPU/memory is abundant with 10 users, but with 100 you max out the CPU".

CPU and memory usage stats were going to be pulled from the Prometheus server running on filcollins at localhost:9090/api/v1/query, using the HTTP API to query process_cpu_seconds_total and go_memstats_alloc_bytes_total from the boost instance for the runtime of each test. Those values would be added to k6 gauge metrics for both boost and raw retrievals.
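For reference, the instant queries would have looked something like this (a sketch: the `job` label matcher and window are hypothetical, and since both metrics are counters they need rate()/increase() rather than a raw average):

```javascript
// Hypothetical query construction against the Prometheus HTTP API.
const base = 'http://localhost:9090/api/v1/query';
const testDurationSec = 300; // would come from the actual test run

// Average CPU cores used by the boost process over the test window.
const cpuQuery = `rate(process_cpu_seconds_total{job="boost"}[${testDurationSec}s])`;

// Bytes allocated by boost's Go runtime over the test window
// (a cumulative counter, hence increase() rather than avg_over_time()).
const memQuery = `increase(go_memstats_alloc_bytes_total{job="boost"}[${testDurationSec}s])`;

const cpuUrl = `${base}?query=${encodeURIComponent(cpuQuery)}`;
const memUrl = `${base}?query=${encodeURIComponent(memQuery)}`;
```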

The only way to actually make an HTTP call from k6 is to use its http library, which automagically records metrics toward the test results, so the Prometheus queries would count toward the metric values. The only way to isolate these requests from the test metrics is to put them in the k6 teardown lifecycle function, which only runs after all tests have completed, then use a hack to tag them differently in the summary, and finally extract the specially tagged metric from the summary instead of the default metric. An obvious problem with this is that we no longer get metrics on a per-test basis: longer-running, more concurrent tests will inflate the metric results and won't let us see usage per test. We could make the Prometheus query during the test, but then we'd have no way of isolating that HTTP request from the ones we're testing. Less importantly but still relevant, it messes with the summary-handling code in an ugly way, since we also have to extract those specially tagged metrics and provide them to the CSV output.
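The tagging hack looks roughly like this (a sketch rather than the exact branch code: the `source` tag name and the buildCsv helper are hypothetical, and the no-op threshold is the trick that makes k6 surface a tagged sub-metric in the summary at all):

```javascript
import http from 'k6/http';

export const options = {
  thresholds: {
    // A no-op threshold so the tagged sub-metric shows up in the summary.
    'http_req_duration{source:prometheus}': ['max>=0'],
  },
};

export function teardown() {
  // Tag the Prometheus query so it can be separated from test traffic...
  http.get('http://localhost:9090/api/v1/query?query=...', {
    tags: { source: 'prometheus' },
  });
}

export function handleSummary(data) {
  // ...then pull the specially tagged sub-metric back out of the summary
  // instead of the default metric, and keep it out of the CSV's test results.
  const promMetric = data.metrics['http_req_duration{source:prometheus}'];
  return { 'summary.csv': buildCsv(data, promMetric) }; // buildCsv: hypothetical helper
}
```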

An additional problem with this approach for shorter-running tests is that the Prometheus server scrapes the Boost metrics exporter endpoint every 15 seconds. Shorter-running range-request tests could miss out on those metrics entirely, since the test would complete before Prometheus even had time to scrape the results.

Either some compromises need to be made or another set of tools needs to be used to acquire these metrics. I've left the minimal amount of code I added on the feat/cpu-mem-stats branch in the event someone picks this back up and still finds my minor updates relevant.