elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.47k stars 8.04k forks source link

[SLO] Performance testing #179823

Open kdelemme opened 2 months ago

kdelemme commented 2 months ago

Summary

We should invest some time in testing the performance of our SLO solution. Especially:

elasticmachine commented 2 months ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

jasonrhodes commented 2 months ago

Let's have a chat with @cachedout and the robots team about how we can set up some kind of, at the very least, nightly perf test that runs SLOs in various scenarios and tracks timing trends, so we can be aware of (a) what our performance baseline is, generally, (b) how that improves over time with new features, and (c) when a new feature introduces a perf regression.

cachedout commented 2 months ago

@jasonrhodes Could you expand a little more on what you mean by the phrase "timing trends"? What event duration are you measuring? Are you thinking about the behavior of SLOs in isolation or their potential effect on Kibana performance at large?

jasonrhodes commented 2 months ago

@cachedout for the moment, we're thinking about SLOs in isolation. When users create SLOs, it kicks off a series of underlying transforms and Elasticsearch operations, and we currently have no visibility into whether changes to the SLO architecture introduce performance regressions in those flows. We also don't have a good idea of what the baseline performance is of these transforms, and whether we're improving them with any given change, either. But these kinds of things don't fit well within functional tests, likely, and would need to be run outside of the regular CI system, I suspect.

I can schedule a quick call if that would be helpful. Thanks!

cachedout commented 2 months ago

I'd like for you to work closely with @ablnk on this one. I'll let him read this issue and comment on questions he might have. We can still have the call if needed but let's see if we can make progress async before then.

ablnk commented 2 months ago

Hey @jasonrhodes, do I understand correctly that you're interested in tracking ingestion performance, i.e. the time from triggering an event to having it ingested (@timestamp to event.ingested)? Also, do you have any preferences on which SLI types to test?

This is how I would approach this task:

  1. Create various SLOs in a persistent environment (such as edge-lite or keepserverless-qa-oblt).
  2. Run an automated script once a day that runs API requests against several SLOs (filtering by slo.id) and returns the average difference between event.ingested and @timestamp (sketched below).
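A minimal sketch of what step 2 could look like, assuming the @elastic/elasticsearch JavaScript client, that both event.ingested and @timestamp are present on the queried documents, and illustrative names (ES_URL, ES_API_KEY, avgIngestionDelayMs) that are not part of the actual proposal:

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical connection details supplied via environment variables.
const client = new Client({
  node: process.env.ES_URL!,
  auth: { apiKey: process.env.ES_API_KEY! },
});

// Average ingestion delay (event.ingested - @timestamp) for one SLO's rollup data.
async function avgIngestionDelayMs(sloId: string): Promise<number | null> {
  const res = await client.search({
    index: '.slo-observability.sli-v3*',
    size: 0,
    query: { term: { 'slo.id': sloId } },
    aggs: {
      ingestion_delay: {
        avg: {
          script: {
            source:
              "doc['event.ingested'].value.toInstant().toEpochMilli() - doc['@timestamp'].value.toInstant().toEpochMilli()",
          },
        },
      },
    },
  });
  return (res.aggregations?.ingestion_delay as { value: number | null })?.value ?? null;
}
```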

Does it sound reasonable?

I'm a bit confused by slo.id, though. I just created a new SLO and got its id in the response, then filtered data by that id and got data from 2 days ago:

(screenshot)

Is this a bug or a feature?

jasonrhodes commented 2 months ago

do I understand it correctly that you are interested in tracking ingestion performance, i.e. time from triggering an event to having it ingested (@timestamp time to event.ingested time)

Hm, I'm not sure if that's exactly it. I think we want to make sure that the SLO transform system behaves in a similar way, performance-wise, across features and releases. "How long does it take for the various SLO transforms to run, given a stable dataset, from 'start' to 'finish'?" I'll need the SLO engineers to weigh in on the specifics of what that means, how many different transforms we need to test, etc. cc: @elastic/obs-ux-management-team

kdelemme commented 2 months ago

I think we would like to know how long it takes for the rollup data to be computed, for different sets of inputs. For example, let's say we have an index with 10M documents spread over a month, with a high-cardinality field (for example serverless.projectId). When we create an SLO based on that index for the past 30 days, the first transform will search and aggregate the data per serverless.projectId per minute for the past 30 days. It will take some time until the data is up to date, i.e. until the last @timestamp in the .slo-observability.sli-v3* indices equals the last @timestamp of the source data.

The time it takes for this process to complete is what we are interested in measuring. We could measure it by continuously polling the latest SLI document and comparing its timestamp to the last timestamp of the source data; when they match, we are done.
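As a rough illustration of that polling approach (not the actual implementation), assuming the @elastic/elasticsearch client and illustrative helper names:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ES_URL!, auth: { apiKey: process.env.ES_API_KEY! } });

// Latest @timestamp currently present in the rollup (SLI) indices for a given SLO.
async function latestSliTimestamp(sloId: string): Promise<string | undefined> {
  const res = await client.search({
    index: '.slo-observability.sli-v3*',
    size: 1,
    query: { term: { 'slo.id': sloId } },
    sort: '@timestamp:desc',
    _source: ['@timestamp'],
  });
  return (res.hits.hits[0]?._source as { '@timestamp'?: string } | undefined)?.['@timestamp'];
}

// Poll until the rollup data has caught up with the last source timestamp;
// returns the elapsed wall-clock time in milliseconds.
async function timeUntilCaughtUp(sloId: string, lastSourceTimestamp: string, pollMs = 5000): Promise<number> {
  const start = Date.now();
  for (;;) {
    const latest = await latestSliTimestamp(sloId);
    if (latest && Date.parse(latest) >= Date.parse(lastSourceTimestamp)) {
      return Date.now() - start;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```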

cachedout commented 2 months ago

Have we considered the use of the Transform Stats API for this purpose?
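For reference, a sketch of what a Transform Stats API call could look like with the @elastic/elasticsearch client; obtaining the transform id for a given SLO is discussed further down, so the id is taken as a parameter here:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ES_URL!, auth: { apiKey: process.env.ES_API_KEY! } });

// Returns processing stats for one transform, e.g. documents_processed,
// search_time_in_ms and index_time_in_ms, which could complement the
// end-to-end timing measured by the tests.
async function sloTransformStats(transformId: string) {
  const res = await client.transform.getTransformStats({ transform_id: transformId });
  return res.transforms[0]?.stats;
}
```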

ablnk commented 2 months ago

Just an update on where we're at. Here's a test for the sli.apm.transactionDuration and sli.apm.transactionErrorRate SLO types, and here's what it does:

  1. Creates SLOs.
  2. Polls the source index (metrics-apm*,apm-*) for a new document; once it appears, captures its @timestamp.
  3. Starts the countdown and polls the .slo-observability.sli-v3* indices until a document with the @timestamp captured in step 2 appears.
  4. Prints how long it took for a document with that @timestamp to appear in the .slo-observability.sli-v3* indices.
  5. Deletes the SLO.
[api] › api\slos.api.spec.ts:4:10 › SLO performance tests › sli.apm.transactionDuration
SLO "[Playwright Test] APM latency" has been created.
Waiting for the next document in the source index...
The last @timestamp of the source data: [ '2024-04-22T12:26:00.000Z' ]
slo.id: 0ce63943-071a-4696-b558-49bc0a85fd75
Waiting for the next document in the ".slo-observability.sli-v3*" indices...
SLO "[Playwright Test] APM latency" transforms took: 103260 ms.
Deleting SLO "[Playwright Test] APM latency"...
SLO "[Playwright Test] APM latency" has been deleted.

I plan to add more SLO types and create a GitHub Actions workflow that will trigger these tests once a day against observability test clusters, where we have the required data.

Have we considered the use of the Transform Stats API for this purpose?

Question is how do we get <transform_id> for a particular SLO?

ablnk commented 2 months ago

@kdelemme @jasonrhodes here is a workflow that:

  1. Runs performance tests against the keepserverless-qa-oblt environment once a day. The sli.apm.transactionDuration and sli.apm.transactionErrorRate tests use apm-soak datasets, specifically "service":"opbeans-go", "transactionName":"GET /api/customers".

  2. Parses the raw test report into a structure suitable for ES.

  3. Uploads the resulting reports to the same cluster where the tests are executed. I created a dashboard there, "[SLO] Transforms performance monitoring", where you can track test results: (screenshot)
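A rough sketch of what steps 2–3 could boil down to per test result, assuming the @elastic/elasticsearch client; the index name and document shape are illustrative, not the ones the workflow actually uses:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ES_URL!, auth: { apiKey: process.env.ES_API_KEY! } });

// One parsed test result, shaped for indexing into ES so a dashboard can chart the trend.
interface SloPerfResult {
  '@timestamp': string;         // when the test ran
  sloType: string;              // e.g. 'sli.apm.transactionDuration'
  environment: string;          // e.g. 'keepserverless-qa-oblt'
  transformDurationMs: number;  // time measured by the test
}

async function uploadResult(result: SloPerfResult) {
  await client.index({ index: 'slo-perf-results', document: result });
}
```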

I plan to add more SLO types and set up a workflow to run against stateful deployments.

ablnk commented 2 months ago

The test suite has been extended with the sli.histogram.custom SLO type. Here's a workflow for a stateful deployment (edge-lite-oblt) and a dashboard where you can track results.

(screenshot)

jasonrhodes commented 1 month ago

@kdelemme can you take a look at what @ablnk has put together here and weigh in on if this is roughly what you had in mind? Thanks!

kdelemme commented 1 month ago

@ablnk This looks promising! If we could add the time it takes to process the rollup data into the summary data that would be fantastic!

The overall flow looks like: source data -> rollup data .slo-observability.sli-v3* -> summary data .slo-observability.summary-v3*

Right now, we have the time from source to rollup data, which is a really great start, but having the time from rollup to summary would give the full picture. I think it would be interesting to have the two intermediate times: T(source -> rollup) and T(rollup -> summary) so when the total increases, we can find which transform is responsible for the increase.
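Purely to spell out that decomposition (the names are illustrative): given the wall-clock times at which the test sees the data in the source, rollup, and summary indices, the two intermediate durations add up to the total:

```ts
// T(source -> rollup) + T(rollup -> summary) = total transform time for the SLO pipeline.
function transformDurations(sourceSeenAt: number, rollupSeenAt: number, summarySeenAt: number) {
  const sourceToRollupMs = rollupSeenAt - sourceSeenAt;
  const rollupToSummaryMs = summarySeenAt - rollupSeenAt;
  return { sourceToRollupMs, rollupToSummaryMs, totalMs: sourceToRollupMs + rollupToSummaryMs };
}
```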

Question is how do we get <transform_id> for a particular SLO?

Knowing the SLO id, you can derive the rollup transform id as slo-{slo.id}-{slo.revision}, where revision is 1 when the SLO is first created. Similarly, you can derive the summary transform id as slo-summary-{slo.id}-{slo.revision}.
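Based on that naming convention, the transform ids can be derived directly from the SLO id and revision, for example (sketch):

```ts
// Rollup transform id for an SLO: slo-{slo.id}-{slo.revision}
function rollupTransformId(sloId: string, revision = 1): string {
  return `slo-${sloId}-${revision}`;
}

// Summary transform id for an SLO: slo-summary-{slo.id}-{slo.revision}
function summaryTransformId(sloId: string, revision = 1): string {
  return `slo-summary-${sloId}-${revision}`;
}

// Those ids can then be passed to the Transform Stats API, e.g.:
// GET _transform/slo-0ce63943-071a-4696-b558-49bc0a85fd75-1/_stats
```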

ablnk commented 1 month ago

@kdelemme I've updated the test code; we now have the time from source to rollup, from source to summary, and from rollup to summary:

(screenshot)

On stateful, it runs on the edge-lite environment once a day; here's the GH Actions workflow. On serverless, it runs on keepserverless-qa-oblt, though there has been some instability there lately and tests may fail because of it; here's the GH Actions workflow.

@jasonrhodes @kdelemme please check when you have time.

ablnk commented 2 weeks ago

@jasonrhodes JFYI, I'm having trouble creating SLOs of a certain SLI type: https://github.com/elastic/kibana/issues/185792

jasonrhodes commented 2 weeks ago

Thanks for reporting, @ablnk -- I've applied the right labels and projects there and we'll look at it soon.