kdelemme opened 2 months ago
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
Let's have a chat with @cachedout and the robots team about how we can set up some kind of, at the very least, nightly perf test that runs SLOs in various scenarios and tracks timing trends, so we can be aware of (a) what our performance baseline is, generally, (b) how that improves over time with new features, and (c) when a new feature introduces a perf regression.
@jasonrhodes Could you expand a little bit more what you mean by the phrase "timing trends"? What event duration are you measuring? Are you thinking about the behavior of SLOs in isolation or their potential effect of Kibana performance at large?
@cachedout for the moment, we're thinking about SLOs in isolation. When users create SLOs, it kicks off a series of underlying transforms and Elasticsearch operations, and we currently have no visibility into whether changes to the SLO architecture introduce performance regressions in those flows. We also don't have a good idea of what the baseline performance is of these transforms, and whether we're improving them with any given change, either. But these kinds of things don't fit well within functional tests, likely, and would need to be run outside of the regular CI system, I suspect.
I can schedule a quick call if that would be helpful. Thanks!
I'd like for you to work closely with @ablnk on this one. I'll let him read this issue and comment on questions he might have. We can still have the call if needed but let's see if we can make progress async before then.
Hey @jasonrhodes, do I understand it correctly that you are interested in tracking ingestion performance, i.e. the time from triggering an event to having it ingested (`@timestamp` time to `event.ingested` time)?

Also, any preferences on the SLI type to test?
This is how I would approach this task: filter the SLI data by the SLO's id (`slo.id`) and return the average of the difference between `event.ingested` and `@timestamp`. Does that sound reasonable?
I have been confused by `slo.id`, though. I've just created a new SLO and got its `id` in the response. Then I filtered the data by that `id` and got data from 2 days ago:

Is this a bug or a feature?
> do I understand it correctly that you are interested in tracking ingestion performance, i.e. time from triggering an event to having it ingested (`@timestamp` time to `event.ingested` time)
Hm, I'm not sure if that's exactly it. I think we want to make sure that the SLO transform system behaves in a similar way, performance-wise, across features and releases. "How long does it take for the various SLO transforms to run, given a stable dataset, from 'start' to 'finish'?" I'll need the SLO engineers to weigh in on the specifics of what that means, how many different transforms we need to test, etc. cc: @elastic/obs-ux-management-team
I think we would like to know how long it takes for the rollup data to be computed, for different set of inputs.
For example, let's say we have an index with 10M documents spread over a month with a high-cardinality field (for example a `serverless.projectId`). When we create an SLO based on that index for the past 30 days, the first transform will search and aggregate the data per `serverless.projectId` per minute for the past 30 days. It will take some time until the data is up to date, i.e. until the last `@timestamp` in the `.slo-observability.sli-v3*` indices equals the last `@timestamp` of the source data.
The time it takes for this process to complete is what we are interested in measuring. We could measure it by continuously polling the latest SLI document and comparing its timestamp to the last timestamp of the source data. When they match, we are done.
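A minimal sketch of that polling loop, with the actual Elasticsearch query abstracted behind a callback (`getLatestSliTimestamp` is a hypothetical name, not part of any existing test code):

```typescript
// Sketch only: the ES query for the latest SLI document is injected as a
// callback so the timing logic stands alone. Timestamps are ISO-8601 strings,
// which compare correctly as plain strings within the same format.
async function measureCatchUpMs(
  sourceTimestamp: string,
  getLatestSliTimestamp: () => Promise<string | undefined>,
  pollIntervalMs = 1000,
  timeoutMs = 10 * 60 * 1000
): Promise<number> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    const latest = await getLatestSliTimestamp();
    // Done once the rollup data has caught up with the source data.
    if (latest !== undefined && latest >= sourceTimestamp) {
      return Date.now() - start;
    }
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
  throw new Error(`SLI data did not catch up within ${timeoutMs} ms`);
}
```

In a real test, the callback would run a `max` aggregation on `@timestamp` against the `.slo-observability.sli-v3*` indices.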
Have we considered the use of the Transform Stats API for this purpose?
Just an update on where we're at. Here's a test for the `sli.apm.transactionDuration` and `apm.transactionErrorRate` SLO types and what it does:

1. Creates an SLO.
2. Watches the source index (`metrics-apm*,apm-*`) for a new document. Once it appears, captures its `@timestamp`.
3. Polls the `.slo-observability.sli-v3*` indices until a new document with the `@timestamp` captured in step 2 appears.
4. Reports how long it took for the data to show up in the `.slo-observability.sli-v3*` indices.

```
[api] › api\slos.api.spec.ts:4:10 › SLO performance tests › sli.apm.transactionDuration
SLO "[Playwright Test] APM latency" has been created.
Waiting for the next document in the source index...
The last @timestamp of the source data: [ '2024-04-22T12:26:00.000Z' ]
slo.id: 0ce63943-071a-4696-b558-49bc0a85fd75
Waiting for the next document in the ".slo-observability.sli-v3*" indices...
SLO "[Playwright Test] APM latency" transforms took: 103260 ms.
Deleting SLO "[Playwright Test] APM latency"...
SLO "[Playwright Test] APM latency" has been deleted.
```
I plan to add more SLO types and create a GitHub Actions workflow that will trigger these tests once a day against the observability test clusters, where we have the required data.
> Have we considered the use of the Transform Stats API for this purpose?

The question is how do we get the `<transform_id>` for a particular SLO?
@kdelemme @jasonrhodes here is a workflow that:

- Runs performance tests against the keepserverless-qa-oblt environment once a day. The `sli.apm.transactionDuration` and `sli.apm.transactionErrorRate` tests utilize the apm-soak datasets; to be more precise, documents with:

  ```
  "service": "opbeans-go",
  "transactionName": "GET /api/customers"
  ```

- Parses the raw test report into a structure suitable for ES.
- Uploads the resulting reports to the same cluster where the tests are executed.

I created a dashboard "[SLO] Transforms performance monitoring" there, where you can track test results:
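For the "parses the raw test report" step, the mapping might look something like this (the field names here are assumptions for illustration, not the actual report schema):

```typescript
// Hypothetical report schema: flatten one raw test result into a document
// that can be indexed into ES and aggregated on by the dashboard.
interface RawResult {
  title: string;
  status: "passed" | "failed";
  durationMs: number; // time the SLO transforms took, as logged by the test
}

interface EsReportDoc {
  "@timestamp": string;
  test: { title: string; status: string };
  transforms_took_ms: number;
}

function toEsReportDoc(raw: RawResult, ranAt: Date): EsReportDoc {
  return {
    "@timestamp": ranAt.toISOString(),
    test: { title: raw.title, status: raw.status },
    transforms_took_ms: raw.durationMs,
  };
}
```

Keeping the timing in a dedicated numeric field (`transforms_took_ms` here) is what lets the dashboard chart trends and alert on regressions.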
I plan to add more SLO types and set up a workflow to run against stateful deployments.
The test suite has been extended with the `sli.histogram.custom` SLO type, and here's a workflow for a stateful deployment (edge-lite-oblt) and a dashboard where you can track results.
@kdelemme can you take a look at what @ablnk has put together here and weigh in on if this is roughly what you had in mind? Thanks!
@ablnk This looks promising! If we could add the time it takes to process the rollup data into the summary data, that would be fantastic!
The overall flow looks like:

```
source data -> rollup data (.slo-observability.sli-v3*) -> summary data (.slo-observability.summary-v3*)
```
Right now, we have the time from source to rollup data, which is a really great start, but having the time from rollup to summary would give the full picture. I think it would be interesting to have the two intermediate times: T(source -> rollup) and T(rollup -> summary) so when the total increases, we can find which transform is responsible for the increase.
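Given the wall-clock times at which the same logical document was observed in each layer, splitting the total into those two legs is simple arithmetic (a sketch; the observation mechanism itself is the polling described earlier in the thread):

```typescript
// Split the total transform delay into its two legs so a regression can be
// attributed to either the rollup or the summary transform.
interface PipelineTimings {
  sourceToRollupMs: number;
  rollupToSummaryMs: number;
  totalMs: number;
}

function splitPipelineTimings(
  sourceSeenAt: Date,
  rollupSeenAt: Date,
  summarySeenAt: Date
): PipelineTimings {
  const sourceToRollupMs = rollupSeenAt.getTime() - sourceSeenAt.getTime();
  const rollupToSummaryMs = summarySeenAt.getTime() - rollupSeenAt.getTime();
  return {
    sourceToRollupMs,
    rollupToSummaryMs,
    totalMs: sourceToRollupMs + rollupToSummaryMs,
  };
}
```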
> Question is how do we get the `<transform_id>` for a particular SLO?

Knowing the SLO id, you can derive the rollup transform id as `slo-{slo.id}-{slo.revision}`, where the revision is 1 when the SLO is created the first time. Similarly, you can derive the summary transform id as `slo-summary-{slo.id}-{slo.revision}`.
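Following that naming scheme, deriving both ids is a one-liner each; the derived ids can then be fed to the Transform Stats API (`GET _transform/<transform_id>/_stats`):

```typescript
// Derive the transform ids from an SLO, per the naming scheme above.
interface SloRef {
  id: string;
  revision: number; // 1 on first creation, bumped when the SLO is updated
}

function rollupTransformId(slo: SloRef): string {
  return `slo-${slo.id}-${slo.revision}`;
}

function summaryTransformId(slo: SloRef): string {
  return `slo-summary-${slo.id}-${slo.revision}`;
}
```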
@kdelemme I've updated the test code; now we have the time from source to rollup, from source to summary, and from rollup to summary:
On stateful, it runs on the edge-lite environment once a day; here's the GH Actions workflow. On serverless, it runs on keepserverless-qa-oblt, though there has been some instability lately and tests may fail due to that; here's the GH Actions workflow.
@jasonrhodes @kdelemme please check when you have time.
@jasonrhodes JFYI I'm having trouble creating SLOs of a certain SLI type: https://github.com/elastic/kibana/issues/185792
Thanks for reporting, @ablnk -- I've applied the right labels and projects there and we'll look at it soon.
Summary
We should invest some time in testing the performance of our SLO solution. Especially: