Open pmuellr opened 1 month ago
/ci
Taking `qaf` for a spin ...
$ qaf rac alert-load \
--rule-count 200 \
--rule-interval 1m \
--run-minutes 10 \
--percent-firing 0 \
--es-url https://keepkibana-pr-180485-elasticsearch-ea83fb.es.eu-west-1.aws.qa.elastic.cloud \
--kibana-url https://keepkibana-pr-180485-elasticsearch-ea83fb.kb.eu-west-1.aws.qa.elastic.cloud \
--username testing-internal \
--password [secret-here]
That command creates 200 rules, runs them for 10 minutes, and then produces some Dashboards showing some stats. Nice! The TM stats don't seem to be there; I'm guessing that's because the TM health report is not really available for serverless. I'm going to take a look at the event log directly instead ...
/ci
I removed the cluster / project auto-deployments - they're hard to control, I figure using the custom images will be good enough.
/ci
/ci
I changed the default task claimer to the one implemented here, to make it easy to test in cloud without overrides.
Interesting to see the FT failures - I thought there would be more!
First time trying on serverless. This is the count of "stale" messages the claimer found from its search, after the mget, with three background instances. Curious that only two report stale entries; seems like a timing thing. I would guess that over time, since we randomize the interval a tiny bit, these "stale" finds will migrate across the Kibanas. Will be fun to see over a longer time frame.
The graph below is per second. Super interesting that the number of stales is usually around 10, which also makes sense: the second Kibana found the same 10 the previous Kibana found, but the previous one claimed them, so the mget marked them stale. I guess as we scale up, we may see 20 as a number, or maybe chaos kinda takes over at that point.
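To make the "same 10" reasoning concrete, here is a toy model of two Kibanas racing on the same search results. It is only a sketch: `ToyTaskStore`, `claimCycle`, the `version` field, and the page size of 10 are all made up for illustration and are not taken from the task manager code. It assumes the claimer compares the version it saw in the search with the version the mget returns, and counts a mismatch as stale.

```typescript
// Toy model only: none of these names come from the Kibana code base.
interface TaskDoc {
  id: string;
  version: number; // bumped every time the doc is claimed
  claimedBy?: string;
}

class ToyTaskStore {
  private docs: Record<string, TaskDoc> = {};

  constructor(count: number) {
    for (let i = 0; i < count; i++) {
      const id = 'task-' + i;
      this.docs[id] = { id, version: 1 };
    }
  }

  // Stand-in for the search: copies of the first `limit` unclaimed docs.
  search(limit: number): TaskDoc[] {
    return Object.keys(this.docs)
      .map((id) => this.docs[id])
      .filter((doc) => doc.claimedBy === undefined)
      .slice(0, limit)
      .map((doc) => ({ ...doc }));
  }

  // Stand-in for the mget: current copies of the requested docs.
  mget(ids: string[]): TaskDoc[] {
    return ids.map((id) => ({ ...this.docs[id] }));
  }

  claim(id: string, owner: string): void {
    const doc = this.docs[id];
    doc.version += 1;
    doc.claimedBy = owner;
  }
}

// One claim cycle: re-fetch the search hits, skip any whose version moved.
function claimCycle(
  store: ToyTaskStore,
  owner: string,
  searchHits: TaskDoc[]
): { claimed: number; stale: number } {
  const current = store.mget(searchHits.map((hit) => hit.id));
  let claimed = 0;
  let stale = 0;
  current.forEach((doc, i) => {
    if (doc.version !== searchHits[i].version) {
      stale += 1; // someone else claimed it between our search and our mget
      return;
    }
    store.claim(doc.id, owner);
    claimed += 1;
  });
  return { claimed, stale };
}

const store = new ToyTaskStore(200);
// Both Kibanas search before either one claims, so they see the same 10 docs.
const hitsA = store.search(10);
const hitsB = store.search(10);
const resultA = claimCycle(store, 'kibana-a', hitsA);
const resultB = claimCycle(store, 'kibana-b', hitsB);
console.log(resultA, resultB);
```

In this model the first Kibana claims all 10 and the second one sees all 10 as stale, which matches the "usually around 10" in the graph. With more Kibanas overlapping on the same page of results, each late one would report its own 10.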
/ci
FYI: in commit e2a9b4a6e246fe3afe07cb603273faeab0e0f013 I changed the default strategy back to `default`, so it's no longer the mget one. In the subsequent commit caa444fbcaa3169e94dfa92e9a74926c3a999af9 the value to use in the config changed from `mget` to `unsafe_mget`. The config property remains the same, `xpack.task_manager.claim_strategy`.
This means we'll need to use an override to set the claimer, which isn't hard for ESS, but I believe requires per-project overrides, so kind of a pain. Not sure there's an easier way to do this ...
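For reference, going by the property name and the post-rename value mentioned in this comment, the override would presumably be a single line in `kibana.yml`. This is a sketch based on this thread only, not checked against the final merged config schema:

```yaml
# Assumed from this thread: selects the mget-based claimer
# instead of the default strategy.
xpack.task_manager.claim_strategy: unsafe_mget
```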
/ci
/ci
@elasticmachine merge upstream
/ci
docker.elastic.co/kibana-ci/kibana-serverless:pr-180485-1476a2c0c602
The Canvas "shareable runtime" is a bundle produced to enable running Canvas workpads outside of Kibana. This bundle is included in third-party webpages that embed Canvas and therefore should be as slim as possible.
id | before | after | diff |
---|---|---|---|
module count | - | 5405 | +5405 |
total size | - | 8.8MB | +8.8MB |
To update your PR or re-run it, just comment with:
@elasticmachine merge upstream
resolves: https://github.com/elastic/kibana/issues/181325
Summary
Adds a new task claiming strategy `mget`, which can be used instead of the default one `default`. Add the following to your `kibana.yml` to enable it: TODO
Deferred TODOs
To Verify
A command-line tool is available in `x-pack/plugins/task_manager/server/manual_tests/get_rule_run_event_logs.js`. It pulls rule run execution documents from multiple clusters at once and augments them with info about workers, idle time, etc. The idea is to do an A/B test of the new task claimer vs the default, then see how the runs compare.