elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[ResponseOps] implement task claiming strategy mget #180485

Open pmuellr opened 1 month ago

pmuellr commented 1 month ago

resolves: https://github.com/elastic/kibana/issues/181325

Summary

Adds a new task claiming strategy, mget, which can be used instead of the default strategy, default. To enable it, add the following to your kibana.yml:

xpack.task_manager.claim_strategy: 'mget'

TODO

Deferred TODOs

To Verify

A command-line tool is available in x-pack/plugins/task_manager/server/manual_tests/get_rule_run_event_logs.js. It pulls rule run execution documents from multiple clusters at once and augments them with some extra info regarding workers, idle time, etc. The idea is to run an A/B test of the new task claimer against the default one, then see how the runs compare.

pmuellr commented 2 weeks ago

/ci

pmuellr commented 1 week ago

Taking qaf for a spin ...

$ qaf rac alert-load \
--rule-count     200 \
--rule-interval  1m \
--run-minutes    10 \
--percent-firing  0 \
--es-url         https://keepkibana-pr-180485-elasticsearch-ea83fb.es.eu-west-1.aws.qa.elastic.cloud \
--kibana-url     https://keepkibana-pr-180485-elasticsearch-ea83fb.kb.eu-west-1.aws.qa.elastic.cloud \
--username       testing-internal \
--password       [secret-here]

That command creates 200 rules, runs them for 10 minutes, and then produces some dashboards showing some stats. Nice! The TM stats don't seem to be there; I'm guessing that's because the TM health report is not really available for serverless. I'm going to take a look at the event log directly instead ...

pmuellr commented 1 week ago

/ci

pmuellr commented 1 week ago

I removed the cluster / project auto-deployments - they're hard to control, and I figure using the custom images will be good enough.

pmuellr commented 1 week ago

/ci

pmuellr commented 1 week ago

/ci

pmuellr commented 1 week ago

I changed the default task claimer to the one implemented here, to make it easy to test in cloud without overrides.

Interesting to see the FT failures - I thought there would be more!

pmuellr commented 1 week ago

[image]

First time trying on serverless. This is the count of "stale" messages the claimer found from its search, after the mget, with three background instances. Curious that only two report stale entries; seems like a timing thing. I would guess over time, since we randomize the interval a tiny bit, that these "stale" finds will migrate across the Kibanas. Will be fun to see over a longer time frame.

The graph below is per second. Super interesting that the number of stales is usually around 10, which also makes sense: the second Kibana found the same 10 the previous Kibana found, but the previous one had already claimed them, so the mget marked them stale. I guess as we scale up, we may see 20 as a number, or maybe chaos kinda takes over at that point.

Screenshot 2024-05-08 at 10 08 24 AM

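
The "same 10" effect described above can be reproduced with a small in-memory model. This is only a sketch under stated assumptions, not the code in this PR: FakeTaskStore, claimCycle, the version field, and the capacity of 10 are all hypothetical names and values chosen for illustration. The model assumes search reads a snapshot that lags behind writes until a refresh, while mget always reads the current document.

```typescript
// Simplified in-memory model of the stale-detection behavior described above.
// All names here (FakeTaskStore, claimCycle, TaskDoc fields) are hypothetical.

interface TaskDoc {
  id: string;
  version: number;
  owner: string | null;
}

interface ClaimResult {
  claimed: string[];
  stale: string[];
}

class FakeTaskStore {
  // Current documents: what mget and versioned updates see.
  private live: Record<string, TaskDoc> = {};
  // Snapshot as of the last refresh: what search sees.
  private searchable: Record<string, TaskDoc> = {};

  constructor(ids: string[]) {
    for (const id of ids) {
      this.live[id] = { id, version: 1, owner: null };
    }
    this.refresh();
  }

  // Copy the current documents into the search snapshot.
  refresh(): void {
    this.searchable = {};
    for (const id of Object.keys(this.live)) {
      this.searchable[id] = { ...this.live[id] };
    }
  }

  // Search may return out-of-date copies of tasks that look unclaimed.
  searchUnclaimed(size: number): TaskDoc[] {
    return Object.keys(this.searchable)
      .map((id) => this.searchable[id])
      .filter((doc) => doc.owner === null)
      .slice(0, size);
  }

  // mget returns the current copy of each requested document.
  mget(ids: string[]): Array<TaskDoc | undefined> {
    return ids.map((id) => (this.live[id] ? { ...this.live[id] } : undefined));
  }

  // Versioned update: succeeds only if the caller saw the current version.
  claim(id: string, seenVersion: number, owner: string): boolean {
    const doc = this.live[id];
    if (!doc || doc.version !== seenVersion) {
      return false;
    }
    this.live[id] = { id, version: doc.version + 1, owner };
    return true;
  }
}

// One claim cycle: search for candidates, mget them, and mark as stale any
// candidate whose current version differs from the one the search returned.
function claimCycle(store: FakeTaskStore, owner: string, capacity: number): ClaimResult {
  const candidates = store.searchUnclaimed(capacity);
  const current = store.mget(candidates.map((c) => c.id));
  const result: ClaimResult = { claimed: [], stale: [] };
  candidates.forEach((candidate, i) => {
    const fresh = current[i];
    if (!fresh || fresh.version !== candidate.version) {
      result.stale.push(candidate.id);
      return;
    }
    if (store.claim(fresh.id, fresh.version, owner)) {
      result.claimed.push(fresh.id);
    }
  });
  return result;
}

const ids: string[] = [];
for (let i = 0; i < 30; i++) {
  ids.push(`task-${i}`);
}
const store = new FakeTaskStore(ids);

// Two Kibanas poll back to back, before the search snapshot refreshes.
const first = claimCycle(store, 'kibana-a', 10);
const second = claimCycle(store, 'kibana-b', 10);

console.log(first.claimed.length, first.stale.length);   // 10 0
console.log(second.claimed.length, second.stale.length); // 0 10
```

In this model the second instance sees exactly one capacity's worth of stale tasks, because its search returns the same candidates the first instance just claimed. Calling store.refresh() before the second cycle makes the stale count drop to zero, which is consistent with the guess that the effect is a timing matter.
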
pmuellr commented 1 week ago

/ci

pmuellr commented 5 days ago

FYI: in commit e2a9b4a6e246fe3afe07cb603273faeab0e0f013 I changed the default strategy back to default, so it is no longer the mget one. In the subsequent commit caa444fbcaa3169e94dfa92e9a74926c3a999af9 the value to use in the config changed from mget to unsafe_mget. The config property remains the same, xpack.task_manager.claim_strategy.

This means we'll need to use an override to set the claimer. That isn't hard for ESS, but I believe serverless requires per-project overrides, so it's kind of a pain. Not sure there's an easier way to do this ...

pmuellr commented 5 days ago

/ci

pmuellr commented 4 days ago

/ci

pmuellr commented 3 days ago

@elasticmachine merge upstream

pmuellr commented 3 days ago

/ci

kibana-ci commented 3 days ago

:broken_heart: Build Failed

Failed CI Steps

Test Failures

Metrics [docs]

Canvas Sharable Runtime

The Canvas "shareable runtime" is a bundle produced to enable running Canvas workpads outside of Kibana. This bundle is included in third-party webpages that embed Canvas and therefore should be as slim as possible.

| id | before | after | diff |
| --- | --- | --- | --- |
| module count | - | 5405 | +5405 |
| total size | - | 8.8MB | +8.8MB |

History

To update your PR or re-run it, just comment with: @elasticmachine merge upstream