elastic / kibana

[ResponseOps] implement task claiming strategy mget #180485

Open pmuellr opened 1 month ago

pmuellr commented 1 month ago

resolves: https://github.com/elastic/kibana/issues/181325

Summary

Adds a new task claiming strategy, mget, which can be used instead of the default strategy, default. To enable it, add the following to your kibana.yml:

xpack.task_manager.claim_strategy: 'mget'

TODO

[x] change config strategy name to include "unsafe" or similar at beginning
[ ] figure out story with task types to skip (not sure anything special needed here)
[x] fix function tests (they were only breaking when mget was the default)
[ ] ???

Deferred TODOs

[ ] complete jest and function tests
[ ] trying bulk updating more non-stale tasks if conflicts in bulk update

To Verify

A command-line tool is available in x-pack/plugins/task_manager/server/manual_tests/get_rule_run_event_logs.js. It pulls rule run execution documents from multiple clusters at once and augments them with extra info about workers, idle time, etc. The idea is to run an A/B test of the new task claimer against the default one, then compare the runs.

pmuellr commented 2 weeks ago

/ci

pmuellr commented 1 week ago

Taking qaf for a spin ...

$ qaf rac alert-load \
--rule-count     200 \
--rule-interval  1m \
--run-minutes    10 \
--percent-firing  0 \
--es-url         https://keepkibana-pr-180485-elasticsearch-ea83fb.es.eu-west-1.aws.qa.elastic.cloud \
--kibana-url     https://keepkibana-pr-180485-elasticsearch-ea83fb.kb.eu-west-1.aws.qa.elastic.cloud \
--username       testing-internal \
--password       [secret-here]

That command will create 200 rules for 10m, and then produce some Dashboards showing some stats. Nice! The TM stats don't seem to be there, guessing that's because the TM health report is not really available for serverless. I'm going to take a look at the event log directly instead ...

pmuellr commented 1 week ago

/ci

pmuellr commented 1 week ago

I removed the cluster / project auto-deployments - they're hard to control, I figure using the custom images will be good enough.

pmuellr commented 1 week ago

/ci

pmuellr commented 1 week ago

/ci

pmuellr commented 1 week ago

I changed the default task claimer to the one implemented here, to make it easy to test in cloud without overrides.

Interesting to see the FT failures - I thought there would be more!

pmuellr commented 1 week ago

First time trying on serverless. This is the count of "stale" messages the claimer found from its search, after the mget, with three background instances. Curious that only two report stale entries; seems like a timing thing. I would guess over time, since we randomize the interval a tiny bit, that these "stale" finds will migrate across the Kibanas. Will be fun to see over a longer time frame.

The graph below is per second. Super interesting that the number of stales is usually around 10. Which also makes sense. The second Kibana found the same 10 the previous Kibana found, but the previous one claimed them so the mget marked them stale. I guess as we scale up, we may see 20 as a number, or maybe chaos kinda takes over at that point.

[Screenshot, 2024-05-08 10:08:24 AM: graph of "stale" counts per second]
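To make the "stale" counts above concrete, here is a minimal sketch of the search, then mget, then bulk update flow. It is not the code in this PR: FakeTaskStore, TaskDoc, claimWithMget, and the version field are all hypothetical stand-ins, and the in-memory store only imitates a search result that lags behind the real documents.

```typescript
// Hypothetical task document; `version` stands in for whatever the real
// claimer compares between the search hit and the mget result.
interface TaskDoc {
  id: string;
  version: number;
  ownerId: string | null;
}

// Hypothetical in-memory store, NOT Kibana's task store.
class FakeTaskStore {
  private docs: Record<string, TaskDoc> = {};

  constructor(tasks: TaskDoc[]) {
    for (const t of tasks) this.docs[t.id] = { ...t };
  }

  // "search": returns copies of unclaimed tasks as seen at call time.
  search(limit: number): TaskDoc[] {
    return Object.keys(this.docs)
      .map((id) => ({ ...this.docs[id] }))
      .filter((d) => d.ownerId === null)
      .slice(0, limit);
  }

  // "mget": returns the current state of each requested document.
  mget(ids: string[]): TaskDoc[] {
    return ids.filter((id) => id in this.docs).map((id) => ({ ...this.docs[id] }));
  }

  // "bulk update": claims a task only if its version is unchanged.
  bulkClaim(tasks: TaskDoc[], ownerId: string): string[] {
    const claimed: string[] = [];
    for (const t of tasks) {
      const cur = this.docs[t.id];
      if (cur && cur.version === t.version) {
        this.docs[t.id] = { id: cur.id, version: cur.version + 1, ownerId };
        claimed.push(t.id);
      }
    }
    return claimed;
  }
}

// Re-fetch the search candidates with mget; anything that changed since the
// search is counted as stale and skipped, the rest is claimed.
function claimWithMget(
  store: FakeTaskStore,
  candidates: TaskDoc[],
  ownerId: string
): { claimed: string[]; stale: string[] } {
  const currentById: Record<string, TaskDoc> = {};
  for (const d of store.mget(candidates.map((c) => c.id))) currentById[d.id] = d;

  const stale: string[] = [];
  const fresh: TaskDoc[] = [];
  for (const c of candidates) {
    const cur = currentById[c.id];
    if (cur === undefined || cur.version !== c.version) stale.push(c.id);
    else fresh.push(cur);
  }
  return { claimed: store.bulkClaim(fresh, ownerId), stale };
}

// Two Kibanas search before either one claims, so both see the same 10 tasks.
const tasks: TaskDoc[] = [];
for (let i = 0; i < 10; i++) tasks.push({ id: `task-${i}`, version: 1, ownerId: null });
const store = new FakeTaskStore(tasks);

const seenByA = store.search(10);
const seenByB = store.search(10);
const resultA = claimWithMget(store, seenByA, "kibana-a");
const resultB = claimWithMget(store, seenByB, "kibana-b");

console.log(resultA.claimed.length, resultA.stale.length);
console.log(resultB.claimed.length, resultB.stale.length);
```

In this toy run the first Kibana claims all 10 tasks and the second one finds the same 10 stale after its mget, which is the pattern described above.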

pmuellr commented 1 week ago

/ci

pmuellr commented 5 days ago

FYI: in commit e2a9b4a6e246fe3afe07cb603273faeab0e0f013 I changed the default strategy back to default, so the mget one is no longer the default. In the subsequent commit caa444fbcaa3169e94dfa92e9a74926c3a999af9, the value to use in the config changed from mget to unsafe_mget. The config property remains the same: xpack.task_manager.claim_strategy.

This means we'll need to use an override to set the claimer, which isn't hard for ESS, but I believe serverless requires per-project overrides, so kind of a pain. Not sure there's an easier way to do this ...
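For anyone writing an override, a rough sketch of how the two values mentioned above could be validated. The function name parseClaimStrategy is hypothetical, and rejecting the old mget value is an assumption, not something confirmed in this PR.

```typescript
// The two strategy values named in this thread.
type ClaimStrategy = "default" | "unsafe_mget";

// Hypothetical helper: resolves the raw xpack.task_manager.claim_strategy
// value, falling back to "default" when the setting is absent.
function parseClaimStrategy(value: string | undefined): ClaimStrategy {
  if (value === undefined) return "default";
  if (value === "default" || value === "unsafe_mget") return value;
  // Assumption: anything else, including the pre-rename "mget", is rejected.
  throw new Error(`unknown xpack.task_manager.claim_strategy: ${value}`);
}

console.log(parseClaimStrategy(undefined));
console.log(parseClaimStrategy("unsafe_mget"));
```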

pmuellr commented 5 days ago

/ci

pmuellr commented 4 days ago

/ci

pmuellr commented 3 days ago

@elasticmachine merge upstream

pmuellr commented 3 days ago

/ci

kibana-ci commented 3 days ago

:broken_heart: Build Failed

Buildkite Build
Commit: 1476a2c0c602ce450852ed4d435b51bd67a1e1bb
Interpreting CI Failures
Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-180485-1476a2c0c602

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #6 / Alerts - Group 3 - schedule circuit breaker alerts getScheduleFrequency no_kibana_privileges at space1 should get the total and remaining schedule frequency
[job] [logs] Jest Tests #13 / getScheduleFrequency() should handle empty bucket correctly
[job] [logs] Jest Tests #13 / getScheduleFrequency() should handle malformed schedule interval correctly
[job] [logs] Jest Tests #13 / getScheduleFrequency() should not go below 0 for remaining schedules
[job] [logs] Jest Tests #13 / getScheduleFrequency() should return the correct schedule frequency results
[job] [logs] FTR Configs #2 / lens app - group 6 lens reporting PNG report should be able to download report of the current visualization
[job] [logs] Jest Tests #13 / validateScheduleLimit should return interval if the previous interval was modified to exceed the limit
[job] [logs] Jest Tests #13 / validateScheduleLimit should return interval if the updated interval exceeds limits

Metrics [docs]

Canvas Sharable Runtime

The Canvas "shareable runtime" is a bundle produced to enable running Canvas workpads outside of Kibana. This bundle is included in third-party webpages that embed Canvas and therefore should be as slim as possible.

id	before	after	diff
`module count`	-	5405	+5405
`total size`	-	8.8MB	+8.8MB

History

:broken_heart: Build #210004 failed 11b028c3115c11e9a0bcbf311e76f1e2c791db15
:broken_heart: Build #209683 failed 04baedb48f6115d0f03d3d79375fc1a9e4031433
:broken_heart: Build #209103 failed 612950a89d447fe7c05e3e34bee0a573af88efd4
:broken_heart: Build #208549 failed d4b127ec6b8d6ed120a512d23e7c0aec7865a27a

To update your PR or re-run it, just comment with: @elasticmachine merge upstream