Open kobelb opened 1 year ago
Pinging @elastic/response-ops (Team:ResponseOps)
I'll close a similar issue in favour of this one => https://github.com/elastic/kibana/issues/117513.
Noticed this https://github.com/elastic/kibana/issues/163519 while working on draft PR: https://github.com/elastic/kibana/pull/163453
We saw a problem with reporting tests in mki that this would probably improve. We regularly see test failures like this (build) where reports take too long to complete.
They take too long because the first attempt fails while Kibana is shutting down.
Here are the logs from the build:
Timestamp | Message | Instance |
---|---|---|
Jun 7, 2024 @ 06:34:06.000 | Starting pre-stop sleep of 120s... | kb-background-tasks-kb-7f8fb45494-4b8ks |
Jun 7, 2024 @ 06:34:59.627 | Kibana is now available | kb-background-tasks-kb-7f8fb45494-lf5qh |
Jun 7, 2024 @ 06:35:57.345 | (First attempt will fail) Scheduled csv_searchsource reporting task. Task ID: task:4f617cae-34f0-44f0-8ac9-afefd4cc97ba. Report ID: 3f658618-bc56-41b7-8196-bf25327c95fc | kb-ui-kb-6449c556b4-clp9n |
Jun 7, 2024 @ 06:36:06.698 | SIGTERM received - initiating shutdown | kb-background-tasks-kb-7f8fb45494-4b8ks |
Jun 7, 2024 @ 06:36:12.100 | Saving execution error for csv_searchsource job 3f658618-bc56-41b7-8196-bf25327c95fc: ReportingError(code: kibana_shutting_down_error) | kb-background-tasks-kb-7f8fb45494-4b8ks |
Jun 7, 2024 @ 06:40:32.679 | (2nd attempt after 5 minutes) Claiming csv_searchsource 3f658618-bc56-41b7-8196-bf25327c95fc [_index: .ds-.kibana-reporting-2024.06.07-000001] [_seq_no: 37] [_primary_term: 1] [attempts: 1] [process_expiration: 2024-06-07T06:44:32.678Z] | kb-background-tasks-kb-7f8fb45494-lf5qh |
To mitigate this, we're increasing the test timeout to 10 minutes so that the tests pass if the report is generated on the 2nd attempt.
Feature Description
When Kubernetes needs to terminate a pod (for example, when scaling pods up or down, or when releasing a new version), it sends the pod a SIGTERM to request that it exit as soon as possible, waits for the termination grace period (30 seconds by default), and then sends a SIGKILL to force the pod to stop. When a Kibana node that runs background tasks receives a SIGTERM, it should immediately stop claiming new tasks and exit as soon as all in-progress tasks have stopped.
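To make the requested behaviour concrete, here is a minimal sketch of the drain logic, not Kibana's actual Task Manager code. `DrainingTaskRunner` and all of its method names are hypothetical and exist only for illustration. It assumes every in-progress task eventually reports completion.

```typescript
// Hypothetical sketch: none of these names come from Kibana's Task Manager.
type ShutdownState = 'running' | 'draining' | 'stopped';

class DrainingTaskRunner {
  private state: ShutdownState = 'running';
  private readonly inProgress = new Set<string>();
  private onDrained: (() => void) | undefined;

  /** Claim a task unless shutdown has begun. Returns false when the claim is refused. */
  claim(taskId: string): boolean {
    if (this.state !== 'running') {
      return false;
    }
    this.inProgress.add(taskId);
    return true;
  }

  /** Mark a task as finished (success or failure); this may complete a pending drain. */
  complete(taskId: string): void {
    this.inProgress.delete(taskId);
    this.finishIfDrained();
  }

  /** Intended to be called from a SIGTERM handler: stop claiming, then report once drained. */
  beginShutdown(onDrained: () => void): void {
    if (this.state !== 'running') {
      return; // ignore repeated signals
    }
    this.state = 'draining';
    this.onDrained = onDrained;
    this.finishIfDrained();
  }

  getState(): ShutdownState {
    return this.state;
  }

  /** Invoke the drain callback exactly once, when no tasks remain in progress. */
  private finishIfDrained(): void {
    if (this.state === 'draining' && this.inProgress.size === 0) {
      this.state = 'stopped';
      if (this.onDrained) {
        this.onDrained();
      }
    }
  }
}
```

In a Node.js process this could be wired up with something like `process.on('SIGTERM', () => runner.beginShutdown(() => process.exit(0)))`. A real implementation would also need to finish draining before the SIGKILL arrives, for example by cancelling or rescheduling tasks that cannot complete within the grace period.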
Business Value
Facilitates zero-downtime rolling upgrades and rollbacks, allowing us to roll out new features to our users more quickly while they continue to use the system without disruption. Additionally, it enables autoscaling Kibana's background task nodes based on their utilization. Customers will no longer need to manually size their Kibana nodes (less effort), resources will be used more efficiently (decreased COGS), and tasks will run with less delay (decreased MTTD, MTTR, etc.).
Definition of Done