elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Respond to graceful shutdown signals #160329

Open kobelb opened 1 year ago

kobelb commented 1 year ago

Feature Description

When Kubernetes needs to terminate a pod (for example, when scaling pods up or down or releasing a new version), it sends the pod a SIGTERM to request that it exit as soon as possible, waits for the termination grace period (30 seconds by default), and then sends a SIGKILL to force the pod to stop. When a Kibana node that runs background tasks receives a SIGTERM, it should immediately stop claiming new tasks and exit as soon as all in-progress tasks have stopped.
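To make the requested behavior concrete, here is a minimal TypeScript sketch of "stop claiming on SIGTERM, exit once in-progress tasks finish". It is not Kibana's Task Manager code: `DrainingTaskPool`, `SignalSource`, and `installSigtermHandler` are hypothetical names invented for illustration, and the sketch assumes tasks report completion by calling `complete()`.

```typescript
// Hypothetical sketch: none of these names come from Kibana's Task Manager.

/** Minimal slice of Node's `process` that the handler needs (keeps the sketch testable). */
interface SignalSource {
  once(event: 'SIGTERM', listener: () => void): unknown;
  exit(code?: number): void;
}

class DrainingTaskPool {
  private inFlight = new Set<string>();
  private shuttingDown = false;
  private onDrained: (() => void) | null = null;

  /** Try to claim a task; refused once shutdown has begun. */
  claim(taskId: string): boolean {
    if (this.shuttingDown) return false;
    this.inFlight.add(taskId);
    return true;
  }

  /** Mark a task finished; fires the drain callback if it was the last one. */
  complete(taskId: string): void {
    this.inFlight.delete(taskId);
    this.notifyIfDrained();
  }

  /** Stop claiming and call `onDrained` once no tasks remain in progress. */
  beginShutdown(onDrained: () => void): void {
    this.shuttingDown = true;
    this.onDrained = onDrained;
    this.notifyIfDrained();
  }

  private notifyIfDrained(): void {
    if (this.shuttingDown && this.inFlight.size === 0 && this.onDrained) {
      const callback = this.onDrained;
      this.onDrained = null; // fire at most once
      callback();
    }
  }
}

/** On SIGTERM: stop claiming immediately, exit as soon as the pool has drained. */
function installSigtermHandler(proc: SignalSource, pool: DrainingTaskPool): void {
  proc.once('SIGTERM', () => {
    pool.beginShutdown(() => proc.exit(0));
  });
}
```

In a real Node.js process the handler would presumably be installed with `installSigtermHandler(process, pool)`. The sketch deliberately leaves out a hard deadline; a production version would also need to give up before Kubernetes sends the SIGKILL.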

Business Value

Facilitates zero-downtime rolling upgrades and rollbacks, allowing us to roll out new features to our users more quickly while they continue to use the system without disruption. Additionally, it enables autoscaling Kibana's background task nodes based on their utilization. Customers will no longer need to manually size their Kibana nodes (less effort), resources will be used more efficiently (decreased COGS), and tasks will run with less delay (decreased MTTD, MTTR, etc.).

Definition of Done

elasticmachine commented 1 year ago

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote commented 1 year ago

I'll close a similar issue in favour of this one => https://github.com/elastic/kibana/issues/117513.

ymao1 commented 1 year ago

Noticed this https://github.com/elastic/kibana/issues/163519 while working on draft PR: https://github.com/elastic/kibana/pull/163453

Dosant commented 3 months ago

We saw a problem with reporting tests in mki that this could probably improve. We regularly see test failures like this (build) where reports take too long to complete.

The report takes too long to complete because the first attempt fails while Kibana is shutting down.

Here are the logs from the build:

| Timestamp | Message | Instance |
| --- | --- | --- |
| Jun 7, 2024 @ 06:34:06.000 | Starting pre-stop sleep of 120s... | kb-background-tasks-kb-7f8fb45494-4b8ks |
| Jun 7, 2024 @ 06:34:59.627 | Kibana is now available | kb-background-tasks-kb-7f8fb45494-lf5qh |
| Jun 7, 2024 @ 06:35:57.345 | (First attempt will fail) Scheduled csv_searchsource reporting task. Task ID: task:4f617cae-34f0-44f0-8ac9-afefd4cc97ba. Report ID: 3f658618-bc56-41b7-8196-bf25327c95fc | kb-ui-kb-6449c556b4-clp9n |
| Jun 7, 2024 @ 06:36:06.698 | SIGTERM received - initiating shutdown | kb-background-tasks-kb-7f8fb45494-4b8ks |
| Jun 7, 2024 @ 06:36:12.100 | Saving execution error for csv_searchsource job 3f658618-bc56-41b7-8196-bf25327c95fc: ReportingError(code: kibana_shutting_down_error) | kb-background-tasks-kb-7f8fb45494-4b8ks |
| Jun 7, 2024 @ 06:40:32.679 | (2nd attempt after 5 minutes) Claiming csv_searchsource 3f658618-bc56-41b7-8196-bf25327c95fc [_index: .ds-.kibana-reporting-2024.06.07-000001] [_seq_no: 37] [_primary_term: 1] [attempts: 1] [process_expiration: 2024-06-07T06:44:32.678Z] | kb-background-tasks-kb-7f8fb45494-lf5qh |

To mitigate this, we're increasing the test timeout to 10 minutes so that the tests pass if the report is generated on the second attempt.