elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Dead cluster after updating `xpack.notification.slack.default_account` to an account that does not exist #115298

Open romain-chanu opened 4 days ago

romain-chanu commented 4 days ago

Elasticsearch Version

8.15.3

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

This has been observed in the field and the problem is reproducible.

Updating xpack.notification.slack.default_account (cf. Slack Notification Settings) to an account that does not exist, via the cluster update settings API, leads to a dead cluster (see the steps to reproduce below).

It is questionable whether this should be a dynamic setting at all, since the documentation states:

You can no longer configure Slack accounts using elasticsearch.yml settings. Please use Elasticsearch’s secure [keystore](https://www.elastic.co/guide/en/elasticsearch/reference/current/secure-settings.html) method instead.

Note also that in the example below, the default account value contains a space. We could not find any workaround to recover from this situation (AFAIK, it is impossible to configure the account in the elasticsearch.yml file, or to define the secure URL in the keystore, when the account name contains a space).
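Until the server rejects such updates itself, a client-side guard can catch the mistake before the request is sent. The following is only a sketch under stated assumptions: it assumes you can obtain the list of keystore setting names (for example from the output of `elasticsearch-keystore list`), and that Slack accounts are defined through keys of the form `xpack.notification.slack.account.<name>.secure_url`. The helper names and the `monitoring` account in the usage note are hypothetical.

```python
import re

# Key shape for Slack account secure settings (assumed from the Watcher
# Slack notification settings documentation).
_ACCOUNT_KEY = re.compile(
    r"^xpack\.notification\.slack\.account\.(?P<name>.+)\.secure_url$"
)

DEFAULT_ACCOUNT_SETTING = "xpack.notification.slack.default_account"


def slack_account_names(keystore_keys):
    """Return the set of Slack account names found among keystore setting names."""
    names = set()
    for key in keystore_keys:
        match = _ACCOUNT_KEY.match(key.strip())
        if match:
            names.add(match.group("name"))
    return names


def default_account_settings_body(default_account, keystore_keys):
    """Build the PUT _cluster/settings body, refusing unknown account names.

    Hypothetical helper: it only builds the request body and never contacts
    a cluster.
    """
    known = slack_account_names(keystore_keys)
    if default_account not in known:
        raise ValueError(
            f"Slack account [{default_account}] is not defined in the keystore; "
            f"known accounts: {sorted(known)}"
        )
    return {"persistent": {DEFAULT_ACCOUNT_SETTING: default_account}}
```

With keystore keys `keystore.seed` and `xpack.notification.slack.account.monitoring.secure_url`, the helper returns a body for `monitoring` and raises `ValueError` for `Slack Alerts`, so the bad request would never reach the cluster.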

Steps to Reproduce

  1. Create a deployment in ESS with 2 availability zones (2AZ) for the hot data and content tier

  2. Run the request below and notice that it is acknowledged successfully:

PUT _cluster/settings
{
    "persistent": {
        "xpack.notification.slack.default_account": "Slack Alerts"
    }
}
  3. Run GET _cluster/settings and observe that xpack.notification.slack.default_account is not in the result

  4. Run the request below to reset the setting:

PUT _cluster/settings
{
    "persistent": {
        "xpack.notification.slack.default_account": ""
    }
}

and observe the error below:

{
  "error": {
    "root_cause": [
      {
        "type": "not_master_exception",
        "reason": "no longer master"
      }
    ],
    "type": "master_not_discovered_exception",
    "reason": "org.elasticsearch.cluster.NotMasterException: no longer master",
    "caused_by": {
      "type": "not_master_exception",
      "reason": "no longer master"
    }
  },
  "status": 503
}
  5. Check the logs and observe that:

a) The master node keeps changing (cf. the "master node changed" event logs)

b) All nodes report a similar log message:

[tiebreaker-0000000002] failed to apply settings
org.elasticsearch.common.settings.SettingsException: could not find default account [Slack Alerts]
    at org.elasticsearch.xpack.watcher.notification.NotificationService.findDefaultAccountOrNull(NotificationService.java:178) ~[?:?]
    at org.elasticsearch.xpack.watcher.notification.NotificationService.buildAccounts(NotificationService.java:107) ~[?:?]
    at org.elasticsearch.xpack.watcher.notification.NotificationService.clusterSettingsConsumer(NotificationService.java:77) ~[?:?]
    at org.elasticsearch.common.settings.Setting$2.apply(Setting.java:850) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.common.settings.Setting$2.apply(Setting.java:822) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:654) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:174) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:498) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:156) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217) ~[elasticsearch-8.15.3.jar:?]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183) ~[elasticsearch-8.15.3.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.lang.Thread.run(Thread.java:1570) ~[?:?]
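The trace shows the exception being thrown from the settings consumer while the committed cluster state is applied, not while the update request is validated. A toy Python model of that behaviour is sketched below. It is not the Elasticsearch implementation: the function names only echo the frames in the trace, the node names are illustrative, and the fallback to the first account when no default is set is an assumption.

```python
class SettingsException(Exception):
    """Stand-in for org.elasticsearch.common.settings.SettingsException."""


DEFAULT_ACCOUNT_SETTING = "xpack.notification.slack.default_account"


def find_default_account_or_none(accounts, default_name):
    """Toy version of the lookup named in the stack trace."""
    if not default_name:
        # Assumption: with no default configured, use the first account, if any.
        return next(iter(accounts.values()), None)
    if default_name not in accounts:
        raise SettingsException(f"could not find default account [{default_name}]")
    return accounts[default_name]


def apply_committed_settings(accounts_by_node, settings):
    """Run the same apply-time check on every node; collect failures per node.

    The settings are treated as already committed, so a failure here cannot
    reject the update. It can only be reported.
    """
    failures = {}
    for node, accounts in accounts_by_node.items():
        try:
            find_default_account_or_none(
                accounts, settings.get(DEFAULT_ACCOUNT_SETTING)
            )
        except SettingsException as exc:
            failures[node] = str(exc)
    return failures
```

In this model, a default account that no node knows about fails identically on every node, which is consistent with all nodes logging the same message. Running the same lookup before the update is committed would instead let the API reject the request.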

Logs (if relevant)

No response

elasticsearchmachine commented 4 days ago

Pinging @elastic/es-data-management (Team:Data Management)