elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Missing repository breaks Stack monitoring #103752

Closed psanz-estc closed 2 weeks ago

psanz-estc commented 11 months ago

Elasticsearch Version

8.11

Installed Plugins

No response

Java Version

bundled

OS Version

Rocky Linux 8.8

Problem Description

ES nodes were not listed in the Nodes tab of Elasticsearch monitoring (collected with Metricbeat).


There wasn't any evident error until we checked the node_stats API call, which returned:

      {
        "type": "failed_node_exception",
        "reason": "Failed node [LpMeXt8VR66qc6kVsBGA0w]",
        "node_id": "LpMeXt8VR66qc6kVsBGA0w",
        "caused_by": {
          "type": "repository_exception",
          "reason": "[ESRepo] repository type [fs] failed to create on current node",
          "caused_by": {
            "type": "repository_exception",
            "reason": "[ESRepo] failed to create repository",
            "caused_by": {
              "type": "repository_exception",
              "reason": "[ESRepo] location [\\\\node01.net\\ESRepo] doesn't match any of the locations specified by path.repo because this setting is empty"
            }
          }
        }
      }

The ESRepo repository shouldn't have been there, and it seems it was causing the node_stats call to fail and, as a side effect, the Elastic Stack monitoring page to show an empty node list.

As soon as we removed it under "Stack Management \ Snapshot and Restore \ Repositories", the nodes showed up in the "Nodes" tab immediately.

Steps to Reproduce

  1. Create/configure a FS repo with an empty path.repo defined.
  2. Check the node stats API.
  3. The API call returns a failed_node_exception with the reason "doesn't match any of the locations specified by path.repo because this setting is empty".
  4. Elastic Stack monitoring won't show any of the nodes in the Nodes tab.

Logs (if relevant)

No response

elasticsearchmachine commented 11 months ago

Pinging @elastic/es-distributed (Team:Distributed)

volodk85 commented 10 months ago

@psanz-estc may I know how you are getting this?

Create/configure a FS repo with an empty path.repo defined

According to the code, the fs repository location is resolved against the path.repo setting regardless of whether an absolute or a relative path is specified. This means that to define an fs repo you have to set a correct path.repo in the first place.
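
The resolution rule described above can be illustrated with a short Python sketch. This is not the Elasticsearch implementation (which is Java); the function name `resolve_repo_location` and the POSIX-style paths are assumptions made only for illustration. It shows why an empty path.repo can never match any location, absolute or relative.

```python
import posixpath
from typing import Optional, Sequence


def resolve_repo_location(location: str, path_repo: Sequence[str]) -> Optional[str]:
    """Hypothetical helper: resolve `location` against each path.repo root.

    Returns the normalized path if it falls under one of the roots,
    otherwise None. With an empty `path_repo` the loop body never runs,
    so every location is rejected.
    """
    for root in path_repo:
        root_norm = posixpath.normpath(root)
        # join() keeps an absolute `location` as-is and appends a relative one.
        candidate = posixpath.normpath(posixpath.join(root_norm, location))
        prefix = root_norm.rstrip("/") + "/"
        if candidate == root_norm or candidate.startswith(prefix):
            return candidate
    return None
```

Under these assumptions, `resolve_repo_location("fs-repository", [])` returns `None`, which corresponds to the "doesn't match any of the locations specified by path.repo because this setting is empty" case in the report.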

ywangd commented 10 months ago

Yeah, we need a non-empty path.repo to create a repository. However, the error can happen when a node restarts with the path.repo setting removed, or when a new node without the path.repo setting joins the cluster. I wonder whether the original report is either of these cases?

ywangd commented 3 weeks ago

@psanz-estc I took another look at this issue. I think this is more of a problem for the "Elastic Stack monitoring page" than for Elasticsearch itself. Assuming the "Elastic Stack monitoring page" uses the NodesStats API for its UI, it should reflect the fact that there are node-level failures in the API's response instead of silently ignoring them. The response contains information about:

  1. Number of total nodes
  2. Number of successful nodes
  3. Number of failed nodes with their corresponding node IDs and failures

An example is as follows:

{
  "_nodes": {
    "total": 1,
    "successful": 0,
    "failed": 1,
    "failures": [
      {
        "type": "failed_node_exception",
        "reason": "Failed node [GWR8kxDlSqy2D-SyuKKXPA]",
        "node_id": "GWR8kxDlSqy2D-SyuKKXPA",
        "caused_by": {
          "type": "repository_exception",
          "reason": "[my_fs_repository] repository type [fs] failed to create on current node",
          "caused_by": {
            "type": "repository_exception",
            "reason": "[my_fs_repository] failed to create repository",
            "caused_by": {
              "type": "repository_exception",
              "reason": "[my_fs_repository] location [fs-repository] doesn't match any of the locations specified by path.repo because this setting is empty"
            }
          }
        }
      }
    ]
  },
  "cluster_name": "runTask",
  "nodes": {}
}

It contains enough information for the "Elastic Stack monitoring page" to indicate that a node is failing to respond. Alternatively, the monitoring page can request only the metrics it is interested in from the NodesStats API, which avoids checking repositories when that is not necessary. Hence I think the Elasticsearch side works as intended. I suggest that you follow this up with the team that owns the monitoring page. I plan to close this issue if you are OK with it.
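
As a rough illustration of both suggestions, here is a minimal Python sketch of what a consumer of the node stats response could do. The helper names (`node_stats_path`, `root_cause`, `summarize_node_failures`) are hypothetical, and this is not how the monitoring page is actually implemented; it only assumes the response shape shown in this thread and the `/_nodes/stats/<metric>` path form of the API.

```python
from typing import Any, Dict, Iterable, List


def node_stats_path(metrics: Iterable[str] = ()) -> str:
    """Build a node stats path limited to the given metric groups.

    Hypothetical helper: e.g. ("jvm", "os") gives "/_nodes/stats/jvm,os".
    """
    wanted = ",".join(metrics)
    return f"/_nodes/stats/{wanted}" if wanted else "/_nodes/stats"


def root_cause(failure: Dict[str, Any]) -> str:
    """Follow the nested "caused_by" chain and return the innermost reason."""
    current = failure
    while isinstance(current.get("caused_by"), dict):
        current = current["caused_by"]
    return current.get("reason", "unknown")


def summarize_node_failures(response: Dict[str, Any]) -> List[str]:
    """Turn the "_nodes" header into one warning line per failed node."""
    header = response.get("_nodes", {})
    warnings = []
    for failure in header.get("failures", []):
        node_id = failure.get("node_id", "unknown")
        warnings.append(f"node {node_id}: {root_cause(failure)}")
    failed = header.get("failed", 0)
    if failed and not warnings:
        # The counter says nodes failed but no details were provided.
        warnings.append(f"{failed} node(s) failed without details")
    return warnings
```

A UI built this way could show the returned warnings next to the node list instead of rendering an empty "nodes" object with no explanation.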

PS: We can have a separate discussion on whether one metric failure should fail the entire NodesStats response. But that is a very different topic: it is about how we report the error, not whether the error is reported. It also does not really help the monitoring page by itself. I'd argue it could make things worse, because everything would appear to be all right while the underlying exception goes unnoticed. At least the current situation makes you notice that some nodes are missing.

ywangd commented 2 weeks ago

Closing this issue as detailed in the above message. Thanks for reporting.