Open n0othing opened 2 years ago
Pinging @elastic/es-data-management (Team:Data Management)
This could be an interesting indicator for the new health API. Eventually the API will be able to report on master node connectivity issues (among other things). I could see there being a node-to-node-connectivity indicator of some sort that ensures that transport connections to all other nodes are functional and remain so over X period of time.
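To make the "remain so over X period of time" idea concrete, here is a minimal sketch of how such an indicator could aggregate probe results. It is not Elasticsearch code: the `ConnectivityWindow` class, its `record` and `status` methods, and the green/yellow/red mapping are all hypothetical names and choices made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ConnectivityWindow:
    """Hypothetical tracker that keeps the last `size` probe results per peer."""
    size: int = 5
    _results: dict = field(default_factory=dict)

    def record(self, peer: str, ok: bool) -> None:
        # Each peer gets a bounded deque, so old probe results fall out of the window.
        self._results.setdefault(peer, deque(maxlen=self.size)).append(ok)

    def status(self) -> str:
        if not self._results:
            return "unknown"
        # red: some peer has had no successful probe within its window
        if any(not any(window) for window in self._results.values()):
            return "red"
        # green: every probe still in every window succeeded
        if all(all(window) for window in self._results.values()):
            return "green"
        # yellow: intermittent failures, but every peer was reachable at least once
        return "yellow"
```

A peer that has never answered within the window maps to red, while a peer with mixed results maps to yellow, which roughly separates the permanent and intermittent cases.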
To make a case for this in the health API, any problem we check for should ideally be resolvable with advice that can be produced from within Elasticsearch. The certificate problem you mention is a good example: "Fix your trust settings, here's a general troubleshooting guide". Things become more nebulous when connections fail due to strange network issues. These might indicate a health problem, but there's little we can advise in those situations. I'm also not sure whether we track connection faults to other nodes anywhere in the transport layer.
I'm also not entirely sure if it's possible to determine a clean set of impacts for a cluster that is experiencing intermittent or permanent network partitioning other than to say "write availability for the cluster is degraded".
Elasticsearch Version
Version: 8.2.0, Build: default/tar/b174af62e8dd9f4ac4d25875e9381ffe2b9282c5/2022-04-20T10:35:10.180408517Z, JVM: 18
Installed Plugins
No response
Java Version
bundled
OS Version
21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 arm64
Problem Description
This was originally observed on a cluster that was scaled from a single node to three nodes. If two data nodes aren't able to connect to one another, but are able to connect to the elected master node, we'll see confusing behavior:
The logs on the two segmented nodes help explain what's going on, but it'd be nice if this behavior could be avoided via safeguards or surfaced via health APIs in some way.
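As a rough illustration of what a safeguard or health check would need to detect, the sketch below flags pairs of nodes that can both reach the elected master but not each other. The `find_partial_partitions` function and the `reachable` map (node name to the set of nodes it can connect to) are hypothetical; how Elasticsearch would actually collect that information is not covered here.

```python
from itertools import combinations


def find_partial_partitions(master, reachable):
    """Return pairs of non-master nodes that both see the master but not each other.

    `reachable` is a hypothetical map: node name -> set of nodes it can connect to.
    """
    isolated_pairs = []
    other_nodes = sorted(n for n in reachable if n != master)
    for a, b in combinations(other_nodes, 2):
        both_see_master = master in reachable[a] and master in reachable[b]
        see_each_other = b in reachable[a] and a in reachable[b]
        if both_see_master and not see_each_other:
            isolated_pairs.append((a, b))
    return isolated_pairs


# The scenario from this report: node-2 and node-3 only trust the master, node-1.
reachable = {
    "node-1": {"node-2", "node-3"},
    "node-2": {"node-1"},
    "node-3": {"node-1"},
}
print(find_partial_partitions("node-1", reachable))
```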
Steps to Reproduce
1.) Create self-signed certificates for each node
2.) Configure three nodes so that two of them don't trust each other:
3.) Start the nodes and observe the strange allocation issue:
Logs (if relevant)