elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Node to node connectivity issues aren't easily surfaced by APIs #87129

Open n0othing opened 2 years ago

n0othing commented 2 years ago

Elasticsearch Version

Version: 8.2.0, Build: default/tar/b174af62e8dd9f4ac4d25875e9381ffe2b9282c5/2022-04-20T10:35:10.180408517Z, JVM: 18

Installed Plugins

No response

Java Version

bundled

OS Version

21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 arm64

Problem Description

This was originally observed on a cluster that was scaled from a single node to three nodes. If two data nodes aren't able to connect to one another, but are both able to connect to the elected master node, we see confusing behavior: the cluster reports all three nodes as members, yet shard recoveries between the two segmented nodes stall and retry indefinitely, leaving the cluster stuck in yellow health.

The logs on the two segmented nodes help explain what's going on, but it'd be nice if this behavior could be avoided via safeguards or surfaced via health APIs in some way.
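
For reference, the closest signal today seems to be the per-node transport statistics, which only report connection counts and give no indication of which peers a node cannot reach, so the segmentation described above is easy to miss:

GET _nodes/stats/transport?filter_path=nodes.*.name,nodes.*.transport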

Steps to Reproduce

1.) Create self-signed certificates for each node:

bin/elasticsearch-certutil cert --self-signed --name node-a --pem
bin/elasticsearch-certutil cert --self-signed --name node-b --pem
bin/elasticsearch-certutil cert --self-signed --name node-c --pem
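
Each invocation writes a zip archive containing the certificate and key in PEM format; the file and directory names below are illustrative (certutil prints the actual path it wrote), and each node's pair needs to end up in that node's config directory so the settings in the next step can reference them:

unzip node-a.zip -d node-a-certs   # illustrative archive name
cp node-a-certs/node-a/node-a.crt node-a-certs/node-a/node-a.key $ES_PATH_CONF/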

2.) Configure three nodes so that two of them (node-b and node-c) don't trust each other:

# node-a
node.name: node-a
http.port: 9200
transport.port: 9300
cluster.initial_master_nodes: ["127.0.0.1:9300"]
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301", "127.0.0.1:9302" ]
ingest.geoip.downloader.enabled: false

xpack.security.enabled: true
xpack.license.self_generated.type: trial
xpack.security.http.ssl.enabled: false
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.key: "node-a.key"
xpack.security.transport.ssl.certificate: "node-a.crt"
xpack.security.transport.ssl.certificate_authorities: ["node-a.crt", "node-b.crt", "node-c.crt"]
xpack.security.transport.ssl.verification_mode: certificate

# node-b
node.name: node-b
http.port: 9201
transport.port: 9301
cluster.initial_master_nodes: ["127.0.0.1:9300"]
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301", "127.0.0.1:9302" ]
ingest.geoip.downloader.enabled: false

xpack.security.enabled: true
xpack.license.self_generated.type: trial
xpack.security.http.ssl.enabled: false
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.key: "node-b.key"
xpack.security.transport.ssl.certificate: "node-b.crt"
xpack.security.transport.ssl.certificate_authorities: ["node-a.crt", "node-b.crt"]
xpack.security.transport.ssl.verification_mode: certificate

# node-c
node.name: node-c
http.port: 9202
transport.port: 9302
cluster.initial_master_nodes: ["127.0.0.1:9300"]
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301", "127.0.0.1:9302" ]
ingest.geoip.downloader.enabled: false

xpack.security.enabled: true
xpack.license.self_generated.type: trial
xpack.security.http.ssl.enabled: false
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.key: "node-c.key"
xpack.security.transport.ssl.certificate: "node-c.crt"
xpack.security.transport.ssl.certificate_authorities: ["node-a.crt", "node-c.crt"]
xpack.security.transport.ssl.verification_mode: certificate
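
Once the nodes are started (step 3), the broken pair can also be confirmed out of band with openssl. This is only a sketch: the combined trust-list file name is made up and the exact verify message varies by OpenSSL version, but node-b's transport certificate should fail verification against node-c's trust list while node-a's passes:

cat node-a.crt node-c.crt > node-c-trust.pem
openssl s_client -connect 127.0.0.1:9301 -CAfile node-c-trust.pem </dev/null 2>/dev/null | grep "Verify return code"
# expect a non-zero verify code for node-b on 9301; the same command against node-a on 9300 reports "0 (ok)"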

3.) Start the nodes and observe the strange allocation behavior: replicas allocated to node-c never finish initializing because peer recovery from node-b keeps failing.

GET _cluster/health
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 9,
  "active_shards" : 15,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 83.33333333333334
}
GET _cat/nodes?v
ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
127.0.0.1            6         100  29    4.19                  cdfhilmrstw -      node-b
127.0.0.1           38         100  29    4.19                  cdfhilmrstw -      node-c
127.0.0.1           19         100  29    4.19                  cdfhilmrstw *      node-a
GET _cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
     6        2.7mb   100.1gb    826.1gb    926.3gb           10 127.0.0.1 127.0.0.1 node-a
     6        2.4mb   100.1gb    826.1gb    926.3gb           10 127.0.0.1 127.0.0.1 node-b
     5      325.9kb   100.1gb    826.1gb    926.3gb           10 127.0.0.1 127.0.0.1 node-c
     1                                                                               UNASSIGNED
GET _cat/indices?health=yellow&v&expand_wildcards=all
health status index                                                         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .kibana_task_manager_8.2.0_001                                mImiA0JIQKKocD_XBN37Tw   1   1         23          450    126.5kb        126.5kb
yellow open   .ds-.logs-deprecation.elasticsearch-default-2022.05.25-000001 IzxS3fADSVuQtb6XI9vW2Q   1   1          2            0     23.2kb         23.2kb
yellow open   .kibana_security_session_1                                    2HquvqlBQvauFS7dQH0r-Q   1   1          1            0      5.6kb          5.6kb
GET _cluster/allocation/explain
{
  "index": ".kibana_task_manager_8.2.0_001",
  "shard": 0,
  "primary": false
}
{
  "index" : ".kibana_task_manager_8.2.0_001",
  "shard" : 0,
  "primary" : false,
  "current_state" : "initializing",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2022-05-25T15:17:04.187Z",
    "last_allocation_status" : "no_attempt"
  },
  "current_node" : {
    "id" : "_cmcunZmSduawI_Kv4Uk6Q",
    "name" : "node-c",
    "transport_address" : "127.0.0.1:9302",
    "attributes" : {
      "ml.machine_memory" : "34359738368",
      "xpack.installed" : "true",
      "ml.max_jvm_size" : "4294967296"
    }
  },
  "explanation" : "the shard is in the process of initializing on node [node-c], wait until initialization has completed"
}
GET _cat/recovery?active_only&v&expand_wildcards=all
#! this request accesses system indices: [.apm-agent-configuration, .apm-custom-link, .kibana_8.2.0_001, .kibana_security_session_1, .kibana_task_manager_8.2.0_001, .security-7], but in a future major version, direct access to system indices will be prevented by default
index                          shard time  type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
.kibana_security_session_1     0     11.1m peer index 127.0.0.1   node-b      127.0.0.1   node-c      n/a        n/a      0     0               0.0%          0           0b    0b              0.0%          0b          -1           0                      -1.0%
.kibana_task_manager_8.2.0_001 0     12.1m peer index 127.0.0.1   node-b      127.0.0.1   node-c      n/a        n/a      0     0               0.0%          0           0b    0b              0.0%          0b          -1           0                      -1.0%

Logs (if relevant)

[2022-05-25T12:07:02,616][INFO ][o.e.i.r.PeerRecoveryTargetService] [node-c] recovery of [.kibana_security_session_1][0] from [{node-b}{WEM6VTYhTmS40DAYMhk-Xg}{lg_1BGpmTx-N993vZ23OIQ}{127.0.0.1}{127.0.0.1:9301}{cdfhilmrstw}{ml.machine_memory=34359738368, ml.max_jvm_size=4294967296, xpack.installed=true}] interrupted by network disconnect, will retry in [5s]; cause: [[node-b][127.0.0.1:9301] Node not connected]
[2022-05-25T12:07:02,616][INFO ][o.e.i.r.PeerRecoveryTargetService] [node-c] recovery of [.kibana_task_manager_8.2.0_001][0] from [{node-b}{WEM6VTYhTmS40DAYMhk-Xg}{lg_1BGpmTx-N993vZ23OIQ}{127.0.0.1}{127.0.0.1:9301}{cdfhilmrstw}{ml.machine_memory=34359738368, ml.max_jvm_size=4294967296, xpack.installed=true}] interrupted by network disconnect, will retry in [5s]; cause: [[node-b][127.0.0.1:9301] Node not connected]
[2022-05-25T12:07:06,642][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58322, profile=default}
[2022-05-25T12:07:06,643][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58323, profile=default}
[2022-05-25T12:07:06,643][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58318, profile=default}
[2022-05-25T12:07:06,643][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58321, profile=default}
[2022-05-25T12:07:06,644][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58319, profile=default}
[2022-05-25T12:07:06,644][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58324, profile=default}
[2022-05-25T12:07:06,645][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58317, profile=default}
[2022-05-25T12:07:06,645][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58320, profile=default}
[2022-05-25T12:07:06,654][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58315, profile=default}
[2022-05-25T12:07:06,658][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58312, profile=default}
[2022-05-25T12:07:06,662][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-c] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9302, remoteAddress=/127.0.0.1:58316, profile=default}
[2022-05-25T12:07:07,633][INFO ][o.e.i.r.PeerRecoveryTargetService] [node-c] recovery of [.kibana_security_session_1][0] from [{node-b}{WEM6VTYhTmS40DAYMhk-Xg}{lg_1BGpmTx-N993vZ23OIQ}{127.0.0.1}{127.0.0.1:9301}{cdfhilmrstw}{ml.machine_memory=34359738368, ml.max_jvm_size=4294967296, xpack.installed=true}] interrupted by network disconnect, will retry in [5s]; cause: [[node-b][127.0.0.1:9301] Node not connected]
[2022-05-25T12:07:07,634][INFO ][o.e.i.r.PeerRecoveryTargetService] [node-c] recovery of [.kibana_task_manager_8.2.0_001][0] from [{node-b}{WEM6VTYhTmS40DAYMhk-Xg}{lg_1BGpmTx-N993vZ23OIQ}{127.0.0.1}{127.0.0.1:9301}{cdfhilmrstw}{ml.machine_memory=34359738368, ml.max_jvm_size=4294967296, xpack.installed=true}] interrupted by network disconnect, will retry in [5s]; cause: [[node-b][127.0.0.1:9301] Node not connected]
[2022-05-25T12:07:09,265][WARN ][o.e.c.s.DiagnosticTrustManager] [node-c] failed to establish trust with server at [<unknown host>]; the server provided a certificate with subject name [CN=node-b], fingerprint [5e9bc14de6fe16ba24946e61ada6be0af44ff385], no keyUsage and no extendedKeyUsage; the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate does not have any subject alternative names; the certificate is self-issued; the [CN=node-b] certificate is not trusted in this ssl context ([xpack.security.transport.ssl (with trust configuration: PEM-trust{/Users/robbie/elastic/8.2.0_replication_issue/node-c-data/config/node-a.crt,/Users/robbie/elastic/8.2.0_replication_issue/node-c-data/config/node-c.crt})])
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
elasticmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

jbaiera commented 2 years ago

This could be an interesting indicator for the new health API. Eventually the API will be able to report on master node connectivity issues (among other things). I could see there being a node-to-node-connectivity indicator of some sort that ensures that transport connections to all other nodes are functional and remain so over X period of time.
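
As a purely hypothetical illustration (the indicator name, symptom, and diagnosis text below are invented, not an existing API), such an indicator might report something along these lines:

"node_to_node_connectivity" : {
  "status" : "yellow",
  "symptom" : "Transport connections between 2 nodes are failing",
  "impacts" : [
    {
      "severity" : 2,
      "description" : "Peer recoveries and replication between the affected nodes cannot complete."
    }
  ],
  "diagnosis" : [
    {
      "cause" : "node-c does not trust the transport certificate presented by node-b",
      "action" : "Review the xpack.security.transport.ssl trust configuration on the affected nodes."
    }
  ]
}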

To make a case for this in the health API, any problem we check for should ideally be resolvable with advice that can be produced from within Elasticsearch. The certificate problem you mention is a good example: "Fix your trust settings, here's a general troubleshooting guide". Things become more nebulous when connections are failing due to strange network issues. Those might be indicative of a health problem, but there's little we can advise in those situations. I'm also not sure whether we track faults in connecting to other nodes anywhere in the transport layer.

I'm also not entirely sure if it's possible to determine a clean set of impacts for a cluster that is experiencing intermittent or permanent network partitioning other than to say "write availability for the cluster is degraded".