Hi @jeffreychang911, at first glance, the nodes are getting the wrong IPs when trying to form the cluster, even though they render their opensearch.yml correctly:
```yaml
node.name: opensearch-1
network.host:
- _site_
- node1.silo1.lab0.solutionsqa
- 10.244.40.205
```
Here is a snippet from one of the units:
```
Jun 25 09:42:22 node1 opensearch.daemon[13265]: [2024-06-25T09:42:22,904][INFO ][o.o.s.c.ConfigurationRepository] [opensearch-1] Wait for cluster to be available ...
Jun 25 09:42:23 node1 opensearch.daemon[13265]: [2024-06-25T09:42:23,254][WARN ][o.o.t.OutboundHandler    ] [opensearch-1] send message failed [channel: Netty4TcpChannel{localAddress=/10.244.8.3:48868, remoteAddress=ens4.2678.node2.silo1.lab0.solutionsqa/10.244.8.4:9300}]
Jun 25 09:42:23 node1 opensearch.daemon[13265]: javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching ens4.2678.node2.silo1.lab0.solutionsqa found.
Jun 25 09:42:23 node1 opensearch.daemon[13265]:         at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130) ~[?:?]
Jun 25 09:42:23 node1 opensearch.daemon[13265]:         at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:378) ~[?:?]
Jun 25 09:42:23 node1 opensearch.daemon[13265]:         at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:321) ~[?:?]
Jun 25 09:42:23 node1 opensearch.daemon[13265]:         at
...
```
After some further digging, I believe the issue is due to a "random" selection that happens inside OpenSearch, coupled with the fact that our charm is not really picky about which interface to bind its servers to (the transport layer on :9300, and the HTTP service itself on :9200).
This is the current juju status:
```
Model       Controller        Cloud/Region        Version  SLA          Timestamp
opensearch  foundations-maas  maas_cloud/default  3.5.1    unsupported  10:35:03Z

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
opensearch                         waiting      3  opensearch                2/edge         102  no       Requesting lock on operation: start
self-signed-certificates           active       1  self-signed-certificates  latest/stable   72  no

Unit                         Workload  Agent  Machine  Public address  Ports     Message
opensearch/0                 waiting   idle   0        10.244.40.204             Requesting lock on operation: start
opensearch/1                 waiting   idle   1        10.244.40.205             Waiting for OpenSearch to start...
opensearch/2*                active    idle   2        10.244.40.206   9200/tcp
self-signed-certificates/0*  active    idle   3        10.244.40.207

Machine  State    Address        Inst id  Base          AZ     Message
0        started  10.244.40.204  node3    ubuntu@22.04  zone1  Deployed
1        started  10.244.40.205  node1    ubuntu@22.04  zone2  Deployed
2        started  10.244.40.206  node2    ubuntu@22.04  zone3  Deployed
3        started  10.244.40.207  node5    ubuntu@22.04  zone3  Deployed
```
So, OpenSearch should be binding its transport and HTTP services to interfaces on the 10.244.40.0/24 subnet.
However, on node1, we can see:
```
# Expected, as this node is not the charm leader and the cluster is still starting
ubuntu@node1:~$ curl -sk -u admin:<pwd> https://10.244.40.205:9200/_nodes
OpenSearch Security not initialized.

# This node is "seeing" the wrong IP for its cluster manager, "node2"
ubuntu@node1:~$ curl -sk -u admin:<pwd> https://10.244.40.206:9200/_nodes | jq .nodes[].ip
"10.244.8.4"
```
Looking further into node1, we can see:
```
$ sudo cat /var/snap/opensearch/51/etc/opensearch/unicast_hosts.txt
ens4.2678.node2.silo1.lab0.solutionsqa
10.244.8.4
```
```
$ curl -sk -u admin:<pwd> https://10.244.40.206:9200/_nodes | jq . | head -n 30
{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "opensearch-k8nj",
  "nodes": {
    "pQrgVMinT7GnMMah6nqq-A": {
      "name": "opensearch-2",
      "transport_address": "10.244.8.4:9300",  <<<<<----------------------------------------
      "host": "10.244.8.4",
      "ip": "10.244.8.4",                      <<<<<----------------------------------------
      "version": "2.14.0",
      "build_type": "tar",
      "build_hash": "30dd870855093c9dca23fc6f8cfd5c0d7c83127d",
      "total_indexing_buffer": 107374182,
      "roles": [
        "cluster_manager",
        "coordinating_only",
        "data",
        "ingest",
        "ml"
      ],
...
```
So, the question is why node2 is registering itself with the wrong IP. node2 has the following opensearch.yml configuration:
```yaml
cluster.name: opensearch-k8nj
node.name: opensearch-2
network.host:
- _site_
- node2.silo1.lab0.solutionsqa
- 10.244.40.206
```
That means the very first entry it resolves is _site_. Indeed, looking at node2's own logs, we can see:
```
/var/log/syslog.1:Jun 21 19:26:10 node2 opensearch.daemon[10760]: [2024-06-21T19:26:10,322][INFO ][o.o.d.PeerFinder ] [opensearch-2] setting findPeersInterval to [1s] as node commission status = [true] for local node [{opensearch-2}{pQrgVMinT7GnMMah6nqq-A}{nbdGPdVbQ9yLsGUdM4J5Fw}{10.244.8.4}{10.244.8.4:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}]
```
So, out of all the local IPs available, OpenSearch itself selected 10.244.8.4 to bind to.
The bind happens in this line, where OpenSearch asks NetworkService for a suitable IP. That method ends up gathering all the eligible IPs into a list, sorting it, and returning the first value. The actual search for local IPs is implemented here, where it discovers all the addresses that are site-local (i.e., in private ranges).
That means that, although we pass IPs from 10.244.40.0/24 across Juju relations, OpenSearch itself selects 10.244.8.0/24 as its default subnet. As a result, the certificate SANs will not match the advertised transport IPs and connections fail with a "certificate_unknown" error.
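To see why 10.244.8.4 wins that sort, here is a rough Python illustration of byte-wise address ordering (OpenSearch does this in Java, so this is an analogy, not its actual code):

```python
import ipaddress

# The two private IPs visible on node2 (taken from the logs above).
candidates = [
    ipaddress.ip_address("10.244.40.206"),  # the IP the charm shares over relations
    ipaddress.ip_address("10.244.8.4"),     # the other site-local IP on the host
]

# Addresses compare byte by byte: in the third octet, 8 < 40, so
# 10.244.8.4 sorts first and ends up as the selected publish address.
print(sorted(candidates)[0])  # -> 10.244.8.4
```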
I propose we move away from _site_ in the network.host configuration. We can either remove it entirely and rely only on the IP and hostname set by the charm, or use _[network interface]_ instead.
Now, using the network interface brings two extra considerations: (1) we need to discover which network interface we are interested in, based on the IP ranges coming from Juju spaces (see the sketch after the example below); and (2) a network interface may have more than one IP, including VIPs, which may be selected instead. VIPs are even riskier, as they can bounce between different units.
For example, a single interface could hold both its own IP, 192.168.0.200, and a VIP temporarily assigned to the same node, 192.168.0.10. The VIP sorts first and would hence be selected. The same can happen between IPv4 and IPv6 addresses.
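As a sketch of consideration (1), matching the node's addresses against the subnet a Juju space resolves to would look roughly like this; pick_bind_ip is a hypothetical helper, not charm code:

```python
import ipaddress

def pick_bind_ip(local_ips: list[str], space_cidr: str) -> str | None:
    """Return the first local IP that falls inside the Juju space's subnet."""
    space = ipaddress.ip_network(space_cidr)
    for ip in local_ips:
        if ipaddress.ip_address(ip) in space:
            return ip
    return None

# On node2, filtering by the space's subnet skips the 10.244.8.0/24 address:
print(pick_bind_ip(["10.244.8.4", "10.244.40.206"], "10.244.40.0/24"))
# -> 10.244.40.206
```

Note this still does not solve consideration (2): a VIP living in the same subnet would pass the filter and would have to be excluded explicitly.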
I've discussed this issue with @Mehdi-Bendriss and @reneradoi. The solution we will adopt is to: (1) generate certificates with SANs for all the available IPs; and (2) set network.host to the IP of the specific relation binding. For large vs. small deployments, we will have to check that the bindings match, for (2).
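On the charm side, a minimal sketch of what (1) and (2) could look like with the ops framework, assuming the opensearch-peers endpoint that appears in the debug log below; this is an illustration, not the actual implementation:

```python
from ops.charm import CharmBase

PEER_RELATION = "opensearch-peers"  # endpoint name assumed from the debug log below

class OpenSearchCharm(CharmBase):
    """Sketch only; property names are illustrative."""

    @property
    def bind_address(self) -> str:
        # (2) The unit's IP on the space this binding resolves to; this is
        # what network.host would carry instead of _site_.
        return str(self.model.get_binding(PEER_RELATION).network.bind_address)

    @property
    def san_ips(self) -> list[str]:
        # (1) Every address Juju reports on the binding's interfaces, so the
        # certificate carries a SAN for each IP a peer might connect to.
        network = self.model.get_binding(PEER_RELATION).network
        return [str(iface.address) for iface in network.interfaces]
```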
Steps to reproduce
Expected behavior
OpenSearch nodes should settle and reach active/idle state after a while.
Actual behavior
The leader instance settles shortly, but non-leaders get "Requesting lock on operation: start" and "Waiting for OpenSearch to start...", with their status stuck at "waiting" for hours.
Versions
Operating system: Ubuntu 22.04
Juju CLI: 3.5.1
Juju agent: 3.5.1
Charm revision: 2/edge rev 102.
LXD:
Log output
Juju debug log:
The only error I can find:

```
unit-opensearch-2: 19:27:35 ERROR unit.opensearch/2.juju-log [Errno 111] Connection refused
unit-opensearch-2: 19:27:35 ERROR unit.opensearch/2.juju-log [Errno 111] Connection refused
unit-opensearch-1: 19:27:36 ERROR unit.opensearch/1.juju-log node-lock-fallback:0: [Errno 111] Connection refused
unit-opensearch-0: 19:29:47 ERROR unit.opensearch/0.juju-log opensearch-peers:1: [Errno 111] Connection refused
unit-opensearch-0: 19:29:47 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Cannot connect to the OpenSearch server...
unit-opensearch-2: 19:29:47 ERROR unit.opensearch/2.juju-log opensearch-peers:1: [Errno 111] Connection refused
```
Additional context
opensearch.debug.log.gz