canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

non-leader instances won't settle to active/idle state #334

Closed jeffreychang911 closed 4 days ago

jeffreychang911 commented 1 week ago

Steps to reproduce

  1. Deploy 3 OpenSearch nodes on top of a MAAS cloud and relate them to self-signed-certificates.

Expected behavior

OpenSearch nodes should settle and reach the active/idle state after a while.

Actual behavior

The leader instance settles shortly, but non-leaders report "Requesting lock on operation: start" and "Waiting for OpenSearch to start...", and their status stays stuck at "waiting" for hours.

Versions

Operating system:

Juju CLI: 3.5.1

Juju agent: 3.5.1

Charm revision: 2/edge rev 102.

LXD:

Log output

Juju debug log:

The only error I can find:

unit-opensearch-2: 19:27:35 ERROR unit.opensearch/2.juju-log [Errno 111] Connection refused
unit-opensearch-2: 19:27:35 ERROR unit.opensearch/2.juju-log [Errno 111] Connection refused
unit-opensearch-1: 19:27:36 ERROR unit.opensearch/1.juju-log node-lock-fallback:0: [Errno 111] Connection refused
unit-opensearch-0: 19:29:47 ERROR unit.opensearch/0.juju-log opensearch-peers:1: [Errno 111] Connection refused
unit-opensearch-0: 19:29:47 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Cannot connect to the OpenSearch server...
unit-opensearch-2: 19:29:47 ERROR unit.opensearch/2.juju-log opensearch-peers:1: [Errno 111] Connection refused

Additional context

opensearch.debug.log.gz

github-actions[bot] commented 1 week ago

https://warthogs.atlassian.net/browse/DPE-4729

phvalguima commented 1 week ago

Hi @jeffreychang911, at first glance the nodes are getting the wrong IPs when trying to form the cluster, even though they render opensearch.yml correctly:

node.name: opensearch-1
network.host:
- _site_
- node1.silo1.lab0.solutionsqa
- 10.244.40.205

Here is a log snippet from one of the units:

Jun 25 09:42:22 node1 opensearch.daemon[13265]: [2024-06-25T09:42:22,904][INFO ][o.o.s.c.ConfigurationRepository] [opensearch-1] Wait for cluster to be available ...
Jun 25 09:42:23 node1 opensearch.daemon[13265]: [2024-06-25T09:42:23,254][WARN ][o.o.t.OutboundHandler    ] [opensearch-1] send message failed [channel: Netty4TcpChannel{localAddress=/10.244.8.3:48868, remoteAddress=ens4.2678.node2.silo1.lab0.solutionsqa/10.244.8.4:9300}]
Jun 25 09:42:23 node1 opensearch.daemon[13265]: javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching ens4.2678.node2.silo1.lab0.solutionsqa found.
Jun 25 09:42:23 node1 opensearch.daemon[13265]:         at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130) ~[?:?]
Jun 25 09:42:23 node1 opensearch.daemon[13265]:         at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:378) ~[?:?]
Jun 25 09:42:23 node1 opensearch.daemon[13265]:         at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:321) ~[?:?]
Jun 25 09:42:23 node1 opensearch.daemon[13265]:         at

...
phvalguima commented 1 week ago

After some further digging, I believe the issue is due to a "random" interface selection that happens inside OpenSearch, coupled with the fact that our charm is not particular about which interfaces OpenSearch binds its listeners to (the transport layer on :9300 and the HTTP service itself on :9200).

This is the current juju status:

Model       Controller        Cloud/Region        Version  SLA          Timestamp
opensearch  foundations-maas  maas_cloud/default  3.5.1    unsupported  10:35:03Z

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
opensearch                         waiting      3  opensearch                2/edge         102  no       Requesting lock on operation: start
self-signed-certificates           active       1  self-signed-certificates  latest/stable   72  no       

Unit                         Workload  Agent  Machine  Public address  Ports     Message
opensearch/0                 waiting   idle   0        10.244.40.204             Requesting lock on operation: start
opensearch/1                 waiting   idle   1        10.244.40.205             Waiting for OpenSearch to start...
opensearch/2*                active    idle   2        10.244.40.206   9200/tcp  
self-signed-certificates/0*  active    idle   3        10.244.40.207             

Machine  State    Address        Inst id  Base          AZ     Message
0        started  10.244.40.204  node3    ubuntu@22.04  zone1  Deployed
1        started  10.244.40.205  node1    ubuntu@22.04  zone2  Deployed
2        started  10.244.40.206  node2    ubuntu@22.04  zone3  Deployed
3        started  10.244.40.207  node5    ubuntu@22.04  zone3  Deployed

So, OpenSearch should be binding its HTTP services to interfaces on the 10.244.40.0/24 subnet.

However, on node1, we can see:

# Expected, as this node is not charm leader and the cluster is still in its starting stages
ubuntu@node1:~$ curl -sk -u admin:<pwd> https://10.244.40.205:9200/_nodes 
OpenSearch Security not initialized.

# This node is "seeing" the wrong IP for its cluster manager, "node2"
ubuntu@node1:~$ curl -sk -u admin:<pwd> https://10.244.40.206:9200/_nodes | jq .nodes[].ip
"10.244.8.4"

Looking further into node1, we can see:

$ sudo cat /var/snap/opensearch/51/etc/opensearch/unicast_hosts.txt
ens4.2678.node2.silo1.lab0.solutionsqa
10.244.8.4

$ curl -sk -u admin:<pwd> https://10.244.40.206:9200/_nodes | jq . | head -n 30
{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "opensearch-k8nj",
  "nodes": {
    "pQrgVMinT7GnMMah6nqq-A": {
      "name": "opensearch-2",
      "transport_address": "10.244.8.4:9300",   <<<<<----------------------------------------
      "host": "10.244.8.4",
      "ip": "10.244.8.4",    <<<<<----------------------------------------
      "version": "2.14.0",
      "build_type": "tar",
      "build_hash": "30dd870855093c9dca23fc6f8cfd5c0d7c83127d",
      "total_indexing_buffer": 107374182,
      "roles": [
        "cluster_manager",
        "coordinating_only",
        "data",
        "ingest",
        "ml"
      ],
...

So, the question is why node2 is registering itself with the wrong IP.

phvalguima commented 1 week ago

node2 has the following opensearch.yml configuration:

cluster.name: opensearch-k8nj
node.name: opensearch-2
network.host:
- _site_
- node2.silo1.lab0.solutionsqa
- 10.244.40.206

That means the very first entry it resolves is _site_. Indeed, looking at its own logs, we can see:

/var/log/syslog.1:Jun 21 19:26:10 node2 opensearch.daemon[10760]: [2024-06-21T19:26:10,322][INFO ][o.o.d.PeerFinder         ] [opensearch-2] setting findPeersInterval to [1s] as node commission status = [true] for local node [{opensearch-2}{pQrgVMinT7GnMMah6nqq-A}{nbdGPdVbQ9yLsGUdM4J5Fw}{10.244.8.4}{10.244.8.4:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}]

So OpenSearch itself, out of all the local IPs available, selected 10.244.8.4 to bind to.

The bind happens in this line of the OpenSearch code, where it asks NetworkService for a suitable IP.

That method ends up gathering all the selected IPs into a list, sorting it, and returning its first value.

The actual site-local lookup is implemented here, where it discovers all the IPs that are "local" to the node (i.e., in private ranges).

That means that, although we pass IPs from 10.244.40.0/24 across Juju relations, the cluster itself selects 10.244.8.0/24 as its default subnet. As a result, the certificate SANs no longer match the transport IPs in use, and the handshake fails with a "certificate_unknown" error.
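
As a rough illustration of that selection (a minimal Python sketch, assuming a numeric, address-wise sort like the one described above; it is not the actual Java implementation):

import ipaddress

# Site-local (private) addresses available on node2, per the logs above.
candidates = [
    ipaddress.ip_address("10.244.40.206"),  # the address the charm passes over the relation
    ipaddress.ip_address("10.244.8.4"),     # the address OpenSearch actually bound to
]

# Sorting and taking the first entry mirrors the behaviour described above:
# numerically, 10.244.8.4 < 10.244.40.206, so it wins.
selected = sorted(candidates)[0]
print(selected)  # 10.244.8.4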

phvalguima commented 1 week ago

Action Plan

I propose we move away from _site_ in the network.host configuration. We can either remove it entirely and rely only on the IP and hostname set by the charm, or use _[network interface]_ instead.

Now, using the network interface has two extra things to consider: (1) we need to discover which network interface we are interested in, based on the IP ranges coming from Juju spaces; and (2) a network interface may have more than one IP, including VIPs, which may be selected instead. In the case of a VIP this is even riskier, as VIPs can bounce between different units.

For example, a single interface could hold both its own IP, 192.168.0.200, and a VIP temporarily assigned to the same node, 192.168.0.10. The VIP sorts first and hence gets selected. The same can happen between IPv4 and IPv6 addresses.
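
As a minimal sketch of consideration (1), assuming psutil is available on the unit; the interface names and addresses are illustrative only:

import ipaddress
import psutil

def interface_for(ip: str) -> str | None:
    """Return the local interface that carries the given (Juju space) IP."""
    target = ipaddress.ip_address(ip)
    for name, addrs in psutil.net_if_addrs().items():
        for addr in addrs:
            try:
                if ipaddress.ip_address(addr.address) == target:
                    return name
            except ValueError:
                continue  # skip MAC and scoped link-local entries that are not plain IPs
    return None

# e.g. interface_for("10.244.40.206") might return "ens4"; note the same interface
# can also carry a VIP (consideration (2)), which a bind on the interface name,
# rather than on the exact IP, would still pick up.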

phvalguima commented 1 week ago

I've discussed this issue with @Mehdi-Bendriss and @reneradoi. The solution we will adopt is to: (1) generate certificates with SANs for all the available IPs; and (2) point network.host at the specific relation binding's IP. For large vs. small deployments we will have to check that the bindings match for (2).
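
A minimal sketch of what (2) and the SANs part of (1) could look like in charm code, assuming the ops binding API and an "opensearch-peers" binding; the helper names are illustrative, not the actual implementation:

import socket

from ops.charm import CharmBase

def network_host_for(charm: CharmBase, binding_name: str = "opensearch-peers") -> list[str]:
    """Build network.host from the relation binding's IP instead of _site_."""
    network = charm.model.get_binding(binding_name).network
    return [str(network.bind_address), socket.getfqdn()]

def sans_for(charm: CharmBase, binding_name: str = "opensearch-peers") -> set[str]:
    """Collect every address on the binding so the certificate carries SANs for all of them."""
    network = charm.model.get_binding(binding_name).network
    return {str(network.bind_address)} | {str(a) for a in network.ingress_addresses}

Whatever the final shape, the key change is that network.host gets the binding's concrete address rather than _site_, while the certificate request covers every IP the node may expose.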