canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

Cluster re-establishment after reboot #416

Closed juditnovak closed 2 months ago

juditnovak commented 3 months ago

After rebooting the host (multipass in my case but it shouldn't matter) the service is unavailable.

Screenshot from 2024-08-27 16-30-43

Steps to reproduce

  1. Deploy opensearch (rev 137)
  2. Reboot the host

Expected behavior

Cluster is active.

Actual behavior

See above.

Versions

image

LXD: 5.0.3

Log output

Juju debug log:

Not much in here.

unit-opensearch-1: 16:33:42 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:33:42 ERROR unit.opensearch/1.juju-log Cannot connect to the OpenSearch server...
unit-opensearch-1: 16:33:42 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:33:42 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:33:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-opensearch-0: 16:35:03 INFO juju.worker.uniter.operation ran "opensearch-client-relation-departed" hook (via hook dispatching script: dispatch)
unit-opensearch-0: 16:35:03 INFO juju.worker.uniter.runner executing opensearch-client-relation-broken via debug-code; hook dispatching script: dispatch
unit-opensearch-0: 16:35:04 INFO unit.opensearch/0.juju-log opensearch-client:9: debug running /var/lib/juju/agents/unit-opensearch-0/charm/dispatch for opensearch-client-relation-broken
unit-opensearch-0: 16:35:06 WARNING unit.opensearch/0.juju-log opensearch-client:9: 'app' expected but not received.
unit-opensearch-0: 16:35:06 WARNING unit.opensearch/0.juju-log opensearch-client:9: 'app_name' expected in snapshot but not found.
unit-opensearch-0: 16:35:08 INFO juju.worker.uniter.operation ran "opensearch-client-relation-broken" hook (via hook dispatching script: dispatch)
unit-opensearch-0: 16:36:45 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-self-signed-certificates-0: 16:37:44 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-opensearch-1: 16:37:45 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:37:45 ERROR unit.opensearch/1.juju-log Cannot connect to the OpenSearch server...
unit-opensearch-1: 16:37:45 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:37:45 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:37:46 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-opensearch-0: 16:41:37 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-self-signed-certificates-0: 16:42:09 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-opensearch-1: 16:42:13 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:42:13 ERROR unit.opensearch/1.juju-log Cannot connect to the OpenSearch server...
unit-opensearch-1: 16:42:13 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:42:13 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:42:13 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-opensearch-0: 16:46:48 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-opensearch-1: 16:47:19 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:47:19 ERROR unit.opensearch/1.juju-log Cannot connect to the OpenSearch server...
unit-opensearch-1: 16:47:19 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:47:19 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:47:20 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
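
To make the pattern in the debug log easier to see, here is a minimal sketch that counts ERROR lines per unit. The line layout (`<unit>: <time> <LEVEL> <logger> <message>`) is inferred from the excerpt above, and `errors_by_unit` and `SAMPLE` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "<unit>: HH:MM:SS LEVEL logger message"
LINE_RE = re.compile(
    r"^(?P<unit>[\w-]+): (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) "
    r"(?P<logger>\S+) (?P<message>.*)$"
)

# A few lines copied from the debug log in this report.
SAMPLE = """\
unit-opensearch-1: 16:33:42 ERROR unit.opensearch/1.juju-log [Errno 111] Connection refused
unit-opensearch-1: 16:33:42 ERROR unit.opensearch/1.juju-log Cannot connect to the OpenSearch server...
unit-opensearch-1: 16:33:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-opensearch-0: 16:36:45 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
"""


def errors_by_unit(text):
    """Count ERROR-level lines per Juju unit; lines that do not match are skipped."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("unit")] += 1
    return counts


print(dict(errors_by_unit(SAMPLE)))
```

On the sample lines, only `unit-opensearch-1` has ERROR entries. The same function could be fed the full output of `juju debug-log`.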

However, in the application logs (/var/snap/opensearch/common/var/log/opensearch/opensearch-nfp7.log) the following keeps repeating:

[2024-08-27T14:38:29,274][INFO ][o.o.s.c.ConfigurationRepository] [opensearch-0.e74] Wait for cluster to be available ...
[2024-08-27T14:38:30,029][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-0.e74] cluster-manager not discovered or elected yet, an election requires a node with id [yTLtw5wNQlCsHUcrKaU5Kw], have discovered [{opensearch-0.e74}{5dpyzZhsRqKpW9ybb3r2Gg}{wWmsOQKxSMu8psq_FOpJPQ}{10.8.62.204}{10.8.62.204:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=617e5f02-5be5-4e25-85f0-276b2347a5ad/opensearch}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 127.0.0.1:9300] from hosts providers and [{opensearch-0.e74}{5dpyzZhsRqKpW9ybb3r2Gg}{wWmsOQKxSMu8psq_FOpJPQ}{10.8.62.204}{10.8.62.204:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=617e5f02-5be5-4e25-85f0-276b2347a5ad/opensearch}] from last-known cluster state; node term 1, last-accepted version 49 in term 1
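
The WARN line is dense, so here is a minimal sketch that pulls out the two facts that seem most relevant: whether the node id required for the election is among the discovered nodes, and whether the seed hosts are all loopback addresses. `LOG_LINE` is a shortened copy of the message above (node attributes and some seed addresses omitted), `diagnose` is a hypothetical helper, and the regular expressions only cover the single-node wording seen in this report. It is an aid for reading the log, not a confirmed root-cause analysis.

```python
import ipaddress
import re

# Shortened copy of the WARN message from this report.
LOG_LINE = (
    "cluster-manager not discovered or elected yet, an election requires a node "
    "with id [yTLtw5wNQlCsHUcrKaU5Kw], have discovered "
    "[{opensearch-0.e74}{5dpyzZhsRqKpW9ybb3r2Gg}{wWmsOQKxSMu8psq_FOpJPQ}"
    "{10.8.62.204}{10.8.62.204:9300}] which is not a quorum; discovery will "
    "continue using [127.0.0.1:9300, 127.0.0.1:9301, [::1]:9300, [::1]:9301] "
    "from hosts providers"
)


def _is_loopback(host):
    """True for loopback IPs; hostnames are not resolved in this sketch."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False


def diagnose(line):
    """Extract the required node id and check discovered nodes and seed hosts."""
    required = re.search(r"requires a node with id \[([^\]]+)\]", line)
    discovered = re.search(r"have discovered \[(.*?)\] which", line)
    seeds = re.search(r"continue using \[(.*?)\] from hosts providers", line)
    if not (required and discovered and seeds):
        raise ValueError("line does not match the expected message wording")
    required_id = required.group(1)
    # "[::1]:9300" -> "::1", "127.0.0.1:9300" -> "127.0.0.1"
    seed_hosts = [s.rsplit(":", 1)[0].strip("[]") for s in seeds.group(1).split(", ")]
    return {
        "required_id": required_id,
        "required_id_discovered": "{" + required_id + "}" in discovered.group(1),
        "only_loopback_seeds": all(_is_loopback(h) for h in seed_hosts),
    }


print(diagnose(LOG_LINE))
```

For the line in this report, the required id is not among the discovered nodes and every seed address is loopback, so this node only ever probes itself. Whether that points to the quorum or the discovery side is the question raised later in the thread.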

Full application logs: logs.txt
Journalctl: journalctl.txt

syncronize-issues-to-jira[bot] commented 3 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5274.

This message was autogenerated

juditnovak commented 2 months ago

@phvalguima Could you please confirm whether https://github.com/canonical/opensearch-operator/pull/405 (Node exclusions fix) is the one that addresses this issue, or rather https://github.com/canonical/opensearch-operator/pull/407 (Full restart)?

Since this looks like a quorum problem, my suspicion is the former. However, based on the error message, I wonder whether it may be a node discovery issue instead? (Sorry, I'm low on context in this area, but I'd be interested to learn more details :-) )

phvalguima commented 2 months ago

#407 should fix this issue.