canonical / opensearch-operator

OpenSearch operator

Charm fails to bootstrap after host machine reboot #325

Closed: Mehdi-Bendriss closed this issue 2 weeks ago

Mehdi-Bendriss commented 3 months ago

When the LXD host machine reboots, the charm fails to get the opensearch service running, raising 503 HTTP errors in the process.

Server logs:

[2024-06-10T09:06:16,978][INFO ][o.o.s.c.ConfigurationRepository] [main-0_711] Wait for cluster to be available ...
[2024-06-10T09:06:17,208][WARN ][o.o.c.c.ClusterFormationFailureHelper] [main-0_711] cluster-manager not discovered or elected yet, an election requires at least 3 nodes with ids from [poxiRShkTxeXfn_Quoh7-Q, 7YwoLqLUSCqYXqEqSzc1BA, rNb9cX69ThSrIY8sUB0G8A, xL8-9yGaSRWsI80bZAV2FA, tkNsDbIHRYWGnfIV2y-d2A], have discovered [{main-0_711}{7YwoLqLUSCqYXqEqSzc1BA}{vrNYjytWTjqkRdwnrmzfXw}{10.122.32.75}{10.122.32.75:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=4cac7b7d-58b3-4c18-80c7-8f14aaeaeff7/main}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 127.0.0.1:9300] from hosts providers and [{main-0_711}{7YwoLqLUSCqYXqEqSzc1BA}{vrNYjytWTjqkRdwnrmzfXw}{10.122.32.75}{10.122.32.75:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=4cac7b7d-58b3-4c18-80c7-8f14aaeaeff7/main}] from last-known cluster state; node term 1, last-accepted version 86 in term 1
[2024-06-10T09:06:17,673][ERROR][o.o.s.a.BackendRegistry  ] [main-0_711] Not yet initialized (you may need to run securityadmin)
[2024-06-10T09:06:17,880][ERROR][o.o.s.a.BackendRegistry  ] [main-0_711] Not yet initialized (you may need to run securityadmin)

This is most likely because unicast_hosts.txt and opensearch.yml (initial_cluster_manager) are not properly cleaned before the start process begins: the entries for offline CM-eligible nodes are left in place.
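For illustration, a minimal sketch of the kind of cleanup this refers to (the file path and the reachable_hosts input are assumptions here; the real charm derives both from its own config and peer relation):

```python
from pathlib import Path

# Hypothetical snap config path; the real charm knows its own config dir.
UNICAST_HOSTS = Path("/var/snap/opensearch/current/etc/opensearch/unicast_hosts.txt")

def prune_unicast_hosts(reachable_hosts: set[str]) -> None:
    """Drop entries for offline CM-eligible nodes so that discovery is not
    seeded with unreachable peers when the service starts after a reboot."""
    entries = UNICAST_HOSTS.read_text().splitlines()
    kept = [e for e in entries if e.strip() in reachable_hosts]
    UNICAST_HOSTS.write_text("\n".join(kept) + "\n")
```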

github-actions[bot] commented 3 months ago

https://warthogs.atlassian.net/browse/DPE-4588

phvalguima commented 3 months ago

I can reproduce this issue as follows:

1) Deploy an LXD cluster
2) Restart the host machine after the cluster has settled

Once the machine is back, the cluster cannot settle anymore. A systemd status on each of the nodes shows: https://pastebin.ubuntu.com/p/MMFvSVvNZx/

One of the 3x units is up but cannot find the cluster manager anymore:

Jul 02 12:44:44 juju-c52a5b-0 opensearch.daemon[1076]: [2024-07-02T12:44:44,159][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-0] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [oTRIYXJ-QQ61M9fKQpfw5Q, 97OSchtyRwq-9rgd_zfoiw, OLvEl2U6TRa8WoXmGAIuoQ], have discovered [{opensearch-0}{OLvEl2U6TRa8WoXmGAIuoQ}{_F2PYiagQ6qQ-oZplrwBZg}{10.175.87.40}{10.175.87.40:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 127.0.0.1:9300] from hosts providers and [{opensearch-0}{OLvEl2U6TRa8WoXmGAIuoQ}{_F2PYiagQ6qQ-oZplrwBZg}{10.175.87.40}{10.175.87.40:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 50 in term 1

That means opensearch/0 restarted because it was the leader charm unit; but since it is not the elected cluster manager, the other units cannot grab the lock anymore and are blocked from starting.

phvalguima commented 2 months ago

I reran the same test and am indeed seeing the same problem again. However, manually restarting the remaining CM nodes that were powered down brings the cluster back.

I also tried the following scenarios:

Therefore, I can see we can bring nodes up in any order without much damage to the cluster in this scenario.

When we restart a node, the entire start / config-changed hook sequence will re-run.

I propose the following: in the start hook, check whether the opensearch snap is already installed and the peer relation carries a started flag; if both hold, start the service (see the sketch below).

This way, all the eligible CMs will come back online and eventually the cluster will form again.
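A minimal sketch of that start-hook guard, assuming the ops framework; the peer relation name ("opensearch-peers") and the "started" flag key are assumptions about the charm's internals, while the opensearch.daemon service name comes from the logs above:

```python
import subprocess

from ops.charm import CharmBase, StartEvent


class OpenSearchCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.start, self._on_start)

    def _on_start(self, event: StartEvent) -> None:
        peers = self.model.get_relation("opensearch-peers")  # name assumed
        started = peers is not None and peers.data[self.unit].get("started") == "True"
        if self._snap_installed() and started:
            # Host reboot after a successful bootstrap: just bring the
            # service back up instead of re-running the full start logic.
            subprocess.run(["snap", "start", "opensearch.daemon"], check=True)
            return
        # ... otherwise fall through to the normal bootstrap path ...

    def _snap_installed(self) -> bool:
        # A snap is installed iff `snap list <name>` exits 0.
        return subprocess.run(
            ["snap", "list", "opensearch"], capture_output=True
        ).returncode == 0
```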

The other option we have is to mark the systemd service as "enabled", so opensearch restarts automatically after a reboot. I believe this option is preferable.
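A sketch of that alternative: `snap start --enable` is the snapd way of both starting a snap service and marking it to start at boot; the assumption is that the charm would call it once after a successful bootstrap.

```python
import subprocess

def enable_opensearch_on_boot() -> None:
    # Starts the service now and marks it enabled, so systemd brings
    # it back automatically after a host reboot.
    subprocess.run(["snap", "start", "--enable", "opensearch.daemon"], check=True)
```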

phvalguima commented 2 months ago

CONT

Now, the main challenge I see with a full restart as suggested is the risk of split-brain at certain moments of the restart.

Split brain - Test 1 - 6-node opensearch - half-cluster cut off

Deployed 6x opensearch nodes and:

1) Stopped the last 3x units (non-elected leaders)
2) Inserted a 6-replica index and pushed 100 docs
3) Stopped the first 3x units (with the elected leader)
4) Restarted the last 3x units

The last 3x units have a valid unicast_hosts.txt file and can find each other. However, each unit complains that it is missing an extra known unit required to restart the cluster:

Jul 16 11:25:13 juju-a56f65-3 opensearch.daemon[19725]: [2024-07-16T11:25:13,727][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-3.ddf] cluster-manager not discovered or elected yet, an election requires at least 3 nodes with ids from [JhXcRAMmSGW5ElV6ixjkZQ, VeBf6GTdQbCKUzio85L8Cg, qv5k5erhRvm6TOQUYF5uPA, HzYBd9PPSk28AyU4gO63jQ, qyAQDXL6SbOrJ9PEp9FTWw], have discovered [{opensearch-3.ddf}{JhXcRAMmSGW5ElV6ixjkZQ}{ijs2FK6ZQCO8C3X1g6RXAg}{10.225.137.32}{10.225.137.32:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-4.ddf}{qv5k5erhRvm6TOQUYF5uPA}{yRRlrZW-TTO8r0vlJ2lAmw}{10.225.137.85}{10.225.137.85:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-5.ddf}{xb1RNEmURCqnjxJj9_zT-A}{q9m-eMYLQQGhF_Af1FiDjQ}{10.225.137.188}{10.225.137.188:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.225.137.203:9300, 10.225.137.155:9300, 10.225.137.85:9300, 10.225.137.182:9300, 10.225.137.188:9300] from hosts providers and [{opensearch-3.ddf}{JhXcRAMmSGW5ElV6ixjkZQ}{ijs2FK6ZQCO8C3X1g6RXAg}{10.225.137.32}{10.225.137.32:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}] from last-known cluster state; node term 1, last-accepted version 78 in term 1

Bringing up an extra unit

The new unit, from the first 3x nodes, cannot connect to the cluster, as it does not recognize the last 3x opensearch units as having valid metadata:

Jul 16 11:30:27 juju-a56f65-1 opensearch.daemon[21522]: [2024-07-16T11:30:27,500][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-1.ddf] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [VeBf6GTdQbCKUzio85L8Cg, HzYBd9PPSk28AyU4gO63jQ, qyAQDXL6SbOrJ9PEp9FTWw], have discovered [{opensearch-1.ddf}{HzYBd9PPSk28AyU4gO63jQ}{TJAI0urnQUOMfrSf_bqQFQ}{10.225.137.155}{10.225.137.155:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-3.ddf}{JhXcRAMmSGW5ElV6ixjkZQ}{ijs2FK6ZQCO8C3X1g6RXAg}{10.225.137.32}{10.225.137.32:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-4.ddf}{qv5k5erhRvm6TOQUYF5uPA}{yRRlrZW-TTO8r0vlJ2lAmw}{10.225.137.85}{10.225.137.85:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-5.ddf}{xb1RNEmURCqnjxJj9_zT-A}{q9m-eMYLQQGhF_Af1FiDjQ}{10.225.137.188}{10.225.137.188:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.225.137.182:9300, 10.225.137.203:9300, 10.225.137.32:9300, 10.225.137.85:9300, 10.225.137.188:9300] from hosts providers and [{opensearch-1.ddf}{HzYBd9PPSk28AyU4gO63jQ}{TJAI0urnQUOMfrSf_bqQFQ}{10.225.137.155}{10.225.137.155:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}] from last-known cluster state; node term 2, last-accepted version 121 in term 2

It is expecting a subset of the 6x nodes, although they are all eligible cluster managers.

Restarting all nodes

Restarting all nodes eventually brings the cluster back to a healthy status, and all 6x nodes are shown in /_cat/nodes. opensearch/1 is now the elected cluster manager.
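For reference, a quick way to verify the node list and the elected manager over the REST API (host and credentials below are placeholders; the deployed charm uses TLS with generated admin credentials):

```python
import requests

resp = requests.get(
    "https://10.225.137.32:9200/_cat/nodes?v",
    auth=("admin", "<admin-password>"),  # placeholder credentials
    verify=False,  # self-signed certificates in a test deployment
    timeout=5,
)
print(resp.text)  # one row per node; the elected cluster manager is marked '*'
```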

Split brain - Test 2 - 5-node opensearch + adding an extra node after

I ran another test, this time targeting a split brain scenario, which may happen in full power-off / full power-on:

Having deployed 5x opensearch units, I've:

The cluster, as expected, is not coming up due to missing units:

Jul 16 10:58:20 juju-354fd8-3 opensearch.daemon[24528]: [2024-07-16T10:58:20,139][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-3.d6c] cluster-manager not discovered or elected yet, an election requires at least 3 nodes with ids from [41bzeg-yRSq47P6qDP9R5w, YDjXVrHLR425K3pB2OreWg, qpf4jqIvRpWqFZzVXP0vnQ, B1yc-Fk0TDu9DTAQ0vXZ0g, ygFPNW0VTfmk_5AXAekfPQ], have discovered [{opensearch-3.d6c}{ygFPNW0VTfmk_5AXAekfPQ}{xROcg9LjScWc96IsPXoilg}{10.225.137.179}{10.225.137.179:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=a018c111-55b7-446a-89b2-1a014a354fd8/opensearch}, {opensearch-4.d6c}{qpf4jqIvRpWqFZzVXP0vnQ}{7UcFZnjwTt-6yfKKYzPEEg}{10.225.137.56}{10.225.137.56:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=a018c111-55b7-446a-89b2-1a014a354fd8/opensearch}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.225.137.26:9300, 10.225.137.56:9300, 10.225.137.45:9300, 10.225.137.242:9300] from hosts providers and [{opensearch-3.d6c}{ygFPNW0VTfmk_5AXAekfPQ}{xROcg9LjScWc96IsPXoilg}{10.225.137.179}{10.225.137.179:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=a018c111-55b7-446a-89b2-1a014a354fd8/opensearch}] from last-known cluster state; node term 1, last-accepted version 122 in term 1

It is not possible to add new units either, as the cluster depends on the .charm-lock index to coordinate restarts. Adding a new unit flushes the unicast_hosts.txt file during the peer-relation-changed execution, and the ClusterTopology.nodes call will not work either.
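To make that dependency concrete: any topology computation needs the HTTP API to answer, and while no cluster manager is elected the API only returns errors such as the 503s reported above. A minimal sketch of that dependency (endpoint and credentials are again placeholders, and this is an illustration of the API call such logic relies on, not the charm's actual code):

```python
import requests

def fetch_nodes(host: str) -> list[dict]:
    # While the cluster has no elected manager, this call fails
    # (e.g. HTTP 503), so any topology logic built on it fails too.
    resp = requests.get(
        f"https://{host}:9200/_nodes",
        auth=("admin", "<admin-password>"),  # placeholder credentials
        verify=False,  # self-signed certificates in a test deployment
        timeout=5,
    )
    resp.raise_for_status()
    return list(resp.json()["nodes"].values())
```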

phvalguima commented 2 weeks ago

Closed with #407