Closed — Mehdi-Bendriss closed this issue 2 weeks ago
I can reproduce this issue with:
1) Deploy an LXD cluster
2) Restart the host machine after the cluster has settled
Once the machine is back, the cluster cannot settle anymore. A `systemd status` on each of the nodes shows: https://pastebin.ubuntu.com/p/MMFvSVvNZx/
One of the 3x units is up but cannot find the cluster manager anymore:
Jul 02 12:44:44 juju-c52a5b-0 opensearch.daemon[1076]: [2024-07-02T12:44:44,159][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-0] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [oTRIYXJ-QQ61M9fKQpfw5Q, 97OSchtyRwq-9rgd_zfoiw, OLvEl2U6TRa8WoXmGAIuoQ], have discovered [{opensearch-0}{OLvEl2U6TRa8WoXmGAIuoQ}{_F2PYiagQ6qQ-oZplrwBZg}{10.175.87.40}{10.175.87.40:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 127.0.0.1:9300] from hosts providers and [{opensearch-0}{OLvEl2U6TRa8WoXmGAIuoQ}{_F2PYiagQ6qQ-oZplrwBZg}{10.175.87.40}{10.175.87.40:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 50 in term 1
That means `opensearch/0` restarted because it was the leader charm unit, but since it is not the elected cluster manager, the other units cannot start as they can no longer grab the lock.
I reran the same test and saw the same problem again. However, manually restarting the remaining CM nodes that were powered down brings the cluster back.
I also tried the following scenarios:
- node-1
- node-2 (which is not the cluster manager): ensure the metadata has a more recent version than node-1's
- node-3

Therefore, we can bring the nodes up in any order without much damage to the cluster in this scenario.
When we restart a node, the entire `start` / `config-changed` hook sequence will re-run.
I propose the following: at the `start` hook, check whether the unit already has the opensearch snap installed and the peer relation is set with a `started` flag; if so, start the service.
This way, all the eligible CMs will come back online and eventually the cluster will form again.
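The proposed check can be sketched as a pure decision function (helper and flag names here are hypothetical, not the actual charm code):

```python
def should_start_service(snap_installed: bool, peer_data: dict) -> bool:
    """Decide at the `start` hook whether to bring opensearch back after a reboot.

    snap_installed: whether the opensearch snap is already present on the machine
    peer_data: this unit's peer-relation databag (string keys and values)
    """
    # Only start immediately if the unit had previously completed a successful
    # start, recorded as a "started" flag in the peer relation.
    return snap_installed and peer_data.get("started") == "True"
```

On a fresh unit the flag is absent, so the normal bootstrap path still runs; on a rebooted unit both conditions hold and the service is started right away.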
The other option we have is to set the systemd service to "enabled", which means opensearch will automatically restart after a reboot. I believe this option is preferable.
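For the snap-packaged service, this can be done through snapd, which starts the service and enables it at boot in one step (the service name is assumed from the `opensearch.daemon` unit seen in the logs):

```python
import subprocess


def autostart_cmd(service: str = "opensearch.daemon") -> list:
    # `snap start --enable` starts the service now and marks it to start
    # automatically at boot, so it survives host reboots.
    return ["snap", "start", "--enable", service]


def enable_autostart(service: str = "opensearch.daemon") -> None:
    subprocess.run(autostart_cmd(service), check=True)
```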
Now, the main challenge I see with a full restart as suggested is the risk of split-brain at certain moments of the restart.
Deployed 6x opensearch nodes and:
1) Stopped the last 3x units (non-elected leaders)
2) Inserted a 6-replica index and pushed 100 docs
3) Stopped the first 3x units (with the elected leader)
4) Restarted the last 3x units
The last 3x units have a valid `unicast_hosts.txt` file and can find each other. However, each unit complains of missing an extra known unit to restart the cluster:
Jul 16 11:25:13 juju-a56f65-3 opensearch.daemon[19725]: [2024-07-16T11:25:13,727][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-3.ddf] cluster-manager not discovered or elected yet, an election requires at least 3 nodes with ids from [JhXcRAMmSGW5ElV6ixjkZQ, VeBf6GTdQbCKUzio85L8Cg, qv5k5erhRvm6TOQUYF5uPA, HzYBd9PPSk28AyU4gO63jQ, qyAQDXL6SbOrJ9PEp9FTWw], have discovered [{opensearch-3.ddf}{JhXcRAMmSGW5ElV6ixjkZQ}{ijs2FK6ZQCO8C3X1g6RXAg}{10.225.137.32}{10.225.137.32:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-4.ddf}{qv5k5erhRvm6TOQUYF5uPA}{yRRlrZW-TTO8r0vlJ2lAmw}{10.225.137.85}{10.225.137.85:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-5.ddf}{xb1RNEmURCqnjxJj9_zT-A}{q9m-eMYLQQGhF_Af1FiDjQ}{10.225.137.188}{10.225.137.188:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.225.137.203:9300, 10.225.137.155:9300, 10.225.137.85:9300, 10.225.137.182:9300, 10.225.137.188:9300] from hosts providers and [{opensearch-3.ddf}{JhXcRAMmSGW5ElV6ixjkZQ}{ijs2FK6ZQCO8C3X1g6RXAg}{10.225.137.32}{10.225.137.32:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}] from last-known cluster state; node term 1, last-accepted version 78 in term 1
A unit from the first 3x nodes, once brought back, cannot connect to the cluster either, as it does not recognize the last 3x opensearch units as having valid metadata:
Jul 16 11:30:27 juju-a56f65-1 opensearch.daemon[21522]: [2024-07-16T11:30:27,500][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-1.ddf] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [VeBf6GTdQbCKUzio85L8Cg, HzYBd9PPSk28AyU4gO63jQ, qyAQDXL6SbOr
J9PEp9FTWw], have discovered [{opensearch-1.ddf}{HzYBd9PPSk28AyU4gO63jQ}{TJAI0urnQUOMfrSf_bqQFQ}{10.225.137.155}{10.225.137.155:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-3.ddf}{JhXcRAMmSGW5ElV6ixjkZQ}{ijs2FK6ZQCO
8C3X1g6RXAg}{10.225.137.32}{10.225.137.32:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-4.ddf}{qv5k5erhRvm6TOQUYF5uPA}{yRRlrZW-TTO8r0vlJ2lAmw}{10.225.137.85}{10.225.137.85:9300}{coordinating_onlydimml}{shard_indexing
_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}, {opensearch-5.ddf}{xb1RNEmURCqnjxJj9_zT-A}{q9m-eMYLQQGhF_Af1FiDjQ}{10.225.137.188}{10.225.137.188:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}] whi
ch is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.225.137.182:9300, 10.225.137.203:9300, 10.225.137.32:9300, 10.225.137.85:9300, 10.225
.137.188:9300] from hosts providers and [{opensearch-1.ddf}{HzYBd9PPSk28AyU4gO63jQ}{TJAI0urnQUOMfrSf_bqQFQ}{10.225.137.155}{10.225.137.155:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=fdfaa772-90ee-4e84-8e56-f06d0ea56f65/opensearch}] from last-known cluster state; node term 2,
last-accepted version 121 in term 2
It is expecting a subset of the 6x nodes, although they are all eligible cluster managers.
Restarting all nodes eventually brings the cluster back to a healthy status and all 6x nodes are shown in `/_cat/nodes`. The `opensearch/1` unit came up as the elected manager now.
I ran another test, this time targeting a split-brain scenario, which may happen in a full power-off / full power-on:
Deploying 5x opensearch units, I've:
The cluster, as expected, is not coming up due to missing units:
Jul 16 10:58:20 juju-354fd8-3 opensearch.daemon[24528]: [2024-07-16T10:58:20,139][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-3.d6c] cluster-manager not discovered or elected yet, an election requires at least 3 nodes with ids from [41bzeg-yRSq47P6qDP9R5w, YDjXVrHLR425K3pB2OreWg, qpf4jqIvRpWq
FZzVXP0vnQ, B1yc-Fk0TDu9DTAQ0vXZ0g, ygFPNW0VTfmk_5AXAekfPQ], have discovered [{opensearch-3.d6c}{ygFPNW0VTfmk_5AXAekfPQ}{xROcg9LjScWc96IsPXoilg}{10.225.137.179}{10.225.137.179:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=a018c111-55b7-446a-89b2-1a014a354fd8/opensearch}, {opens
earch-4.d6c}{qpf4jqIvRpWqFZzVXP0vnQ}{7UcFZnjwTt-6yfKKYzPEEg}{10.225.137.56}{10.225.137.56:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=a018c111-55b7-446a-89b2-1a014a354fd8/opensearch}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.
0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.225.137.26:9300, 10.225.137.56:9300, 10.225.137.45:9300, 10.225.137.242:9300] from hosts providers and [{opensearch-3.d6c}{ygFPNW0VTfmk_5AXAekfPQ}{xROcg9LjScWc96IsPXoilg}{10.
225.137.179}{10.225.137.179:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=a018c111-55b7-446a-89b2-1a014a354fd8/opensearch}] from last-known cluster state; node term 1, last-accepted version 122 in term 1
It is not possible to add new units either, as the cluster depends on the `.charm-lock` index to coordinate restarts. Adding a new unit will flush the `unicast_hosts.txt` file at the peer-changed execution. The `ClusterTopology.nodes` call will not work.
Closed with #407
When the LXD host machine reboots, the charm fails to get the opensearch service running, raising 503 HTTP errors in the process.
Server logs:
This is most likely due to the `unicast_hosts.txt` and `opensearch.yml` (`initial_cluster_manager`) not being properly cleaned, i.e. offline CM-eligible nodes are not removed from their content before the start process begins.
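A minimal sketch of such a cleanup, assuming we can probe each candidate's transport port before rewriting `unicast_hosts.txt` (function names and the probing approach are hypothetical, not the charm's actual implementation):

```python
import socket


def filter_reachable(hosts, port=9300, timeout=1.0):
    """Keep only hosts whose transport port accepts connections, so discovery
    is not pointed at offline CM-eligible nodes after a reboot."""
    reachable = []
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(host)
        except OSError:
            # Connection refused / timed out: treat the node as offline.
            pass
    return reachable


def rewrite_unicast_hosts(path, hosts):
    # Rewrite the discovery seed file with only the hosts that are up.
    with open(path, "w") as f:
        f.write("\n".join(hosts) + "\n")
```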