canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

[DPE-4588] Hook start checks if service was previously running #407

Closed phvalguima closed 2 months ago

phvalguima commented 3 months ago

This PR aims to handle the service restart automatically after a full cluster reboot. Currently, neither the charm nor systemd restarts the service once the node has rebooted.

Current Issue in the Charm

Once the node is restarted, the start hook fires again. Each node detects the new hook and runs the routine outside of the if self.is_node_up() check. Each unit then issues a restart-event and tries to acquire the lock.

The first unit to request the lock gets the peer-lock. Only that unit holds it, so it can restart safely.

Once that unit is up, all the remaining units have to acquire the node-lock. However, the single service will not be fully online, as it is only 1 out of the 2, 3, ..., X nodes. Since it still holds its own metadata listing its neighbours, it blocks waiting for them.

Therefore, the entire cluster will get stuck waiting for the lock.
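The deadlock described above can be illustrated with a toy model. This is only a sketch: it assumes the node-lock becomes reachable only once a majority of the previously known nodes are back online, which is an assumption about why the first unit blocks on its neighbours and not something stated in this PR. The function names are hypothetical.

```python
from typing import List


def has_quorum(online: int, total: int) -> bool:
    # Assumption: the node-lock needs a strict majority of known nodes online.
    return online > total // 2


def restart_after_full_reboot(total: int) -> List[int]:
    """Return the units that manage to start under the lock-based flow."""
    started: List[int] = []
    for unit in range(total):
        if not started:
            # No node is online yet, so the first unit falls back to the peer-lock.
            started.append(unit)
        elif has_quorum(len(started), total):
            # The node-lock is reachable, so this unit can proceed.
            started.append(unit)
        else:
            # The node-lock is unreachable: every remaining unit waits forever.
            break
    return started


print(restart_after_full_reboot(3))
```

Under these assumptions, only the first unit ever starts in a cluster of two or more nodes, which matches the stuck state described above.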

Proposal

We will focus only on nodes that have "cluster_manager" as one of the roles. All the other nodes should work with the lock requests in any case.

Akin to systemd's "enabled" concept: once a service is active, it will be started automatically after a reboot; the charm will now have logic that detects a "start" hook (hence, a reboot may have happened) and:
1) First checks if the unit is up (i.e. is_node_up() == True).
2) If not, checks whether the peer relation data marks this unit as "started" == "True". If yes, then either (i) a deferred "start" is wrapping up its last task, or (ii) a reboot happened. Case (i) is the exception (i.e. a "start" hook happened with the service down and "started" == "True").
3) As we do not need to distinguish between (i) and (ii), executes the same clean-up logic as in the is_node_up() == True condition.
4) Finally, restarts the systemd service.
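The decision flow in this proposal can be sketched as a small pure function. This is an illustration rather than the charm's actual code: UnitState, start_hook_action and the returned action labels are hypothetical names, and the fields merely stand in for is_node_up(), the "started" flag in the peer relation data, and the node roles.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class UnitState:
    """Hypothetical snapshot of what the start hook can observe."""

    node_up: bool            # stands in for is_node_up()
    started_flag: bool       # peer relation data has "started" == "True"
    roles: Tuple[str, ...]   # roles configured for this node


def start_hook_action(state: UnitState) -> str:
    """Pick what the start hook should do, following the proposal's steps."""
    if state.node_up:
        # Step 1: the service is already running, only the clean-up is needed.
        return "cleanup"
    if "cluster_manager" in state.roles and state.started_flag:
        # Steps 2-4: a deferred start or a reboot; both are handled the same way.
        return "cleanup-and-restart"
    # First start, or a node without the cluster_manager role: use the lock flow.
    return "request-lock"


print(start_hook_action(UnitState(False, True, ("cluster_manager", "data"))))
```

Keeping the decision separate from the side effects (clean-up, systemd restart, lock request) makes each branch easy to unit test without a running cluster.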

phvalguima commented 3 months ago

Three things to improve:
1) Count all the units, in both large and small deployments.
2) We are not using an actual service restart, so we need to remove the _post_start_init clean-up config.
3) Make is_service_started abstract in opensearch_base_charm.py.
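For the third point, a minimal sketch of what an abstract is_service_started could look like. The class names, the should_auto_restart helper and the dictionary-based peer data are hypothetical and only illustrate the pattern; the real base charm has a different shape.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BaseCharmSketch(ABC):
    """Hypothetical stand-in for the base charm class."""

    @abstractmethod
    def is_service_started(self) -> bool:
        """Return True if this unit was previously marked as started."""

    def should_auto_restart(self, node_up: bool) -> bool:
        # Restart only when the service is down but was running before.
        return (not node_up) and self.is_service_started()


class PeerDataCharmSketch(BaseCharmSketch):
    """Hypothetical subclass that reads the flag from peer relation data."""

    def __init__(self, peer_data: Dict[str, str]) -> None:
        self.peer_data = peer_data

    def is_service_started(self) -> bool:
        return self.peer_data.get("started") == "True"


charm = PeerDataCharmSketch({"started": "True"})
print(charm.should_auto_restart(node_up=False))
```

Making the method abstract forces every concrete charm to state how it tracks the "started" flag, while the restart decision stays in one place.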

phvalguima commented 3 months ago

I can see ha/test_large_deployments_relation.py has started to fail. The last time I saw it run successfully was 3 days ago.

phvalguima commented 2 months ago

After some further investigation, it seems the community will abandon most of the gateway.* options in the next major release. The discussion has been taken upstream: https://github.com/opensearch-project/OpenSearch/issues/15599