Closed phvalguima closed 2 months ago
Three things to improve:
1) Count all the units, in both large and small deployments
2) We are not using an actual service restart, so we need to remove the _post_start_init clean-up config
3) Make is_service_started abstract in opensearch_base_charm.py
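A minimal sketch of what item 3 could look like, assuming a plain abstract base class. The class names BaseCharmSketch and MachineCharmSketch and the helper can_skip_start are hypothetical and are not taken from opensearch_base_charm.py; only is_service_started comes from the list above.

```python
from abc import ABC, abstractmethod


class BaseCharmSketch(ABC):
    """Hypothetical stand-in for the base charm class."""

    @abstractmethod
    def is_service_started(self) -> bool:
        """Return True when the workload service is running."""

    def can_skip_start(self) -> bool:
        # Hypothetical shared logic that relies on the abstract check.
        return self.is_service_started()


class MachineCharmSketch(BaseCharmSketch):
    """Hypothetical concrete charm that supplies the check."""

    def __init__(self, running: bool) -> None:
        self._running = running

    def is_service_started(self) -> bool:
        return self._running
```

With this shape, a subclass that forgets to implement is_service_started cannot be instantiated, so the omission surfaces at construction time rather than inside a hook.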
I can see ha/test_large_deployments_relation.py has started to fail. The last time I saw it run successfully was 3 days ago.
After some further investigation, it seems the community will abandon most of the gateway.* options in the next major release. The discussion has been taken upstream at: https://github.com/opensearch-project/OpenSearch/issues/15599
This PR aims to restart the service automatically after a full cluster reboot. Currently, neither the charm nor systemd brings the service back up once the node has restarted.
Current Issue in the Charm
Once the node is restarted, it will fire the start hook once again. Each node will detect the new hook and run the routine outside of the if self.is_node_up() check. Each unit will issue a restart-event and try to acquire the lock.
The first unit that acquires the lock will manage to get the peer-lock. That means only that unit holds it and can restart safely.
Once that unit is up, all the remaining units will have to acquire the node-lock. However, that single service will not be fully online, as it is only 1 of 2, 3, ... X nodes. Since it still holds its own metadata about its neighbours, it will block waiting for them.
Therefore, the entire cluster will get stuck waiting for the lock.
Proposal
We will focus only on nodes that have "cluster_manager" as one of the roles. All the other nodes should work with the lock requests in any case.
Akin to systemd's "enabled" concept, where a service, once active, is started automatically after a reboot, the charm will now have logic that detects a "start" hook (hence, a reboot may have happened) and:
1) Checks first if the unit is up (i.e. is_node_up() == True)
2) If not, checks whether the peer relation data marks this unit as "started" == "True". If yes, then we have one of either: (i) a deferred "start" that is wrapping up its last task, or (ii) a reboot. Case (i) is an exception (i.e. a "start" hook happened with the service down and "started" == "True")
3) As we do not need to distinguish between (i) and (ii), executes the same clean-up logic as the is_node_up() == True condition
4) Finally, restarts the systemd service
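The decision flow of the proposal can be sketched as a small pure function. This is only an illustration under stated assumptions: UnitState, on_start_action and the returned action labels are hypothetical names, and the final "first-start" fallback (service down and not yet marked as started) is an assumption about the normal start path, not something stated in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UnitState:
    """Hypothetical snapshot of what the charm knows about one unit."""

    roles: List[str]
    node_up: bool  # stands in for the result of is_node_up()
    peer_data: Dict[str, str] = field(default_factory=dict)


def on_start_action(unit: UnitState) -> str:
    """Return the action a "start" hook would take under the proposal."""
    # Only cluster_manager nodes get the new logic; the other nodes
    # keep going through the lock requests as before.
    if "cluster_manager" not in unit.roles:
        return "use-lock"
    # Step 1: the service is already up, run the usual clean-up.
    if unit.node_up:
        return "cleanup"
    # Step 2: service down but peer data says "started" == "True":
    # either a deferred start or a reboot. Steps 3 and 4 handle both
    # the same way: clean up, then restart the systemd service.
    if unit.peer_data.get("started") == "True":
        return "cleanup-and-restart"
    # Assumption: anything else is a regular first start.
    return "first-start"
```

For example, a rebooted cluster_manager unit (service down, "started" set to "True" in the peer data) maps to "cleanup-and-restart" without going through the lock.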