Closed reneradoi closed 1 month ago
I believe the issue was caused by this change: https://github.com/canonical/opensearch-operator/pull/279/files#diff-f0ef92b3c155cc488340483aeb7a14f44e6577ccb3a328d8856fab4984072e57L248-R248
Proposal: solve the issue by reverting that change
also (for context), related: #243 (comment)
I've tested the proposed solution, but it only seems to work for the first unit to start up. As soon as the second unit requests the lock, it can't get it. This is from the juju debug-log:
unit-opensearch-1: 07:01:13 DEBUG unit.opensearch/1.juju-log [Node lock] No unit has opensearch lock
unit-opensearch-1: 07:01:13 DEBUG unit.opensearch/1.juju-log [Node lock] Using peer databag for lock
unit-opensearch-1: 07:01:13 DEBUG unit.opensearch/1.juju-log [Node lock] Not acquired. Unit with peer databag lock: None
unit-opensearch-1: 07:01:13 DEBUG unit.opensearch/1.juju-log Lock to start opensearch not acquired. Will retry next event
And it goes on like this with no end.
@reneradoi can you attach full logs? would like to look at every [Node lock] log (can filter myself if easier)
the 2nd unit should use opensearch for the lock, not the peer databag—is the first unit of opensearch online & reachable?
Hi Carl, here's the full debug-log: opensearch_locking_debug_log.log
For my understanding: The condition is `if not unit and online_nodes >= 2:`, where `online_nodes` are the ones in the `/_nodes` endpoint of the cluster. But if the 2nd node is not started yet, it is not in this list, and so the length of this list is only 1?
Oops, you're completely right. I was mixing up the stop case with the start case. Yes, it should use the peer databag when starting the 2nd unit, and then on the 3rd unit it should use opensearch.
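To make the start case concrete, here is a minimal sketch of the decision as described in this thread. It only models the situation where no unit currently holds the opensearch lock (the `not unit` part of the condition). The function name and the returned strings are hypothetical and are not taken from the charm's code.

```python
def lock_backend_when_unlocked(online_nodes: int) -> str:
    """Pick the lock backend, assuming no unit currently holds the opensearch lock.

    `online_nodes` stands for the number of nodes listed by the cluster's
    /_nodes endpoint. Hypothetical helper, for illustration only.
    """
    # Mirrors the quoted condition `if not unit and online_nodes >= 2:`
    if online_nodes >= 2:
        return "opensearch"
    return "peer databag"


# Starting the 2nd unit: only unit 0 is online, so the peer databag is used.
print(lock_backend_when_unlocked(1))
# Starting the 3rd unit: two nodes are online, so opensearch is used.
print(lock_backend_when_unlocked(2))
```

This matches the log above: unit 1 sees only one online node, so it falls back to the peer databag and has to wait for the leader to grant the lock.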
@reneradoi Looking at the logs, it appears the issue is unit 0 (leader) is not granting the lock
Filtering the logs to unit 0, we can see that from 06:57:06 (when unit 1 requested the lock) to the end of the logs, unit 0 is processing events. My guess is that, because of the deferred events and other events, it hasn't yet received the relation-changed event from unit 1 requesting the lock. So unit 0 needs to process the queued events and receive the relation-changed event before unit 1 will get the lock.
Issue
https://github.com/canonical/opensearch-operator/issues/309
Solution
When a unit releases the lock it is currently holding, also release stale locks in the opensearch database that belong to units that no longer exist.
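The idea can be sketched as follows. This is an illustration under stated assumptions, not the charm's implementation: how locks are actually stored in opensearch is not shown here, so they are modeled as an in-memory set of unit names, and `release_lock` is a hypothetical helper.

```python
from typing import Set


def release_lock(
    lock_holders: Set[str], releasing_unit: str, existing_units: Set[str]
) -> Set[str]:
    """Return the lock holders that remain after `releasing_unit` releases.

    Besides dropping the releasing unit's own lock, this also drops stale
    entries held by units that no longer exist. Hypothetical in-memory model.
    """
    return {
        holder
        for holder in lock_holders
        if holder != releasing_unit and holder in existing_units
    }


# opensearch/3 was removed from the application but its lock entry was left behind.
holders = {"opensearch/0", "opensearch/3"}
remaining = release_lock(holders, "opensearch/0", {"opensearch/0", "opensearch/1"})
print(remaining)
```

Cleaning up during release means a leftover entry from a removed unit cannot block the next unit that requests the lock.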