canonical / opensearch-operator

OpenSearch operator

Apache License 2.0

[DPE-4421] Release stale node locks #312

Closed reneradoi closed 1 month ago

reneradoi commented 1 month ago

Issue

https://github.com/canonical/opensearch-operator/issues/309

Solution

When a unit releases the lock it currently holds, also release any stale locks in the opensearch database that belong to units that no longer exist.
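A minimal sketch of the idea, not the operator's actual code: a plain dict stands in for the lock records stored in opensearch, and `release_locks`, the lock ids, and the unit names are all hypothetical names chosen for illustration.

```python
from __future__ import annotations


def release_locks(
    locks: dict[str, str],
    releasing_unit: str,
    existing_units: set[str],
) -> dict[str, str]:
    """Return the locks that remain after `releasing_unit` releases its lock.

    `locks` maps a (hypothetical) lock id to the name of the unit holding it.
    A lock is dropped if it is held by the releasing unit, or if its holder
    is no longer among `existing_units` (a stale lock).
    """
    return {
        lock_id: holder
        for lock_id, holder in locks.items()
        if holder != releasing_unit and holder in existing_units
    }


# Example: opensearch/3 was removed from the model but its lock was left behind.
locks = {"lock-a": "opensearch/0", "lock-b": "opensearch/3"}
existing = {"opensearch/0", "opensearch/1"}
remaining = release_locks(locks, "opensearch/0", existing)
```

In this example both the releasing unit's own lock and the stale lock of the removed unit are dropped, so no unit is blocked by a holder that can never release.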

reneradoi commented 1 month ago

I believe the issue was caused by this change: https://github.com/canonical/opensearch-operator/pull/279/files#diff-f0ef92b3c155cc488340483aeb7a14f44e6577ccb3a328d8856fab4984072e57L248-R248

Proposal: solve the issue by reverting that change

also (for context), related: #243 (comment)

I've tested the proposed solution, but it only seems to work for the first unit to start up. As soon as the second unit requests the lock, it can't get it. This is from the juju debug-log:

unit-opensearch-1: 07:01:13 DEBUG unit.opensearch/1.juju-log [Node lock] No unit has opensearch lock
unit-opensearch-1: 07:01:13 DEBUG unit.opensearch/1.juju-log [Node lock] Using peer databag for lock
unit-opensearch-1: 07:01:13 DEBUG unit.opensearch/1.juju-log [Node lock] Not acquired. Unit with peer databag lock: None
unit-opensearch-1: 07:01:13 DEBUG unit.opensearch/1.juju-log Lock to start opensearch not acquired. Will retry next event

And it goes on like this indefinitely.

carlcsaposs-canonical commented 1 month ago

@reneradoi can you attach full logs? would like to look at every [Node lock] log (can filter myself if easier)

the 2nd unit should use opensearch for the lock, not the peer databag—is the first unit of opensearch online & reachable?

reneradoi commented 1 month ago

Hi Carl, here's the full debug-log: opensearch_locking_debug_log.log

For my understanding: The condition is if not unit and online_nodes >= 2:, where online_nodes counts the nodes returned by the cluster's /_nodes endpoint. But if the 2nd node has not started yet, it is not in that list, so the length of the list is only 1?

carlcsaposs-canonical commented 1 month ago

For my understanding: The condition is

Oops, you're completely right. I was mixing up the stop case with the start case. Yes, it should use the peer databag when starting the 2nd unit and then on the 3rd unit it should use opensearch
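A small sketch of the condition as quoted in this thread, assuming `online_nodes` is the number of nodes reported by `/_nodes` and `unit_with_lock` is the unit currently holding the opensearch lock (or None). The function name `uses_opensearch_lock` is hypothetical, and this covers only the check discussed here, not the rest of the locking logic.

```python
from __future__ import annotations


def uses_opensearch_lock(unit_with_lock: str | None, online_nodes: int) -> bool:
    """Mirror the quoted check: `if not unit and online_nodes >= 2`.

    True means the opensearch lock is attempted; False means the unit
    falls back to the peer databag lock.
    """
    return not unit_with_lock and online_nodes >= 2


# Starting the 2nd unit: only the 1st node is online, so the peer databag is used.
second_unit = uses_opensearch_lock(None, online_nodes=1)

# Starting the 3rd unit: two nodes are online, so the opensearch lock is used.
third_unit = uses_opensearch_lock(None, online_nodes=2)
```

The starting unit is not yet listed in `/_nodes`, which is why the count seen while starting the 2nd unit is 1 rather than 2.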

carlcsaposs-canonical commented 1 month ago

@reneradoi Looking at the logs, it appears the issue is that unit 0 (the leader) is not granting the lock

lock-logs.txt

Filtering the logs to unit 0, we can see that from 06:57:06 (when unit 1 requested the lock) to the end of the logs, unit 0 is processing events. My guess is that, because of the deferred events and other events, it hasn't yet received the relation-changed event from unit 1 requesting the lock. Unit 0 needs to process the queued events and receive that relation-changed event before unit 1 will get the lock

o0.log