canonical / charm-rolling-ops

Apache License 2.0

Rolling Ops may be seeing "data drift" in databags #13

Open phvalguima opened 7 months ago

phvalguima commented 7 months ago

While troubleshooting ZooKeeper alongside @marcoppenheimer, it became clear that the databag contents vary between relations. In this case specifically, the same unit shows lock state: "acquire" in one relation and state: "release" in another.

Full pastebin: https://pastebin.com/mJmvdKz5

In this deployment, there are 3x zookeeper units, where zookeeper/0 is the leader. Each zookeeper unit emits extra logs indicating when it has started restarting. Each restart should take at least 5 seconds, as there is a time.sleep(5) in the code right after the actual zk restart.

unit-zookeeper-0: 14:36:24 INFO unit.zookeeper/0.juju-log cluster:1: relation.data={<ops.model.Unit zookeeper/0>: {'egress-subnets': '10.70.223.177/32', 'ingress-address': '10.70.223.177', 'private-address': '10.70.223.177', 'state': 'acquire'}, <ops.model.Application zookeeper>: {'<ops.model.Unit zookeeper/0>': 'idle', '<ops.model.Unit zookeeper/1>': 'granted', '<ops.model.Unit zookeeper/2>': 'idle'}, <ops.model.Unit zookeeper/2>: {'egress-subnets': '10.70.223.3/32', 'ingress-address': '10.70.223.3', 'private-address': '10.70.223.3', 'state': 'release'}, <ops.model.Unit zookeeper/1>: {'egress-subnets': '10.70.223.86/32', 'ingress-address': '10.70.223.86', 'private-address': '10.70.223.86', 'state': 'acquire'}}

........

unit-zookeeper-1: 14:36:24 INFO unit.zookeeper/1.juju-log cluster:1: Calling callback - _restart...
unit-zookeeper-1: 14:36:24 INFO unit.zookeeper/1.juju-log cluster:1: zookeeper/1 restarting...
unit-zookeeper-0: 14:36:24 DEBUG unit.zookeeper/0.juju-log restart:2: ops 2.9.0 up and running.
unit-zookeeper-0: 14:36:24 DEBUG unit.zookeeper/0.juju-log restart:2: Emitting Juju event restart_relation_changed.
unit-zookeeper-0: 14:36:24 DEBUG unit.zookeeper/0.juju-log restart:2: Emitting custom event <ProcessLocks via ZooKeeperCharm/on/restart_process_locks[253]>.
unit-zookeeper-0: 14:36:24 INFO unit.zookeeper/0.juju-log restart:2: relation.data={<ops.model.Unit zookeeper/0>: {'egress-subnets': '10.70.223.177/32', 'ingress-address': '10.70.223.177', 'private-address': '10.70.223.177', 'state': 'acquire'}, <ops.model.Application zookeeper>: {'<ops.model.Unit zookeeper/0>': 'idle', '<ops.model.Unit zookeeper/1>': 'granted', '<ops.model.Unit zookeeper/2>': 'idle'}, <ops.model.Unit zookeeper/2>: {'egress-subnets': '10.70.223.3/32', 'ingress-address': '10.70.223.3', 'private-address': '10.70.223.3', 'state': 'release'}, <ops.model.Unit zookeeper/1>: {'egress-subnets': '10.70.223.86/32', 'ingress-address': '10.70.223.86', 'private-address': '10.70.223.86', 'state': 'release'}}

I believe the databags seen in the restart and cluster relations are different. In one, zookeeper/1 has state acquire; in the other, it has been marked as release.

github-actions[bot] commented 7 months ago

https://warthogs.atlassian.net/browse/DPE-3590

phvalguima commented 7 months ago

I have discussed this issue with John Meinel, from the Juju team. There is a chance that we have diverging databags here. It is hard to prove definitively, as that would mean looking into how the Juju controller was keeping records in its databases at that moment. It is also not an issue for Juju itself. Therefore, I will add this fix to the upcoming PR for deferred events.

For more details, check this conversation in the public Matrix chat

jameinel commented 7 months ago

Just to clarify a bit here: suppose you have two applications that are related on two relations (I believe the specifics here were two peer relations, but I would expect those to behave similarly).

If appA/0 sets a value on relations X and Y, I would expect two events to be pending for appB/0. You won't see the data for a relation be updated before its associated -relation-changed event has fired.

We don't guarantee whether X or Y will fire first, but whichever one does, the other will not have its data updated yet. (E.g., for X-relation-changed followed by Y-relation-changed: in the first hook you will see the updated values from appA/0 for X but not for Y; once Y-relation-changed has fired, you will see both X and Y, because you have been told about both.)

In this particular case, it was two peer relations, where the charm was trying to set something in one peer relation based on a status in the other, but it is never true that both will be updated atomically. We never fire an "X&Y-relation-changed" event, and we don't update the values you can read before you get the event telling you they changed. (Otherwise, you would have to worry on every hook about whether any other databag has changed.)
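The ordering rule described above can be illustrated with a small simulation. This is a hypothetical sketch, not real Juju or ops code: `FakeController` and `FakeUnit` are invented names standing in for the controller's authoritative databags and a remote unit's cached view, under the assumption that a unit's view of a relation is refreshed only when that relation's own -relation-changed event is delivered.

```python
class FakeController:
    """Holds the authoritative databags and a queue of pending events.

    Hypothetical model of the behaviour described above, not Juju's API.
    """

    def __init__(self):
        self.authoritative = {"X": {}, "Y": {}}
        self.pending = []  # relations with an un-fired -relation-changed

    def set_data(self, relation, key, value):
        self.authoritative[relation][key] = value
        self.pending.append(relation)


class FakeUnit:
    """A remote unit: it sees a cached copy of each databag, refreshed
    only when the matching -relation-changed event is delivered."""

    def __init__(self, controller):
        self.controller = controller
        self.view = {"X": {}, "Y": {}}

    def deliver_next_event(self):
        relation = self.controller.pending.pop(0)
        # Only this relation's cached view is refreshed; the other
        # relation's view stays stale until its own event fires.
        self.view[relation] = dict(self.controller.authoritative[relation])
        return relation


ctrl = FakeController()
unit = FakeUnit(ctrl)

# appA/0 writes to both relations "at the same time".
ctrl.set_data("X", "state", "acquire")
ctrl.set_data("Y", "state", "acquire")

fired = unit.deliver_next_event()  # X-relation-changed
assert fired == "X"
assert unit.view["X"] == {"state": "acquire"}
assert unit.view["Y"] == {}        # Y not visible yet

unit.deliver_next_event()          # Y-relation-changed
assert unit.view["Y"] == {"state": "acquire"}
```

The key point the sketch captures is that there is no atomic cross-relation update: between the two events, the unit observes exactly the kind of "drift" reported in this issue.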

phvalguima commented 7 months ago

hi @jameinel, thanks for the clarifications.

I think the solution here is to have rolling-ops only consider hooks from its own peer relation. One test I've done is:

    def _on_relation_changed(self: CharmBase, event: RelationChangedEvent):
        # Ignore events dispatched from another relation's hook context:
        # JUJU_DISPATCH_PATH names the hook currently being run.
        # (Assumes `os` is imported at module level.)
        if self.name not in os.environ.get("JUJU_DISPATCH_PATH", ""):
            return

This ensures we only react to relation-changed events for the peer relation that matters to rolling-ops.
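The guard can be exercised outside a charm by simulating the hook environment. This is a sketch under the assumption that JUJU_DISPATCH_PATH contains the hook path of the dispatching relation (e.g. something like `hooks/restart-relation-changed`); the value below is illustrative, not captured from a real model, and `event_is_for_relation` is a hypothetical helper mirroring the check in the snippet above.

```python
import os


def event_is_for_relation(relation_name: str) -> bool:
    """Return True when the current dispatch path belongs to the given
    relation, mirroring the substring check used in the handler above."""
    return relation_name in os.environ.get("JUJU_DISPATCH_PATH", "")


# Simulated hook context (illustrative value only).
os.environ["JUJU_DISPATCH_PATH"] = "hooks/restart-relation-changed"

assert event_is_for_relation("restart") is True   # our own relation's hook
assert event_is_for_relation("cluster") is False  # some other hook context
```

A substring check like this is deliberately loose; a stricter variant could parse the hook name out of the path before comparing.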

tonyandrewmeyer commented 6 months ago

@phvalguima while you're working on this, I noticed two small bugs that won't cause any problems but might as well get fixed:

phvalguima commented 6 months ago

After I discussed this with Ben and @tonyandrewmeyer, they recommended a more elegant solution.

In a nutshell, we should not be seeing restart-relation-changed, or any other related *-lock event, firing in other hook contexts. The reason that may be happening is this relation-changed emit, which causes the entire relation-changed logic to be called outside of its corresponding hook.

We should try removing it first.

There is one scenario we have to consider in this case. If we call _on_acquire_lock and the lock is already in state: acquire, then there will be no actual relation-changed event. The way I see it, this is actually a good thing (i.e. it acts as a filter, collapsing many acquire.emit() calls into a single granted eventually).
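The filtering behaviour described above can be sketched with a minimal model. This is a hypothetical illustration (the `FilteringDatabag` class is invented, not part of rolling-ops or ops) of the assumption that writing a value already present in the databag produces no relation-changed event:

```python
class FilteringDatabag:
    """Toy model of a relation databag where no-op writes do not
    trigger a relation-changed event. Illustrative only."""

    def __init__(self):
        self._data = {}
        self.changed_events = 0

    def set(self, key, value):
        if self._data.get(key) == value:
            # No-op write: under the assumption above, Juju would not
            # fire relation-changed, so repeated acquires are filtered.
            return
        self._data[key] = value
        self.changed_events += 1


bag = FilteringDatabag()
bag.set("state", "acquire")
bag.set("state", "acquire")  # repeated acquire: filtered out
bag.set("state", "acquire")
assert bag.changed_events == 1
```

If this assumption holds, many redundant acquire requests collapse into one effective state change, which is the single eventual "granted" the comment describes.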