This PR implements the flow for restoring a ZooKeeper snapshot.
## Other changes

- Add a retry configuration to the S3 resource in case of a flaky network (see the first sketch after this list).
- Recursively update ACLs if the requested zNode already exists.
- Write a `client-jaas.cfg` file to streamline access to the ZK cluster from within the machine (see the second sketch after this list).
- Update ACLs so that only the superuser can access zNodes that are not currently requested by a related client application.
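For context, here is a minimal sketch of the kind of retry configuration meant for the S3 resource, assuming a boto3-based client; the exact option values, bucket, and key names are placeholders and may differ from the charm:

```python
import boto3
from botocore.config import Config

# Retry transient failures a few times before giving up, to cope with flaky networks.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 5, "mode": "standard"}),
)

# Placeholder bucket/key, only to show the client in use.
s3.download_file("my-backups", "snapshots/zk-snapshot.tar.gz", "/tmp/zk-snapshot.tar.gz")
```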
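And a rough idea of what the `client-jaas.cfg` file looks like, rendered here from Python; the target path and credentials are placeholders, assuming digest authentication for the superuser:

```python
from pathlib import Path

# Standard ZooKeeper client JAAS section for digest auth; values are placeholders.
CLIENT_JAAS = """\
Client {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    username="super"
    password="super-password";
};
"""

# Hypothetical location; the charm's actual path may differ.
Path("/tmp/client-jaas.cfg").write_text(CLIENT_JAAS)
```

With such a file in place, a CLI client can pick it up through the JVM flag `-Djava.security.auth.login.config=/path/to/client-jaas.cfg`.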
## Use cases

This flow can be used to bootstrap a new cluster with seeded data. Upon relating with a new client application, say Kafka, here is what will happen:

- The ACLs for `/kafka` and its sub-zNodes will be updated for the new relation, if they already existed (see the sketch after this list).
- Kafka's relation flow will recreate its `admin` and `sync` users, but other pre-existing users (and other sub-zNodes) are kept around.
- The other zNodes that were already present in the restored snapshot will have their ACLs reset to the ZK superuser only, instead of being deleted.
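To make the ACL handling more concrete, here is a minimal sketch of a recursive ACL update with kazoo. The connection details, usernames, and paths are illustrative only, not the charm's actual code:

```python
from kazoo.client import KazooClient
from kazoo.security import make_digest_acl


def set_acls_recursively(client: KazooClient, path: str, acls: list) -> None:
    """Apply the same ACLs to a zNode and all of its children."""
    client.set_acls(path, acls)
    for child in client.get_children(path):
        set_acls_recursively(client, f"{path.rstrip('/')}/{child}", acls)


client = KazooClient(
    hosts="127.0.0.1:2181",
    auth_data=[("digest", "super:super-password")],
)
client.start()

# Grant the newly related application full access to its subtree, e.g. /kafka.
set_acls_recursively(client, "/kafka", [make_digest_acl("relation-42", "pwd", all=True)])

# Reset a subtree that no client currently requests to superuser-only access.
set_acls_recursively(client, "/old-app", [make_digest_acl("super", "super-password", all=True)])

client.stop()
```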
With a minimal change in the Kafka charm, we can also restore a snapshot on an already related ZK application. The chain of events is as follows:

- Clients are disconnected by removing the `endpoints` field from the databag (see the sketch after this list). On Kafka, this has the effect of triggering the `relation-changed` event and setting the application's status to `ZK_NO_DATA`.
- After restoring the snapshot, we can then follow the same procedure as above: re-writing the databag with the required fields triggers the `relation-changed` event, and Kafka's relation flow then rotates the `admin` and `sync` user passwords.
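As a rough illustration of the disconnect step, assuming raw ops relation data rather than the charm's actual data-interfaces helpers (the `zookeeper` endpoint name is also an assumption):

```python
from ops.charm import CharmBase


def disconnect_clients(charm: CharmBase, endpoint: str = "zookeeper") -> None:
    """Clear the endpoints field on every client relation (leader only).

    Clearing the field fires relation-changed on the client side, which Kafka
    interprets as ZK_NO_DATA.
    """
    if not charm.unit.is_leader():
        return
    for relation in charm.model.relations[endpoint]:
        relation.data[charm.app].update({"endpoints": ""})
```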
Here is the patch for Kafka:
```diff
diff --git a/src/events/zookeeper.py b/src/events/zookeeper.py
index 4b4f5a5..5a1adc0 100644
--- a/src/events/zookeeper.py
+++ b/src/events/zookeeper.py
@@ -71,7 +71,7 @@ class ZooKeeperHandler(Object):
             event.defer()
             return
 
-        if not self.charm.state.cluster.internal_user_credentials and self.model.unit.is_leader():
+        if self.model.unit.is_leader():
             # loading the minimum config needed to authenticate to zookeeper
             self.dependent.config_manager.set_zk_jaas_config()
             self.dependent.config_manager.set_server_properties()
@@ -87,6 +87,12 @@ class ZooKeeperHandler(Object):
             for username, password in internal_user_credentials:
                 self.charm.state.cluster.update({f"{username}-password": password})
 
+        # Kafka keeps a meta.properties in every log.dir with a unique ClusterID
+        # this ID is provided by ZK, and removing it on relation-changed allows
+        # re-joining to another ZK cluster while restoring.
+        for storage in self.charm.model.storages["data"]:
+            self.charm.workload.exec(["rm", f"{storage.location}/meta.properties"])
+
         # attempt re-start of Kafka for all units on zookeeper-changed
         # avoids relying on deferred events elsewhere that may not exist after cluster init
         if not self.dependent.healthy:
```
## About the restore flow

Since we need to operate the workload on all units, we use a chain of events instead of staying within the context of the action execution.

We want to keep the units synchronized while they follow these steps:

1. stop the workload
2. move the existing data away, and download the snapshot in its place
3. restart the workload
4. clean up leftover files

Therefore, we do not use the rolling ops lib; we use a different peer relation to manage this flow, making sure all units are done with a step before moving on to the next (see the sketch below).

- Instead of using the peer `cluster` relation, we now have a new `restore` one. There are two main reasons why:
  - it makes following the chain of events much easier, because the dozens of events fired during the flow do not trigger the "something changed" `cluster-relation-changed`
  - this separation of concerns guarantees that if anything goes wrong, we can ssh into the machine to fix the issue, and then continue the flow with `juju resolve`. This is also why there are no defers or try/except around the `exec` calls

NEW: we use the peer `cluster` relation instead of the dedicated `restore` one from earlier commits.
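For illustration, here is a minimal sketch of how the step synchronization could look, using unit databags on the peer relation. The step names and helper functions are assumptions for the example, not the charm's actual implementation:

```python
from ops.charm import CharmBase
from ops.model import Relation

# Hypothetical ordered steps of the restore flow.
RESTORE_STEPS = ["stop-workload", "move-data", "restart-workload", "cleanup"]


def mark_step_done(charm: CharmBase, relation: Relation, step: str) -> None:
    """Record in this unit's databag that it has completed `step`."""
    relation.data[charm.unit].update({"restore-step": step})


def all_units_done(charm: CharmBase, relation: Relation, step: str) -> bool:
    """True once every peer, including this unit, reports `step` as completed."""
    units = set(relation.units) | {charm.unit}
    return all(relation.data[unit].get("restore-step") == step for unit in units)
```

On each peer `relation-changed`, a unit would only start the next step once `all_units_done` returns true for the previous one; if an `exec` call fails, the hook simply errors out, and the flow can be resumed with `juju resolve` after fixing the machine by hand.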
## TODO

- [x] Add an action message after initiating the restore flow
- [x] Notify clients upon completion. How can we restart the `relation-changed` chain of events?
- [x] Block / defer other operations during restore
- [x] Add integration tests
- [x] Move stuff around? Should the new peer relation interface go somewhere else, like core/cluster or core/models?
- [x] Reset ACLs instead of removing the zNodes upon relation departure, following the team discussion on MM
## Demo
https://github.com/user-attachments/assets/07a5d880-1b18-4668-9eea-fbc18287fb04