Storage reuse doesn't work if we use a new application
Solution
There are multiple problems to face when reusing storage in a new application:
We need to be able to detect that we are using a new application and not adding back to the same replica set.
We cannot assume we still have all the users' passwords, so we need to change them.
We might have to change the replica set name.
In any case, we need to reconfigure the replica set.
The solution chosen here is the following:
We store a file containing a random string on the storage volume.
This string is also stored as a secret (because we have access to app secrets in the install event, which is not the case for the app peer databag); a sketch of this detection is given after this list.
If we detect such reuse, then after installing Charmed MongoDB we start mongod in a degraded mode: no replica set, no auth validation. This allows us to patch the deployment (see the degraded-mode sketch after this list).
Then we initiate the replica set, using authentication this time to make sure it works, because the users already exist in this case.
We add one more optional reconfigure step: in at least two situations, all of the IPs can change at once. This leaves the replica set fully broken, with no host able to find the other members. To fix this, a new method forcefully reconfigures the replica set in one specific case: none of the IPs in the MongoDB replica set config matches the IPs stored in the databag. This is achieved by opening a standalone connection to the node (which doesn't require any server selection in the replica set) and reading the config (not the status, which does require being connected to the other nodes). A sketch of this forced reconfigure follows the list.
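As an illustration of the detection step, here is a minimal sketch. The marker path, function names and the way the secret token is passed in are assumptions for the example, not the charm's actual code.

```python
import secrets
from pathlib import Path

# Hypothetical location of the marker file on the MongoDB storage volume.
MARKER_FILE = Path("/var/lib/mongodb/.charm_storage_marker")


def write_marker() -> str:
    """Write a random string to the storage volume and return it,
    so the caller can also store it as an app secret."""
    token = secrets.token_hex(16)
    MARKER_FILE.write_text(token)
    return token


def storage_belongs_to_other_app(secret_token: str | None) -> bool:
    """Detect that the volume was written by another application:
    a marker exists on disk but doesn't match the token in our secrets."""
    if not MARKER_FILE.exists():
        return False  # fresh volume, nothing was reused
    if secret_token is None:
        return True   # marker present but this app never wrote it
    return MARKER_FILE.read_text() != secret_token
```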
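The degraded-mode repair and the authenticated re-initiation could look roughly like this with pymongo; it assumes mongod is first restarted locally without --replSet and without auth, then restarted with both, and the operator user name, password handling and host are illustrative only.

```python
from pymongo import MongoClient


def patch_operator_password(new_password: str) -> None:
    """With auth validation disabled, reset the password we no longer know."""
    client = MongoClient("mongodb://localhost:27017/", directConnection=True)
    client.admin.command("updateUser", "operator", pwd=new_password)
    client.close()


def init_replica_set_with_auth(password: str, replset: str, host: str) -> None:
    """After mongod is restarted with auth and --replSet enabled, initiate
    the replica set with credentials so we know authentication works."""
    uri = f"mongodb://operator:{password}@{host}:27017/admin"
    client = MongoClient(uri, directConnection=True)
    client.admin.command(
        "replSetInitiate",
        {"_id": replset, "members": [{"_id": 0, "host": f"{host}:27017"}]},
    )
    client.close()
```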
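The forced reconfigure could be sketched as follows, assuming pymongo, an operator connection URI, and a databag that lists the expected host:port strings in member order; the function name and parameters are illustrative, not the charm's actual helpers.

```python
from pymongo import MongoClient


def force_reconfig(operator_uri: str, databag_hosts: list[str]) -> None:
    """Forcefully rewrite the replica set hosts when none of the configured
    IPs matches the IPs expected from the peer databag."""
    # Standalone (direct) connection: no server selection is possible when
    # every member's IP has changed and no primary can be reached.
    client = MongoClient(operator_uri, directConnection=True)

    # replSetGetConfig only needs this node; replSetGetStatus would require
    # being connected to the other members.
    config = client.admin.command("replSetGetConfig")["config"]
    configured_hosts = {member["host"] for member in config["members"]}

    if configured_hosts & set(databag_hosts):
        client.close()
        return  # at least one host still matches: no forced reconfigure

    # Rewrite every member's host from the databag, bump the config version,
    # and apply the new config with force=True.
    for member, host in zip(config["members"], databag_hosts):
        member["host"] = host
    config["version"] += 1
    client.admin.command({"replSetReconfig": config, "force": True})
    client.close()
```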
The (not so big) drawback:
Some of the most recent data might be lost: WiredTiger writes snapshots to disk every 60 seconds, so writes from the last minute may not have been persisted.
I tried different approaches (only updating the users and renaming the replica set) to avoid that, which ended in a core dump of mongod (yes…) and MongoDB restarting in a loop because some config collections were broken. I haven't been able to figure out what the issue was, so I decided to go with this in-between solution.
I find this an acceptable solution, given that:
If we're reusing storage in a new application willingly, we can leave time for the snapshot to be written to disk.
If we're reusing storage after a crash, losing the very latest data is an acceptable loss.
Implements
Storage reuse on the same cluster (scale down, scale to zero).
Storage reuse on a different cluster with same app name.
Storage reuse on a different cluster with different app name.
Restore after a full cluster crash.
Suggestions for the future
A big part of these issues comes from the fact that we use IPs rather than DNS names inside the cluster.
Having DNS inside the cluster would help: a network crash or restarting machines wouldn't change the name of each unit, and the cluster would survive IP changes.
An integration with external secrets could also make this easier: deploying a new application connected to external secrets would avoid losing the secrets from the first application, letting it restart as before with "only" a big reconfiguration of hosts.