Mehdi-Bendriss opened this issue 2 months ago
@Mehdi-Bendriss indeed, I've opened a PR to shuffle the checks in the plugin manager.
Doing some digging @Mehdi-Bendriss

Back in the day, the decision to wait for the OpenSearch cluster was made because we needed: (1) to cover the case where some plugins manage things via API calls; (2) to know the OpenSearch version; and (3) to load the default settings from `GET /_cluster/settings`. So we needed the cluster up and running.
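For context, a minimal sketch (function and variable names are mine, not the charm's) of why the cluster had to be reachable: the defaults come from `GET /_cluster/settings?include_defaults=true`, whose sections merge with transient settings taking precedence over persistent ones, and persistent over defaults:

```python
# Hypothetical sketch: flattening the payload of
# `GET /_cluster/settings?include_defaults=true` into one view.
def merged_cluster_settings(payload: dict) -> dict:
    """Merge defaults < persistent < transient, mirroring OpenSearch precedence."""
    merged = {}
    for section in ("defaults", "persistent", "transient"):
        merged.update(payload.get(section, {}))
    return merged

sample = {
    "defaults": {"cluster.routing.allocation.enable": "all"},
    "persistent": {"cluster.routing.allocation.enable": "primaries"},
    "transient": {},
}
print(merged_cluster_settings(sample)["cluster.routing.allocation.enable"])  # primaries
```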
I am breaking this up into:
- `check_plugin_manager_health`
- `opensearch_distro.version`, which loads the `workload_version` file we already have present instead of making an API call

That still frees `config_changed` to just call `plugin_manager.run()` before everything is set, as the `run()` method changes hard configuration only.
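The version lookup above reduces to a local file read; a minimal sketch (the helper name is illustrative, not the charm's actual API):

```python
import tempfile
from pathlib import Path

def local_workload_version(path) -> str:
    # Read the version string shipped alongside the charm; no running cluster needed.
    return Path(path).read_text().strip()

# Demo with a throwaway file standing in for the charm's `workload_version` file.
with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "workload_version"
    f.write_text("2.14.0\n")
    print(local_workload_version(f))  # 2.14.0
```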
That is going to demand some attention on `OpenSearchKeystore`. We now have to account for the case where the keystore has not been created yet because the first start has not finished. In that case, we need to save the keystore password for later and pass it to OpenSearch at startup time, instead of leaving OpenSearch to manage it.
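A toy sketch of that deferral pattern, under my own invented names (this is not the charm's `OpenSearchKeystore` API): if the keystore is not created yet, the password is stashed and applied once the service starts.

```python
# Hypothetical sketch: defer the keystore password until first start completes.
class KeystoreSketch:
    def __init__(self):
        self.created = False
        self.password = None
        self._pending_password = None

    def set_password(self, password: str) -> None:
        if not self.created:
            # Keystore not created yet (first start unfinished): remember for later.
            self._pending_password = password
            return
        self.password = password

    def on_opensearch_start(self) -> None:
        self.created = True
        if self._pending_password is not None:
            # Apply the saved password at startup time.
            self.password = self._pending_password
            self._pending_password = None

ks = KeystoreSketch()
ks.set_password("s3cret")   # arrives before the keystore exists: deferred
ks.on_opensearch_start()    # applied at startup
print(ks.password)          # s3cret
```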
Whilst I agree we should deal with `plugin_manager` configuration independently of whether the cluster is ready, I did quite some digging into this issue and found we are stuck in an endless loop: an `opensearch-peers-changed` hook is issued > it calls the deferred `config-changed` > which changes the content of the peer databag > gets deferred > which retriggers a new `peers-changed`.
This is caused by a change that happens within `opensearch_peers_relation_changed`: at the deferred `config-changed`, the peer databag starts with:
```
'deployment-description': '{"config": {"cluster_name": "backup-test", "init_hold": false, "roles": ["cluster_manager"], "data_temperature": null}, "start": "start-with-provided-roles", "pending_directives": [], "typ": "main-orchestrator", "app": "main", "state": {"value": "active", "message": ""}, "promotion_time": 1714724984.931757}'
```
And finishes with:
```
'deployment-description': '{"config": {"cluster_name": "backup-test", "init_hold": false, "roles": ["cluster_manager"], "data_temperature": null}, "start": "start-with-provided-roles", "pending_directives": [], "typ": "main-orchestrator", "app": "main", "state": {"value": "active", "message": ""}, "promotion_time": 1714724436.7171}'
```
That is caused by this: https://github.com/canonical/opensearch-operator/blob/46bc4a8f46905228a576c97d29de1572dc141bdd/lib/charms/opensearch/v0/models.py#L200, which wrongly resets the `promotion_time`.
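A plain-function sketch of the kind of fix this suggests (not the charm's actual validator, which is a pydantic validator in `models.py`): only stamp `promotion_time` when it is absent, so re-parsing an existing deployment description leaves the databag stable.

```python
import time

# Hypothetical sketch: an idempotent version of the promotion-time validator.
def set_promotion_time(values: dict) -> dict:
    # Only stamp the time when missing; re-validation must not rewrite it.
    if values.get("typ") == "main-orchestrator" and not values.get("promotion_time"):
        values["promotion_time"] = time.time()
    return values

desc = {"typ": "main-orchestrator", "promotion_time": 1714724984.931757}
print(set_promotion_time(desc)["promotion_time"])  # unchanged: 1714724984.931757
```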
Full stack trace:
```
/var/lib/juju/agents/unit-main-0/charm/src/charm.py(264)<module>()
-> main(OpenSearchOperatorCharm)
/var/lib/juju/agents/unit-main-0/charm/venv/ops/main.py(544)main()
-> manager.run()
/var/lib/juju/agents/unit-main-0/charm/venv/ops/main.py(520)run()
-> self._emit()
/var/lib/juju/agents/unit-main-0/charm/venv/ops/main.py(506)_emit()
-> self.framework.reemit()
/var/lib/juju/agents/unit-main-0/charm/venv/ops/framework.py(859)reemit()
-> self._reemit()
/var/lib/juju/agents/unit-main-0/charm/venv/ops/framework.py(939)_reemit()
-> custom_handler(event)
/var/lib/juju/agents/unit-main-0/charm/lib/charms/opensearch/v0/opensearch_base_charm.py(617)_on_config_changed()
-> previous_deployment_desc = self.opensearch_peer_cm.deployment_desc()
/var/lib/juju/agents/unit-main-0/charm/lib/charms/opensearch/v0/opensearch_peer_clusters.py(323)deployment_desc()
-> return DeploymentDescription.from_dict(current_deployment_desc)
/var/lib/juju/agents/unit-main-0/charm/lib/charms/opensearch/v0/models.py(39)from_dict()
-> return cls(**input_dict)
/var/lib/juju/agents/unit-main-0/charm/venv/pydantic/main.py(339)__init__()
-> values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
/var/lib/juju/agents/unit-main-0/charm/venv/pydantic/main.py(1100)validate_model()
-> values = validator(cls_, values)
> /var/lib/juju/agents/unit-main-0/charm/lib/charms/opensearch/v0/models.py(202)set_promotion_time()
-> if values["typ"] == DeploymentType.MAIN_ORCHESTRATOR:
```
The charm should handle, from the get-go, being deployed with config options set while the OpenSearch service is not yet up.
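The desired split can be sketched as follows; every name here is illustrative, not the charm's real API. `config-changed` always applies file-based plugin configuration, and only the API-backed part waits for the service:

```python
# Hypothetical sketch: config-changed that tolerates a stopped service.
class FakeCharm:
    def __init__(self, service_up: bool):
        self.service_up = service_up
        self.actions = []

    def plugin_manager_run(self):
        self.actions.append("plugin_manager.run")  # edits config files only

    def apply_api_settings(self):
        self.actions.append("api_settings")        # needs a reachable cluster

def on_config_changed(charm: FakeCharm) -> None:
    charm.plugin_manager_run()      # safe even before the first start
    if charm.service_up:
        charm.apply_api_settings()  # API work waits for the service

down = FakeCharm(service_up=False)
on_config_changed(down)
print(down.actions)  # ['plugin_manager.run']
```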