canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

Plugin manager doesn't handle case where charm deployed with config options set #280

Open Mehdi-Bendriss opened 2 months ago

Mehdi-Bendriss commented 2 months ago

The charm should handle, from the get-go, being deployed with config options already set while the opensearch service is not yet up.

github-actions[bot] commented 2 months ago

https://warthogs.atlassian.net/browse/DPE-4251

phvalguima commented 2 months ago

@Mehdi-Bendriss indeed, I've opened a PR to shuffle the checks on plugin manager.

phvalguima commented 2 months ago

Doing some digging @Mehdi-Bendriss

Back in the day, we decided to wait for the OpenSearch cluster because we needed: (1) to cover the case where some plugins manage things via API calls; (2) to know the opensearch version; and (3) to load the default settings from /_cluster/settings. So, we needed the cluster up and running.

I am breaking this up into:

  1. Any plugin that needs to manage things via API call should check the health of the cluster using check_plugin_manager_health
  2. Moving opensearch_distro.version to read the workload_version file we already have present instead of making an API call
  3. We will waive the need to load the default settings if this particular unit is powered down. That makes sense: at that moment we can make any config changes, as the unit will eventually be powered back up

That frees config_changed to just call plugin_manager.run() before everything is set up, as the run() method changes hard configuration only.
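For illustration, points 2 and 3 could look roughly like the sketch below. This is not the charm's code: `read_workload_version`, `load_default_settings`, the file location, and the `2.12.0` version string are all hypothetical placeholders, assuming only that the version lives in a plain-text file and that default settings come from a callable that hits the API.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional


def read_workload_version(version_file: Path) -> str:
    """Read the version from a local file instead of asking the cluster (hypothetical helper)."""
    version = version_file.read_text(encoding="utf-8").strip()
    if not version:
        raise ValueError(f"{version_file} is empty")
    return version


def load_default_settings(
    node_is_up: bool, fetch_from_api: Callable[[], Dict[str, str]]
) -> Optional[Dict[str, str]]:
    """Return the defaults from the API, or None when the node is down (hypothetical helper)."""
    if not node_is_up:
        # Nothing to query; config can still be written and is picked up at next start.
        return None
    return fetch_from_api()


# Demo with a temporary file standing in for the real workload_version file.
with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "workload_version"
    version_file.write_text("2.12.0\n", encoding="utf-8")  # placeholder version
    version = read_workload_version(version_file)

defaults_when_down = load_default_settings(False, lambda: {"some.setting": "value"})
```

With both helpers free of any hard dependency on a running service, a config-changed handler could call them before the first start completes.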

phvalguima commented 2 months ago

That is going to demand some attention on OpenSearchKeystore. We now have to account for the case where the keystore was not yet created, because the 1st start is not finished. In this case, we need to save the keystore password for later and pass it to opensearch at startup time instead of leaving opensearch to manage that.

phvalguima commented 2 months ago

Whilst I agree we should deal with plugin_manager configuration independently of whether the cluster is ready, I did quite some digging into this issue and found that we are stuck in an endless loop: opensearch-peers-changed hook issued > calls deferred config-changed > changes the content of peer databag > gets deferred > retriggers a new peers-changed


phvalguima commented 2 months ago

Now, this is caused by a change that happens within opensearch_peers_relation_changed. At the deferred config-changed, the peer databag starts with:

'deployment-description': '{"config": {"cluster_name": "backup-test", "init_hold": false, "roles": ["cluster_manager"], "data_temperature": null}, "start": "start-with-provided-roles", "pending_directives": [], "typ": "main-orchestrator", "app": "main", "state": {"value": "active", "message": ""}, "promotion_time": 1714724984.931757}'

And finishes with:

'deployment-description': '{"config": {"cluster_name": "backup-test", "init_hold": false, "roles": ["cluster_manager"], "data_temperature": null}, "start": "start-with-provided-roles", "pending_directives": [], "typ": "main-orchestrator", "app": "main", "state": {"value": "active", "message": ""}, "promotion_time": 1714724436.7171}'
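A quick way to confirm what changed between the two snapshots is to parse both JSON payloads and compare them key by key. The snippet below is only a debugging sketch; `changed_keys` is a hypothetical helper, and the two strings are the values copied from the databag above.

```python
import json

BEFORE = (
    '{"config": {"cluster_name": "backup-test", "init_hold": false, '
    '"roles": ["cluster_manager"], "data_temperature": null}, '
    '"start": "start-with-provided-roles", "pending_directives": [], '
    '"typ": "main-orchestrator", "app": "main", '
    '"state": {"value": "active", "message": ""}, '
    '"promotion_time": 1714724984.931757}'
)
AFTER = (
    '{"config": {"cluster_name": "backup-test", "init_hold": false, '
    '"roles": ["cluster_manager"], "data_temperature": null}, '
    '"start": "start-with-provided-roles", "pending_directives": [], '
    '"typ": "main-orchestrator", "app": "main", '
    '"state": {"value": "active", "message": ""}, '
    '"promotion_time": 1714724436.7171}'
)


def changed_keys(before: str, after: str) -> dict:
    """Map each top-level key whose value differs to its (before, after) pair."""
    old, new = json.loads(before), json.loads(after)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }


diff = changed_keys(BEFORE, AFTER)
print(diff)  # {'promotion_time': (1714724984.931757, 1714724436.7171)}
```

Only promotion_time differs between the two snapshots.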

That is caused by this: https://github.com/canonical/opensearch-operator/blob/46bc4a8f46905228a576c97d29de1572dc141bdd/lib/charms/opensearch/v0/models.py#L200

Which wrongly resets the promotion_time.
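One possible shape for a fix, sketched without pydantic so it runs standalone: only stamp promotion_time when it is missing, so re-parsing the databag content leaves a stored value untouched. This assumes the validator currently assigns the field whenever typ is the main orchestrator; the function below mirrors a root validator's `values` dict but is not the code in models.py.

```python
import time
from typing import Any, Dict

# "typ" value seen in the databag above; in the charm this is DeploymentType.MAIN_ORCHESTRATOR.
MAIN_ORCHESTRATOR = "main-orchestrator"


def set_promotion_time(values: Dict[str, Any]) -> Dict[str, Any]:
    """Stamp promotion_time once; keep an already stored value as-is (illustrative only)."""
    if values.get("typ") == MAIN_ORCHESTRATOR and values.get("promotion_time") is None:
        values["promotion_time"] = time.time()
    return values


# Re-parsing a stored description must not change its promotion_time.
stored = {"typ": MAIN_ORCHESTRATOR, "promotion_time": 1714724984.931757}
reparsed = set_promotion_time(dict(stored))

# A freshly promoted orchestrator with no timestamp yet gets one.
fresh = set_promotion_time({"typ": MAIN_ORCHESTRATOR})
```

With a guard like this, the deferred config-changed would read and write back identical databag content, which should stop it from retriggering peers-changed.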

Full stack trace:

  /var/lib/juju/agents/unit-main-0/charm/src/charm.py(264)<module>()
-> main(OpenSearchOperatorCharm)
  /var/lib/juju/agents/unit-main-0/charm/venv/ops/main.py(544)main()
-> manager.run()
  /var/lib/juju/agents/unit-main-0/charm/venv/ops/main.py(520)run()
-> self._emit()
  /var/lib/juju/agents/unit-main-0/charm/venv/ops/main.py(506)_emit()
-> self.framework.reemit()
  /var/lib/juju/agents/unit-main-0/charm/venv/ops/framework.py(859)reemit()
-> self._reemit()
  /var/lib/juju/agents/unit-main-0/charm/venv/ops/framework.py(939)_reemit()
-> custom_handler(event)
  /var/lib/juju/agents/unit-main-0/charm/lib/charms/opensearch/v0/opensearch_base_charm.py(617)_on_config_changed()
-> previous_deployment_desc = self.opensearch_peer_cm.deployment_desc()
  /var/lib/juju/agents/unit-main-0/charm/lib/charms/opensearch/v0/opensearch_peer_clusters.py(323)deployment_desc()
-> return DeploymentDescription.from_dict(current_deployment_desc)
  /var/lib/juju/agents/unit-main-0/charm/lib/charms/opensearch/v0/models.py(39)from_dict()
-> return cls(**input_dict)
  /var/lib/juju/agents/unit-main-0/charm/venv/pydantic/main.py(339)__init__()
-> values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
  /var/lib/juju/agents/unit-main-0/charm/venv/pydantic/main.py(1100)validate_model()
-> values = validator(cls_, values)
> /var/lib/juju/agents/unit-main-0/charm/lib/charms/opensearch/v0/models.py(202)set_promotion_time()
-> if values["typ"] == DeploymentType.MAIN_ORCHESTRATOR: