canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

Stale peer relation unit databag #482

Closed: skourta closed this issue 1 day ago

skourta commented 3 days ago

Steps to reproduce

The bug is flaky and does not reproduce on every run.

  1. juju add-model opensearch
  2. juju model-config --file ./cloudinit-userdata.yaml
  3. juju deploy self-signed-certificates --config ca-common-name="CN_CA"
  4. juju deploy ./opensearch_ubuntu-22.04-amd64.charm -n 3
  5. juju integrate self-signed-certificates opensearch (wait for things to settle and for OpenSearch to start)
  6. juju config self-signed-certificates ca-common-name="NEW_CA"

Expected behavior

The units reboot one after the other, each applying the new CA.

Actual behavior

The units do not reboot in order. For example: unit 0 applies the new CA -> unit 1 applies the new CA -> unit 0 applies its new certs before unit 2 has applied the new CA.
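To make the ordering problem concrete, here is a minimal sketch that checks a sequence of events against the rule implied by this report: no unit should apply its new certs until every unit has applied the new CA. The event encoding (`"ca"` / `"certs"`) and the function name are made up for illustration and are not taken from the charm.

```python
from typing import List, Tuple

# Each event is (unit_number, phase). The phase labels "ca" and "certs" are
# invented for this sketch; they are not identifiers from the charm.
Event = Tuple[int, str]


def first_ordering_violation(events: List[Event], unit_count: int) -> int:
    """Return the index of the first "certs" event that happens before every
    unit has applied the new CA, or -1 if the sequence respects the rule."""
    units_with_new_ca = set()
    for index, (unit, phase) in enumerate(events):
        if phase == "ca":
            units_with_new_ca.add(unit)
        elif phase == "certs" and len(units_with_new_ca) < unit_count:
            return index
    return -1


# Sequence described in this issue: unit 0 renews certs before unit 2 has the CA.
observed = [(0, "ca"), (1, "ca"), (0, "certs"), (2, "ca")]
# Sequence the rollout is expected to follow.
expected = [(0, "ca"), (1, "ca"), (2, "ca"), (0, "certs"), (1, "certs"), (2, "certs")]
```

With three units, `first_ordering_violation(observed, 3)` points at the third event (index 2), which is the premature cert renewal on unit 0.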

This traces back to stale peer databag data being read by the check that determines whether any other node is still rotating its CA.

unit-opensearch-0: 06:58:01 DEBUG unit.opensearch/0.juju-log Checking if CA rotation is ongoing unit: opensearch/0
unit-opensearch-0: 06:58:01 DEBUG unit.opensearch/0.juju-log tls_ca_renewing: True             | tls_ca_renewed: True
unit-opensearch-0: 06:58:01 DEBUG unit.opensearch/0.juju-log Checking CA rotation checking relation <ops.model.Relation opensearch-peers:2>: units: {<ops.model.Unit opensearch/2>, <ops.model.Unit opensearch/1>}
unit-opensearch-0: 06:58:01 DEBUG unit.opensearch/0.juju-log Relation <ops.model.Relation opensearch-peers:2> data: {<ops.model.Unit opensearch/0>: {'egress-subnets': '10.176.246.147/32', 'ingress-address': '10.176.246.147', 'opensearch:unit:0:unit-http': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70ceaqn39oag60v0', 'opensearch:unit:0:unit-transport': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70ceaqn39oag60ug', 'private-address': '10.176.246.147', 'started': 'True', 'tls_ca_renewed': 'True', 'tls_ca_renewing': 'True', 'tls_configured': 'True'}, <ops.model.Application opensearch>: {'admin_user_initialized': 'True', 'bootstrap_contributors_count': '1', 'bootstrapped': 'True', 'client_relation_users': '{}', 'deployment-description': '{"app": {"id": "c20b5225-61e8-425f-8ed1-a5a69cee7dcd/opensearch", "model_uuid": "c20b5225-61e8-425f-8ed1-a5a69cee7dcd", "name": "opensearch", "short_id": "6aa"}, "config": {"cluster_name": "opensearch-rc9s", "data_temperature": null, "init_hold": false, "roles": []}, "pending_directives": [], "promotion_time": 1729061759.823109, "start": "start-with-generated-roles", "state": {"message": "", "value": "active"}, "typ": "main-orchestrator"}', 'nodes_config': '{"opensearch-0.6aa": {"app": {"id": "c20b5225-61e8-425f-8ed1-a5a69cee7dcd/opensearch", "model_uuid": "c20b5225-61e8-425f-8ed1-a5a69cee7dcd", "name": "opensearch", "short_id": "6aa"}, "ip": "10.176.246.147", "name": "opensearch-0.6aa", "roles": ["data", "ingest", "ml", "cluster_manager"], "temperature": null, "unit_number": 0}, "opensearch-1.6aa": {"app": {"id": "c20b5225-61e8-425f-8ed1-a5a69cee7dcd/opensearch", "model_uuid": "c20b5225-61e8-425f-8ed1-a5a69cee7dcd", "name": "opensearch", "short_id": "6aa"}, "ip": "10.176.246.143", "name": "opensearch-1.6aa", "roles": ["data", "ingest", "ml", "cluster_manager"], "temperature": null, "unit_number": 1}, "opensearch-2.6aa": {"app": {"id": "c20b5225-61e8-425f-8ed1-a5a69cee7dcd/opensearch", "model_uuid": 
"c20b5225-61e8-425f-8ed1-a5a69cee7dcd", "name": "opensearch", "short_id": "6aa"}, "ip": "10.176.246.214", "name": "opensearch-2.6aa", "roles": ["data", "ingest", "ml", "cluster_manager"], "temperature": null, "unit_number": 2}}', 'opensearch:app:admin-password': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70keaqn39oag610g', 'opensearch:app:admin-password-hash': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70keaqn39oag6110', 'opensearch:app:app-admin': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70ceaqn39oag60u0', 'opensearch:app:kibanaserver-password': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70keaqn39oag60vg', 'opensearch:app:kibanaserver-password-hash': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70keaqn39oag6100', 'opensearch:app:monitor-password': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m79seaqn39oag6170', 'security_index_initialised': 'True'}, <ops.model.Unit opensearch/2>: {'egress-subnets': '10.176.246.214/32', 'ingress-address': '10.176.246.214', 'opensearch:unit:2:unit-http': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m74keaqn39oag615g', 'opensearch:unit:2:unit-transport': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m74keaqn39oag6150', 'private-address': '10.176.246.214', 'started': 'True', 'tls_configured': 'True'}, <ops.model.Unit opensearch/1>: {'egress-subnets': '10.176.246.143/32', 'ingress-address': '10.176.246.143', 'opensearch:unit:1:unit-http': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70seaqn39oag6120', 'opensearch:unit:1:unit-transport': 'secret://c20b5225-61e8-425f-8ed1-a5a69cee7dcd/cs7m70seaqn39oag611g', 'private-address': '10.176.246.143', 'started': 'True', 'tls_configured': 'True'}}
unit-opensearch-0: 06:58:01 DEBUG unit.opensearch/0.juju-log Unit <ops.model.Unit opensearch/2>: tls_ca_renewing: None | tls_ca_renewed: None
unit-opensearch-0: 06:58:01 DEBUG unit.opensearch/0.juju-log Unit <ops.model.Unit opensearch/1>: tls_ca_renewing: None | tls_ca_renewed: None
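The log lines above suggest a check of roughly the following shape. This is a sketch under assumptions, not the charm's actual code: the databags are modelled as plain dicts, the flag names are taken from the log, and the function name and the exact condition are hypothetical. It shows why a stale view (flags missing for the peers) lets unit 0 conclude that nobody else is still rotating.

```python
from typing import Dict, Optional

# Unit databags modelled as plain dicts keyed by unit name. The flag names come
# from the log above; the function and its condition are assumptions.
Databag = Dict[str, str]


def ca_rotation_ongoing_elsewhere(databags: Dict[str, Databag], local_unit: str) -> bool:
    """True if any other unit has started the CA rotation but not finished it."""
    for unit, data in databags.items():
        if unit == local_unit:
            continue
        renewing: Optional[str] = data.get("tls_ca_renewing")
        renewed: Optional[str] = data.get("tls_ca_renewed")
        if renewing == "True" and renewed != "True":
            return True
    return False


# What opensearch/0 read at 06:58:01: the flags are missing for both peers.
stale_view = {
    "opensearch/0": {"tls_ca_renewing": "True", "tls_ca_renewed": "True"},
    "opensearch/1": {"started": "True", "tls_configured": "True"},
    "opensearch/2": {"started": "True", "tls_configured": "True"},
}
# Hypothetical up-to-date view in which the peers have already set the flag.
fresh_view = {
    "opensearch/0": {"tls_ca_renewing": "True", "tls_ca_renewed": "True"},
    "opensearch/1": {"tls_ca_renewing": "True"},
    "opensearch/2": {"tls_ca_renewing": "True"},
}
```

Against `stale_view` the check returns `False`, so unit 0 would move on to renewing its certs; against `fresh_view` it returns `True` and unit 0 would wait.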

At this point tls_ca_renewing should already be set on the other units, which can be confirmed by running juju show-unit at that stage.
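For comparing what Juju reports with what the hook saw, a small helper along these lines can pull one key out of the peer databags in `juju show-unit ... --format json` output. The JSON layout used here (`relation-info`, `endpoint`, `related-units`, `data`) is my understanding of that output and should be treated as an assumption; the sample document is hand-written for the sketch, not captured from a real model, and `peer_flags` is a hypothetical name.

```python
import json
from typing import Dict, Optional


def peer_flags(show_unit_json: str, unit: str, endpoint: str, key: str) -> Dict[str, Optional[str]]:
    """Collect `key` from every related unit's databag on `endpoint`.

    Units that have not set the key map to None.
    """
    info = json.loads(show_unit_json)[unit]
    flags: Dict[str, Optional[str]] = {}
    for relation in info.get("relation-info", []):
        if relation.get("endpoint") != endpoint:
            continue
        for name, related in relation.get("related-units", {}).items():
            flags[name] = related.get("data", {}).get(key)
    return flags


# Hand-written sample in the assumed layout (not real output).
sample = json.dumps({
    "opensearch/0": {
        "relation-info": [
            {
                "endpoint": "opensearch-peers",
                "related-units": {
                    "opensearch/1": {"data": {"tls_ca_renewing": "True"}},
                    "opensearch/2": {"data": {"started": "True"}},
                },
            }
        ]
    }
})

flags = peer_flags(sample, "opensearch/0", "opensearch-peers", "tls_ca_renewing")
```

In practice the input would be the captured output of `juju show-unit opensearch/0 --format json`, taken at the moment the hook logs `None` for the peers.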

Versions

Operating system: Ubuntu 24.04.1 LTS

Juju CLI: 3.5.4-genericlinux-amd64

Juju agent: 3.5.3

Charm revision: 2/edge branch

LXD: 5.21.2 LTS

Log output

Juju debug log:

Additional context

logs_bug_happening.log logs_bug_not_happening.log

syncronize-issues-to-jira[bot] commented 3 days ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5671.

This message was autogenerated