canonical / opensearch-operator

OpenSearch operator

Scaling 3 -> 0 -> 3 results in cluster stuck waiting for TLS / restart #496

Open · phvalguima opened this issue 2 weeks ago

phvalguima commented 2 weeks ago

It seems there is a functional difference between removing the application and redeploying it versus scaling it from 3 -> 0 -> 3.
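For reference, a minimal sketch of the scaling sequence that triggers this (unit numbers and the -n count are illustrative; on manually provisioned machines the add-unit call would also need --to <machine>):

$ juju remove-unit opensearch/0 opensearch/1 opensearch/2   # scale 3 -> 0
$ juju status opensearch                                    # wait until all units are gone
$ juju add-unit opensearch -n 3                             # scale 0 -> 3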

The result is a cluster stuck on (re)initializing:

$ juju status
Model       Controller        Cloud/Region      Version  SLA          Timestamp
opensearch  azure-westeurope  azure/westeurope  3.4.4    unsupported  11:21:15Z

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
opensearch                         active       3  opensearch                2/edge         185  no       
opensearch-dashboards              blocked      1  opensearch-dashboards     2/stable        22  no       Opensearch service is (partially or fully) down
self-signed-certificates           active       1  self-signed-certificates  latest/stable  155  no       
sysconfig                          active       3  sysconfig                 latest/stable   33  no       ready

Unit                         Workload     Agent      Machine  Public address  Ports     Message
opensearch-dashboards/0*     blocked      idle       1        172.18.0.14     5601/tcp  Opensearch service is (partially or fully) down
opensearch/6*                waiting      executing  8        172.18.0.19               Waiting for OpenSearch to start...
  sysconfig/67               active       idle                172.18.0.19               ready
opensearch/7                 maintenance  executing  9        172.18.0.20               Waiting for TLS to be fully configured...
  sysconfig/68               active       idle                172.18.0.20               ready
opensearch/8                 maintenance  executing  10       172.18.0.18               Waiting for TLS to be fully configured...
  sysconfig/66*              active       idle                172.18.0.18               ready
self-signed-certificates/0*  active       idle       0        172.18.0.15               

Machine  State    Address      Inst id             Base          AZ  Message
0        started  172.18.0.15  juju-a49dc1-0       ubuntu@22.04      
1        started  172.18.0.14  juju-a49dc1-1       ubuntu@22.04      
8        started  172.18.0.19  manual:172.18.0.19  ubuntu@22.04      Manually provisioned machine
9        started  172.18.0.20  manual:172.18.0.20  ubuntu@22.04      Manually provisioned machine
10       started  172.18.0.18  manual:172.18.0.18  ubuntu@22.04      Manually provisioned machine
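As a quick sanity check (not part of the original report), the stuck units can be probed on the default OpenSearch HTTP port 9200, using the addresses from the status above; getting no answer there is consistent with the "Waiting for TLS to be fully configured..." message:

$ curl -sk https://172.18.0.20:9200   # opensearch/7
$ curl -sk https://172.18.0.18:9200   # opensearch/8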

In the latter case, the app-level peer databag still contains stale data from the three old units that were removed. For example, in this show-unit output (https://pastebin.canonical.com/p/VC7vCvvPSN/) we can still see data from units that are gone, such as opensearch/0:

...

  - relation-id: 2
    endpoint: opensearch-peers
    related-endpoint: opensearch-peers
    application-data:
      admin_user_initialized: "True"
      allocation-exclusions-to-delete: opensearch-4.715,opensearch-1.715
      bootstrap_contributors_count: "3"
      bootstrapped: "True"
      client_relation_users: '{}'
      delete-voting-exclusions: opensearch-4.715,opensearch-1.715
      deployment-description: '{"app": {"id": "65b68abb-a725-4dfa-895d-13d396a49dc1/opensearch",
        "model_uuid": "65b68abb-a725-4dfa-895d-13d396a49dc1", "name": "opensearch",
        "short_id": "715"}, "config": {"cluster_name": "opensearch-hucz", "data_temperature":
        null, "init_hold": false, "profile": "production", "roles": []}, "pending_directives":
        [], "promotion_time": 1729778096.196239, "start": "start-with-generated-roles",
        "state": {"message": "", "value": "active"}, "typ": "main-orchestrator"}'
      nodes_config: '{"opensearch-0.715": {"app": {"id": "65b68abb-a725-4dfa-895d-13d396a49dc1/opensearch",
        "model_uuid": "65b68abb-a725-4dfa-895d-13d396a49dc1", "name": "opensearch",
        "short_id": "715"}, "ip": "172.18.0.12", "name": "opensearch-0.715", "roles":
        ["data", "ingest", "ml", "cluster_manager"], "temperature": null, "unit_number":
        0}, "opensearch-2.715": {"app": {"id": "65b68abb-a725-4dfa-895d-13d396a49dc1/opensearch",
        "model_uuid": "65b68abb-a725-4dfa-895d-13d396a49dc1", "name": "opensearch",
        "short_id": "715"}, "ip": "172.18.0.13", "name": "opensearch-2.715", "roles":
        ["data", "ingest", "ml", "cluster_manager"], "temperature": null, "unit_number":
        2}}'
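The snippet above comes from juju show-unit; to reproduce the check on a live model, dump the peer relation databag from one of the surviving units and look for references to the removed units (the grep pattern is only an illustration):

$ juju show-unit opensearch/6 --format yaml
$ juju show-unit opensearch/6 --format yaml | grep -E 'opensearch-[0-9]+\.715'

Note that the keys allocation-exclusions-to-delete, delete-voting-exclusions and nodes_config above all still reference opensearch-0/1/2/4, none of which exist after the scale-up.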
syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5758.

This message was autogenerated