canonical / opensearch-operator

Lock is not released when scaling down to zero units #309

Closed: reneradoi closed this issue 1 month ago

reneradoi commented 1 month ago

Steps to reproduce

Expected behavior

Both new units start up correctly.

Actual behavior

The 2nd new unit can't acquire the lock and does not start:

$ juju status --relations --storage
Model       Controller  Cloud/Region         Version  SLA          Timestamp
opensearch  opensearch  localhost/localhost  3.1.8    unsupported  08:08:21Z

App                       Version  Status  Scale  Charm                     Channel  Rev  Exposed  Message
opensearch                         active      2  opensearch                           0  no       
self-signed-certificates           active      1  self-signed-certificates  stable    72  no       

Unit                         Workload  Agent  Machine  Public address  Ports     Message
opensearch/2*                active    idle   3        10.27.170.215   9200/tcp  
opensearch/3                 waiting   idle   4        10.27.170.204             Requesting lock on operation: start
self-signed-certificates/0*  active    idle   2        10.27.170.207             

Machine  State    Address        Inst id        Base          AZ  Message
2        started  10.27.170.207  juju-a1b76d-2  ubuntu@22.04      Running
3        started  10.27.170.215  juju-a1b76d-3  ubuntu@22.04      Running
4        started  10.27.170.204  juju-a1b76d-4  ubuntu@22.04      Running

Integration provider                   Requirer                       Interface           Type     Message
opensearch:node-lock-fallback          opensearch:node-lock-fallback  node_lock_fallback  peer     
opensearch:opensearch-peers            opensearch:opensearch-peers    opensearch_peers    peer     
opensearch:upgrade-version-a           opensearch:upgrade-version-a   upgrade             peer     
self-signed-certificates:certificates  opensearch:certificates        tls-certificates    regular  

Storage Unit  Storage ID         Type        Pool                Mountpoint                   Size     Status    Message
opensearch/2  opensearch-data/1  filesystem  opensearch-storage  /var/snap/opensearch/common  1.0 GiB  attached  
opensearch/3  opensearch-data/0  filesystem  opensearch-storage  /var/snap/opensearch/common  1.0 GiB  attached  

Versions

Operating system: Ubuntu 24.04 LTS, Ubuntu 22.04 LTS
Juju CLI: 3.1.8-genericlinux-amd64
Juju agent: 3.1.8
Charm revision: 47

Log output

Juju debug log: nothing relevant in the logs

Additional context

$ jhack show-relation opensearch:node-lock-fallback opensearch:node-lock-fallback
                               relation data v0.6                               
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ peer relation (id: 1) ┃ opensearch                                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ type                  │ peer                                                 │
│ interface             │ node_lock_fallback                                   │
│ model                 │ the current model                                    │
│ relation ID           │ 1                                                    │
│ endpoint              │ node-lock-fallback                                   │
│ leader unit           │ 2                                                    │
├───────────────────────┼──────────────────────────────────────────────────────┤
│ application data      │ ╭──────────────────────────────────────────────────╮ │
│                       │ │                                                  │ │
│                       │ │  unit-with-lock  opensearch/3                    │ │
│                       │ ╰──────────────────────────────────────────────────╯ │
│ unit data             │ ╭─ opensearch/opensearch/2 ─╮                        │
│                       │ │ <empty>                   │                        │
│                       │ ╰───────────────────────────╯                        │
│                       │ ╭─ opensearch/opensearch/3 ─╮                        │
│                       │ │                           │                        │
│                       │ │  lock-requested  true     │                        │
│                       │ ╰───────────────────────────╯                        │
└───────────────────────┴──────────────────────────────────────────────────────┘

The lock in the database is still there, even though the unit opensearch/1 is long gone:

$ curl -XGET "https://admin:[password]@10.27.170.215:9200/.charm_node_lock/_source/0" -k
{"unit-name": "opensearch/1"}
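
To make the mismatch explicit, here is a minimal sketch of how one could check whether the lock document names a unit that no longer exists. The function name `stale_lock_holder` and the way the unit list is supplied are assumptions for illustration only; this is not how the charm's locking library does it, and the unit names are copied by hand from the `juju status` output above.

```python
import json
from typing import Optional, Set


def stale_lock_holder(lock_document: str, current_units: Set[str]) -> Optional[str]:
    """Return the unit named in the lock document if it is not a current unit.

    lock_document: the JSON body returned for the lock document
                   (the `_source` response shown above).
    current_units: unit names currently present in the model
                   (hypothetical input, e.g. taken from `juju status`).
    """
    holder = json.loads(lock_document).get("unit-name")
    if holder is not None and holder not in current_units:
        # The lock is held by a unit that has been removed: it is stale.
        return holder
    return None


# Values taken from this report: the lock names opensearch/1,
# but only opensearch/2 and opensearch/3 exist.
lock_doc = '{"unit-name": "opensearch/1"}'
units = {"opensearch/2", "opensearch/3"}
print(stale_lock_holder(lock_doc, units))
```

With the values from this report the function returns the departed unit, which matches the observation that opensearch/3 keeps waiting for a lock that nobody can release.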

This could be related to this known limitation of the locking mechanism: https://github.com/canonical/opensearch-operator/blob/main/lib/charms/opensearch/v0/opensearch_locking.py#L334-L335

github-actions[bot] commented 1 month ago

https://warthogs.atlassian.net/browse/DPE-4421

carlcsaposs-canonical commented 1 month ago

I believe this issue was caused by this change: https://github.com/canonical/opensearch-operator/pull/279/files#diff-f0ef92b3c155cc488340483aeb7a14f44e6577ccb3a328d8856fab4984072e57L248-R248

> Juju debug log: nothing relevant in the logs

but it's hard to tell without debug-level logs (i.e. "[Node lock]" logs)

reneradoi commented 1 month ago

This was solved with https://github.com/canonical/opensearch-operator/pull/312