canonical / charm-rolling-ops

[Red Herring Potential] Non-Leader Unit is running ProcessLocks #16

Open phvalguima opened 8 months ago

phvalguima commented 8 months ago

In run: https://github.com/canonical/opensearch-operator/actions/runs/8032823682/job/21942692208

I can see the following status:

Model                         Controller           Cloud/Region         Version  SLA          Timestamp
test-horizontal-scaling-ri5x  github-pr-1b962-lxd  localhost/localhost  3.1.7    unsupported  21:23:10Z

App                       Version  Status  Scale  Charm                     Channel  Rev  Exposed  Message
opensearch                         active      5  opensearch                           0  no       
self-signed-certificates           active      1  self-signed-certificates  stable    72  no       

Unit                         Workload  Agent  Machine  Public address  Ports  Message
opensearch/3*                active    idle   4        10.217.138.76          
opensearch/4                 active    idle   5        10.217.138.124         
opensearch/6                 active    idle   7        10.217.138.64          
opensearch/7                 waiting   idle   8        10.217.138.239         Awaiting service operation
opensearch/8                 active    idle   9        10.217.138.160         
self-signed-certificates/0*  active    idle   0        10.217.138.75          

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.217.138.75   juju-29c99e-0  ubuntu@22.04      Running
4        started  10.217.138.76   juju-29c99e-4  ubuntu@22.04      Running
5        started  10.217.138.124  juju-29c99e-5  ubuntu@22.04      Running
7        started  10.217.138.64   juju-29c99e-7  ubuntu@22.04      Running
8        started  10.217.138.239  juju-29c99e-8  ubuntu@22.04      Running
9        started  10.217.138.160  juju-29c99e-9  ubuntu@22.04      Running

Integration provider                   Requirer                     Interface         Type     Message
opensearch:opensearch-peers            opensearch:opensearch-peers  opensearch_peers  peer     
opensearch:service                     opensearch:service           rolling_op        peer     
self-signed-certificates:certificates  opensearch:certificates      tls-certificates  regular  

Storage Unit  Storage ID         Type        Pool    Mountpoint                   Size     Status    Message
opensearch/3  opensearch-data/3  filesystem  rootfs  /var/snap/opensearch/common  145 GiB  attached  
opensearch/4  opensearch-data/4  filesystem  rootfs  /var/snap/opensearch/common  145 GiB  attached  
opensearch/6  opensearch-data/6  filesystem  rootfs  /var/snap/opensearch/common  145 GiB  attached  
opensearch/7  opensearch-data/7  filesystem  rootfs  /var/snap/opensearch/common  145 GiB  attached  
opensearch/8  opensearch-data/8  filesystem  rootfs  /var/snap/opensearch/common  145 GiB  attached  

The unit opensearch/7 is never promoted to leader: https://pastebin.ubuntu.com/p/wBgdY8m4Vz/

However, I can see in its logs:

unit-opensearch-7: 2024-02-24 20:49:49 DEBUG unit.opensearch/7.juju-log service:0: Deferring <RunWithLock via OpenSearchOperatorCharm/on/service_run_with_lock[244]>.
unit-opensearch-7: 2024-02-24 20:49:49 DEBUG unit.opensearch/7.juju-log service:0: Emitting custom event <ProcessLocks via OpenSearchOperatorCharm/on/service_process_locks[245]>.

That happens because of this logic: https://github.com/canonical/charm-rolling-ops/blob/4bae5c031cb7a8d5fd3819ace1f1496c87c0aae4/lib/charms/rolling_ops/v0/rollingops.py#L408

It compares the unit of the lock created within RunWithLock, which is Lock(self).unit (the unit's own), with self.model.unit. According to the docstring in the operator framework, self.model.unit is: https://github.com/canonical/operator/blob/1836df5affb42b3183125b1904c794090aa1862b/ops/model.py#L132

The unit that is running this code

Therefore, that comparison should always be true, on every unit, whether or not it is the leader.

I consider this error a red herring only because the ProcessLocks routine checks whether the unit is the leader right at its beginning. So the only real issue is that unneeded ProcessLocks events are emitted on all the non-leader units.
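
To illustrate the reasoning above, here is a minimal, self-contained sketch. It does not use the real rolling-ops library or ops classes: `Unit`, `Lock`, `Manager` and `simulate` are hypothetical stand-ins that only mimic the behaviour described in this issue (a lock built for the unit itself, compared against the unit running the code, followed by a leader check at the start of the ProcessLocks handler).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Unit:
    """Hypothetical stand-in for a Juju unit (not ops.model.Unit)."""

    name: str
    leader: bool = False

    def is_leader(self) -> bool:
        return self.leader


@dataclass
class Lock:
    """Hypothetical stand-in: the lock is created for the unit running the code."""

    unit: Unit


@dataclass
class Manager:
    """Hypothetical stand-in for the rolling-ops handlers on a single unit."""

    unit: Unit
    emitted: list = field(default_factory=list)
    processed: list = field(default_factory=list)

    def on_run_with_lock(self) -> None:
        lock = Lock(self.unit)
        # Own unit compared with own unit: true on every unit, leader or not.
        if lock.unit == self.unit:
            self.emitted.append("process_locks")
            self.on_process_locks()

    def on_process_locks(self) -> None:
        # Assumed leader guard at the top of the handler, as described above.
        if not self.unit.is_leader():
            return
        self.processed.append("process_locks")


def simulate(name: str, leader: bool) -> tuple[int, int]:
    """Return (events emitted, events actually processed) for one unit."""
    manager = Manager(Unit(name, leader))
    manager.on_run_with_lock()
    return len(manager.emitted), len(manager.processed)


if __name__ == "__main__":
    print(simulate("opensearch/7", leader=False))
    print(simulate("opensearch/3", leader=True))
```

Under these assumptions, the non-leader still emits the event but does no work in the handler, which matches the log lines above: the emission is noise rather than a functional bug.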

github-actions[bot] commented 8 months ago

https://warthogs.atlassian.net/browse/DPE-3668