`resume-upgrade` fails if highest unit is also the leader unit

phvalguima commented 1 month ago

The resume-upgrade fails with:

Running operation 7 with 1 task
  - task 8 on unit-failover-1

Waiting for task 8...
Action id 8 failed: Highest number unit is unhealthy. Upgrade will not resume.

This happens when the leader unit is also the unit with the highest identifier.

Using pdb, I can confirm this at the following point in the call stack:

  /var/lib/juju/agents/unit-failover-1/charm/src/charm.py(267)<module>()
-> main(OpenSearchOperatorCharm)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(544)main()
-> manager.run()
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(520)run()
-> self._emit()
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(509)_emit()
-> _emit_charm_event(self.charm, self.dispatcher.event_name)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(143)_emit_charm_event()
-> event_to_emit.emit(*args, **kwargs)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(352)emit()
-> framework._emit(event)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(851)_emit()
-> self._reemit(event_path)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(941)_reemit()
-> custom_handler(event)
  /var/lib/juju/agents/unit-failover-1/charm/src/charm.py(188)_on_resume_upgrade_action()
-> self._upgrade.reconcile_partition(action_event=event)
> /var/lib/juju/agents/unit-failover-1/charm/src/machine_upgrade.py(114)reconcile_partition()
-> unhealthy = state is not upgrade.UnitState.HEALTHY

The action fails because `state` reports:

(Pdb) state
<UnitState.UPGRADING: 'upgrading'>
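To make the failure concrete, here is a minimal sketch of the comparison at `machine_upgrade.py(114)`. The enum below is a stand-in, not the charm's real `upgrade.UnitState`: only the `UPGRADING = 'upgrading'` member is confirmed by the pdb output above, and the `HEALTHY` value and the `is_unhealthy` helper are assumptions for illustration.

```python
from enum import Enum


class UnitState(Enum):
    """Stand-in for the charm's upgrade.UnitState (illustrative only)."""

    HEALTHY = "healthy"      # assumed value, not shown in the traceback
    UPGRADING = "upgrading"  # matches the pdb output above


def is_unhealthy(state: UnitState) -> bool:
    # Same identity comparison as the line shown in the traceback:
    # any state other than HEALTHY counts as unhealthy.
    return state is not UnitState.HEALTHY


# A unit that is still UPGRADING is treated as unhealthy,
# so the action refuses to resume.
print(is_unhealthy(UnitState.UPGRADING))
print(is_unhealthy(UnitState.HEALTHY))
```

With this comparison, a highest unit that has not yet left the `UPGRADING` state always blocks `resume-upgrade`.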

Full Status:

Model                                Controller           Cloud/Region         Version  SLA          Timestamp
test-large-deployment-upgrades-36oo  localhost-localhost  localhost/localhost  3.4.2    unsupported  16:59:24+02:00

App                       Version  Status   Scale  Charm                               Channel        Rev  Exposed  Message
failover                           blocked      2  opensearch                                           1  no       Upgrading. Verify highest unit is healthy & run `resume-upgrade` action. To rollback, `juju refresh` to last revision
main                               active       1  pguimaraes-opensearch-upgrade-test  latest/edge     19  no       
opensearch                         active       3  opensearch                                           0  no       
self-signed-certificates           active       1  self-signed-certificates            latest/stable   72  no       

Unit                         Workload  Agent      Machine  Public address  Ports     Message
failover/0                   active    idle       0        10.173.208.166  9200/tcp  OpenSearch 2.12.0 running; Snap rev 40 (outdated); Charmed operator 1+3cebf31-dirty+3cebf31-dirty+3cebf31-dirty+3cebf...
failover/1*                  active    executing  1        10.173.208.236  9200/tcp  (resume-upgrade) OpenSearch 2.12.0 running; Snap rev 44; Charmed operator 1+3cebf31-dirty+3cebf31-dirty+3cebf31-dirty...
main/0*                      active    idle       2        10.173.208.119  9200/tcp  
opensearch/0                 active    idle       3        10.173.208.182  9200/tcp  
opensearch/1*                active    idle       4        10.173.208.21   9200/tcp  
opensearch/2                 active    idle       5        10.173.208.245  9200/tcp  
self-signed-certificates/0*  active    idle       6        10.173.208.15             

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.173.208.166  juju-bb32e7-0  ubuntu@22.04      Running
1        started  10.173.208.236  juju-bb32e7-1  ubuntu@22.04      Running
2        started  10.173.208.119  juju-bb32e7-2  ubuntu@22.04      Running
3        started  10.173.208.182  juju-bb32e7-3  ubuntu@22.04      Running
4        started  10.173.208.21   juju-bb32e7-4  ubuntu@22.04      Running
5        started  10.173.208.245  juju-bb32e7-5  ubuntu@22.04      Running
6        started  10.173.208.15   juju-bb32e7-6  ubuntu@22.04      Running

github-actions[bot] commented 1 month ago

https://warthogs.atlassian.net/browse/DPE-4306

phvalguima commented 1 month ago

I believe we should check whether the state is either UPGRADING or HEALTHY.
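A rough sketch of what that relaxed check might look like, assuming a simplified enum in place of the charm's `upgrade.UnitState`. The `OUTDATED` member, the `RESUMABLE_STATES` set, and the `blocks_resume` helper are hypothetical names and are not taken from the charm. Later comments in this thread attribute the failure to the cluster health check instead, so this only illustrates the proposal and is not the eventual fix.

```python
from enum import Enum


class UnitState(Enum):
    """Simplified stand-in for the charm's upgrade.UnitState."""

    HEALTHY = "healthy"      # assumed value
    UPGRADING = "upgrading"  # value seen in the pdb session
    OUTDATED = "outdated"    # hypothetical extra state for the example


# Hypothetical: states in which the highest unit would not block the resume.
RESUMABLE_STATES = frozenset({UnitState.HEALTHY, UnitState.UPGRADING})


def blocks_resume(state: UnitState) -> bool:
    # Membership test instead of a strict "is not HEALTHY" comparison.
    return state not in RESUMABLE_STATES


for state in UnitState:
    print(state.name, "blocks resume:", blocks_resume(state))
```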

phvalguima commented 1 month ago

Likewise, we have another point where that is a problem here

carlcsaposs-canonical commented 1 month ago

I don't think this is a bug

the highest unit should have upgraded & be healthy before the upgrade is resumed (without force)

carlcsaposs-canonical commented 1 month ago

for history, conclusion: issue (reason why resume-upgrade failed) was

unit-failover-1: 12:39:13 INFO unit.failover/1.juju-log Current health of cluster: ignore
unit-failover-1: 12:39:13 ERROR unit.failover/1.juju-log Cluster is not healthy after upgrade. Manual intervention required. To rollback, `juju refresh` to the previous revision

and cluster health (checked here: https://github.com/canonical/opensearch-operator/blob/6670d19650144de9b08d549554b4cb51bbb3c1f0/lib/charms/opensearch/v0/opensearch_base_charm.py#L985) should not have returned ignore

phvalguima commented 1 month ago

As discussed with @carlcsaposs-canonical, the issue was in the use of `self.health.apply`; the fix is to move to `self.health.get`.

canonical / opensearch-operator

`resume-upgrade` fails if highest unit is also the leader unit #303