canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

Upgrade fails if the leader is the highest unit #292

Closed phvalguima closed 2 months ago

phvalguima commented 2 months ago

If the leader is the highest-numbered unit in the cluster, resume-upgrade fails with: "Highest number unit is unhealthy. Upgrade will not resume."

This happens because the leader, being the highest unit, has already performed its own upgrade and moved from UnitState.HEALTHY to UnitState.UPGRADING:

-> if outdated or unhealthy:
(Pdb) l
107                 outdated = (
108                     self._unit_workload_container_versions.get(unit.name)
109                     != self._app_workload_container_version
110                 )
111                 unhealthy = state is not upgrade.UnitState.HEALTHY
112  ->             if outdated or unhealthy:
113                     if outdated:
114                         message = "Highest number unit has not upgraded yet. Upgrade will not resume."
115                     else:
116                         message = "Highest number unit is unhealthy. Upgrade will not resume."
117                     logger.debug(f"Resume upgrade event failed: {message}")
(Pdb) p unhealthy
True
(Pdb) p state
<UnitState.UPGRADING: 'upgrading'>

The check should instead be:

unhealthy = state not in [upgrade.UnitState.HEALTHY, upgrade.UnitState.UPGRADING]
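
To make the difference concrete, here is a minimal, self-contained sketch comparing the current check with the proposed one. It does not use the charm's code: the `UnitState` enum below is a stand-in for `upgrade.UnitState`, the `OTHER` member and the "healthy" value are placeholders I assumed, and `resume_blocker` is a hypothetical helper that only mirrors the branch shown in the pdb listing.

```python
import enum


class UnitState(str, enum.Enum):
    """Stand-in for upgrade.UnitState (not the charm's real definition)."""

    HEALTHY = "healthy"      # value assumed
    UPGRADING = "upgrading"  # value as printed by pdb in this issue
    OTHER = "other"          # hypothetical placeholder for any other state


# States the proposed check treats as acceptable for the highest unit.
ACCEPTED_STATES = frozenset({UnitState.HEALTHY, UnitState.UPGRADING})


def resume_blocker(state, unit_version, app_version, *, proposed=True):
    """Return the message that blocks resume-upgrade, or None if it may proceed.

    proposed=False reproduces the check from the pdb listing;
    proposed=True applies the check suggested in this issue.
    """
    outdated = unit_version != app_version
    if proposed:
        unhealthy = state not in ACCEPTED_STATES
    else:
        unhealthy = state is not UnitState.HEALTHY
    if outdated:
        return "Highest number unit has not upgraded yet. Upgrade will not resume."
    if unhealthy:
        return "Highest number unit is unhealthy. Upgrade will not resume."
    return None
```

With the current check, an up-to-date unit in UPGRADING is reported as unhealthy; with the proposed check the same unit no longer blocks the resume, while any other non-healthy state still does.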

github-actions[bot] commented 2 months ago

https://warthogs.atlassian.net/browse/DPE-4283

phvalguima commented 2 months ago

I could not reproduce this issue on another run. Debug logs attached: debug-log.txt