canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

[Upgrade] Intermittent failure: upgrade gets stuck due to cluster in yellow state #350

Closed phvalguima closed 1 month ago

phvalguima commented 2 months ago

I am running an upgrade, which got stuck while upgrading the last unit. Current status:

Model                                Controller           Cloud/Region         Version  SLA          Timestamp
test-small-deployment-upgrades-d1el  localhost-localhost  localhost/localhost  3.4.4    unsupported  19:57:46+02:00

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
opensearch                         blocked      3  opensearch                                 0  no       Upgrading. Verify highest unit is healthy & run `resume-upgrade` action. To rollback, `juju refresh` to last revision
self-signed-certificates           active       1  self-signed-certificates  latest/stable  155  no       

Unit                         Workload  Agent  Machine  Public address  Ports     Message
opensearch/0*                active    idle   0        10.115.236.8    9200/tcp  OpenSearch 2.12.0 running; Snap rev 44 (outdated); Charmed operator 1+8ac22f7+8ac22f7-dirty
opensearch/1                 active    idle   1        10.115.236.188  9200/tcp  OpenSearch 2.12.0 running; Snap rev 44 (outdated); Charmed operator 1+8ac22f7+8ac22f7-dirty
opensearch/2                 blocked   idle   2        10.115.236.27   9200/tcp  Rollback with `juju refresh`. Pre-upgrade check failed: Cluster health is yellow instead of green
self-signed-certificates/0*  active    idle   3        10.115.236.97             

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.115.236.8    juju-4193d3-0  ubuntu@22.04      Running
1        started  10.115.236.188  juju-4193d3-1  ubuntu@22.04      Running
2        started  10.115.236.27   juju-4193d3-2  ubuntu@22.04      Running
3        started  10.115.236.97   juju-4193d3-3  ubuntu@22.04      Running

When looking at the shards, I can see:

series_index                     0 p STARTED    270  1.1mb 10.115.236.188 opensearch-1
series_index                     0 r UNASSIGNED
.plugins-ml-config               0 r STARTED      1  3.9kb 10.115.236.188 opensearch-1
.plugins-ml-config               0 p STARTED      1  3.9kb 10.115.236.8   opensearch-0
.opensearch-observability        0 r STARTED      0   208b 10.115.236.188 opensearch-1
.opensearch-observability        0 p STARTED      0   208b 10.115.236.8   opensearch-0
.opensearch-sap-log-types-config 0 r STARTED               10.115.236.188 opensearch-1
.opensearch-sap-log-types-config 0 p STARTED               10.115.236.8   opensearch-0
.opendistro_security             0 r STARTED     10 50.2kb 10.115.236.188 opensearch-1
.opendistro_security             0 p STARTED     10 61.1kb 10.115.236.8   opensearch-0
.charm_node_lock                 0 r STARTED      1  3.9kb 10.115.236.188 opensearch-1
.charm_node_lock                 0 p STARTED      1  3.9kb 10.115.236.8   opensearch-0
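The stuck replica stands out in the `_cat/shards` listing above. As an illustration (not part of the charm), a small parser can pick out unassigned shards from that output; the sample line is taken verbatim from the listing:

```python
def find_unassigned(cat_shards_output: str) -> list[tuple[str, int, str]]:
    """Return (index, shard, prirep) tuples for shards in UNASSIGNED state."""
    unassigned = []
    for line in cat_shards_output.splitlines():
        fields = line.split()
        # _cat/shards columns: index shard prirep state [docs store ip node]
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            unassigned.append((fields[0], int(fields[1]), fields[2]))
    return unassigned

sample = """series_index 0 p STARTED 270 1.1mb 10.115.236.188 opensearch-1
series_index 0 r UNASSIGNED"""
print(find_unassigned(sample))  # [('series_index', 0, 'r')]
```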

The replica shard cannot be reassigned from the old opensearch-2 to opensearch-1 because the cluster is in the middle of the upgrade process, during which allocation is restricted to primary shards only:

{
  "index": "series_index",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2024-07-04T17:45:23.497Z",
    "details": "node_left [pOfiAYQVR2ulZxhU2IZMrg]",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "CUMMKezDSoyq7gy-DBheTQ",
      "node_name": "opensearch-0",
      "transport_address": "10.115.236.8:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "enable",
          "decision": "NO",
          "explanation": "replica allocations are forbidden due to cluster setting [cluster.routing.allocation.enable=primaries]"
        }
      ]
    },
    {
      "node_id": "mB6bq4wxTZG4LHQXUKODWg",
      "node_name": "opensearch-1",
      "transport_address": "10.115.236.188:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "enable",
          "decision": "NO",
          "explanation": "replica allocations are forbidden due to cluster setting [cluster.routing.allocation.enable=primaries]"
        },
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[series_index][0], node[mB6bq4wxTZG4LHQXUKODWg], [P], s[STARTED], a[id=AZbeqvTdRDSggGXzi72hfw]]"
        }
      ]
    }
  ]
}
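The explain output can be summarized by walking `node_allocation_decisions` and collecting the deciders that voted NO. A minimal sketch (the JSON literal is abridged from the response above; this helper is illustrative, not charm code):

```python
import json

def blocking_deciders(explain: dict) -> dict[str, list[str]]:
    """Map node name -> explanations of deciders that returned NO."""
    result = {}
    for node in explain.get("node_allocation_decisions", []):
        reasons = [d["explanation"] for d in node.get("deciders", [])
                   if d["decision"] == "NO"]
        if reasons:
            result[node["node_name"]] = reasons
    return result

explain = json.loads("""{
  "can_allocate": "no",
  "node_allocation_decisions": [
    {"node_name": "opensearch-0",
     "deciders": [{"decider": "enable", "decision": "NO",
       "explanation": "replica allocations are forbidden due to cluster setting [cluster.routing.allocation.enable=primaries]"}]}
  ]
}""")
for node, reasons in blocking_deciders(explain).items():
    print(node, "->", reasons)
```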

We should add a flag to the precheck: accept both `Health.GREEN` and `Health.YELLOW` status if we are in the middle of the upgrade, but only `Health.GREEN` if we are running the action right before the upgrade itself.
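A sketch of what that flag could look like; `Health` and `health_is_acceptable` are illustrative names, not the charm's actual API:

```python
from enum import Enum

class Health(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"

def health_is_acceptable(health: Health, upgrade_in_progress: bool) -> bool:
    """Accept YELLOW only while an upgrade is already underway.

    Before the upgrade starts, require GREEN so we never begin with
    under-replicated shards. Mid-upgrade, unassigned replicas are expected
    (cluster.routing.allocation.enable=primaries), so YELLOW is tolerable.
    """
    if health is Health.GREEN:
        return True
    return health is Health.YELLOW and upgrade_in_progress

print(health_is_acceptable(Health.YELLOW, upgrade_in_progress=True))   # True
print(health_is_acceptable(Health.YELLOW, upgrade_in_progress=False))  # False
```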

github-actions[bot] commented 2 months ago

https://warthogs.atlassian.net/browse/DPE-4836

phvalguima commented 1 month ago

After internal discussion, it was agreed that we must have an all-green cluster.