I am running an upgrade, which got stuck while upgrading the last unit. Current status:
```
Model                                Controller           Cloud/Region         Version  SLA          Timestamp
test-small-deployment-upgrades-d1el  localhost-localhost  localhost/localhost  3.4.4    unsupported  19:57:46+02:00

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
opensearch                         blocked      3  opensearch                                 0  no       Upgrading. Verify highest unit is healthy & run `resume-upgrade` action. To rollback, `juju refresh` to last revision
self-signed-certificates           active       1  self-signed-certificates  latest/stable  155  no

Unit                         Workload  Agent  Machine  Public address  Ports     Message
opensearch/0*                active    idle   0        10.115.236.8    9200/tcp  OpenSearch 2.12.0 running; Snap rev 44 (outdated); Charmed operator 1+8ac22f7+8ac22f7-dirty
opensearch/1                 active    idle   1        10.115.236.188  9200/tcp  OpenSearch 2.12.0 running; Snap rev 44 (outdated); Charmed operator 1+8ac22f7+8ac22f7-dirty
opensearch/2                 blocked   idle   2        10.115.236.27   9200/tcp  Rollback with `juju refresh`. Pre-upgrade check failed: Cluster health is yellow instead of green
self-signed-certificates/0*  active    idle   3        10.115.236.97

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.115.236.8    juju-4193d3-0  ubuntu@22.04      Running
1        started  10.115.236.188  juju-4193d3-1  ubuntu@22.04      Running
2        started  10.115.236.27   juju-4193d3-2  ubuntu@22.04      Running
3        started  10.115.236.97   juju-4193d3-3  ubuntu@22.04      Running
```
When looking at the shards, I can see:
```
series_index                      0 p STARTED    270  1.1mb   10.115.236.188  opensearch-1
series_index                      0 r UNASSIGNED
.plugins-ml-config                0 r STARTED      1  3.9kb   10.115.236.188  opensearch-1
.plugins-ml-config                0 p STARTED      1  3.9kb   10.115.236.8    opensearch-0
.opensearch-observability         0 r STARTED      0  208b    10.115.236.188  opensearch-1
.opensearch-observability         0 p STARTED      0  208b    10.115.236.8    opensearch-0
.opensearch-sap-log-types-config  0 r STARTED                 10.115.236.188  opensearch-1
.opensearch-sap-log-types-config  0 p STARTED                 10.115.236.8    opensearch-0
.opendistro_security              0 r STARTED     10  50.2kb  10.115.236.188  opensearch-1
.opendistro_security              0 p STARTED     10  61.1kb  10.115.236.8    opensearch-0
.charm_node_lock                  0 r STARTED      1  3.9kb   10.115.236.188  opensearch-1
.charm_node_lock                  0 p STARTED      1  3.9kb   10.115.236.8    opensearch-0
```
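Spotting the stuck shard in that listing can be automated; a minimal sketch that filters the whitespace-separated `_cat/shards` text shown above (the helper name is illustrative, not part of the charm):

```python
def unassigned_shards(cat_shards_output: str) -> list[tuple[str, int, str]]:
    """Return (index, shard_number, prirep) rows whose state is UNASSIGNED.

    Parses the plain-text output of `GET _cat/shards`, where the first
    four columns are: index, shard, prirep (p/r), and state.
    """
    rows = []
    for line in cat_shards_output.strip().splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            rows.append((fields[0], int(fields[1]), fields[2]))
    return rows
```

On the output above this reports the single unassigned replica of `series_index`.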
The replica shard cannot be moved from the departed opensearch-2 to opensearch-1, because the cluster is in the middle of the upgrade process and hence cannot allocate anything but primary shards:
```json
{
  "index": "series_index",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2024-07-04T17:45:23.497Z",
    "details": "node_left [pOfiAYQVR2ulZxhU2IZMrg]",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "CUMMKezDSoyq7gy-DBheTQ",
      "node_name": "opensearch-0",
      "transport_address": "10.115.236.8:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "enable",
          "decision": "NO",
          "explanation": "replica allocations are forbidden due to cluster setting [cluster.routing.allocation.enable=primaries]"
        }
      ]
    },
    {
      "node_id": "mB6bq4wxTZG4LHQXUKODWg",
      "node_name": "opensearch-1",
      "transport_address": "10.115.236.188:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "enable",
          "decision": "NO",
          "explanation": "replica allocations are forbidden due to cluster setting [cluster.routing.allocation.enable=primaries]"
        },
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[series_index][0], node[mB6bq4wxTZG4LHQXUKODWg], [P], s[STARTED], a[id=AZbeqvTdRDSggGXzi72hfw]]"
        }
      ]
    }
  ]
}
```
We should add a flag to the precheck: we should accept both `Health.GREEN` and `Health.YELLOW` statuses if we are in the middle of the upgrade, but only `Health.GREEN` if we are running the action right before the upgrade itself.
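A minimal sketch of the proposed flag, assuming a health enum like the one in the messages above (the function name and signature are illustrative, not the charm's actual API):

```python
from enum import Enum


class Health(str, Enum):
    """Cluster health as reported by GET _cluster/health."""
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def health_precheck_passes(status: Health, upgrade_in_progress: bool) -> bool:
    """Decide whether the health precheck should pass.

    Before the upgrade starts we require GREEN, since a YELLOW cluster
    loses redundancy as soon as the first node restarts. While the
    upgrade is in progress, YELLOW is expected (replica allocation is
    restricted to primaries), so it must not block the remaining units.
    """
    if upgrade_in_progress:
        return status in (Health.GREEN, Health.YELLOW)
    return status is Health.GREEN
```

With this flag, opensearch/2 would not be blocked by the expected YELLOW state mid-upgrade, while a YELLOW cluster would still stop an upgrade from starting.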