phvalguima closed this issue 3 months ago
The only solution was to remove the machines with --force. Using force=True works with neither remove-application nor remove-unit.
@phvalguima can you provide the full DEBUG-level log? Without it, it's not possible to determine whether this is a lock issue or something else.
Tried reproducing by deploying 3 units and removing the application.
Results:
first unit scales down correctly
second unit acquires the lock
second unit fails to release the lock. Guess: the last remaining unit is not online? Or something else is causing no_shard_available_action_exception
_cluster/health endpoint shows
{
"cluster_name": "opensearch-phav",
"status": "red",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"discovered_master": true,
"discovered_cluster_manager": true,
"active_primary_shards": 2,
"active_shards": 2,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 3,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 40.0
}
_nodes endpoint shows
.charm_node_lock/_source/0 endpoint shows
{
"error": {
"root_cause": [
{
"type": "no_shard_available_action_exception",
"reason": "No shard available for [get [.charm_node_lock][0]: routing [null]]"
}
],
"type": "no_shard_available_action_exception",
"reason": "No shard available for [get [.charm_node_lock][0]: routing [null]]"
},
"status": 503
}
.charm_node_lock endpoint shows
{
".charm_node_lock": {
"aliases": {},
"mappings": {
"properties": {
"unit-name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"settings": {
"index": {
"replication": {
"type": "DOCUMENT"
},
"number_of_shards": "1",
"auto_expand_replicas": "0-all",
"provided_name": ".charm_node_lock",
"creation_date": "1713856271230",
"number_of_replicas": "0",
"uuid": "0A75oySnQIiw0ur0XdfTbw",
"version": {
"created": "136337827"
}
}
}
}
}
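For context, the 503 above is what the charm hits when it tries to read the lock document. Below is a minimal sketch, assuming the host/credentials from the curl calls in this thread and using plain requests rather than the charm's own request helper, of what that read looks like:

import requests

# Read the lock document via the same endpoint the charm's _unit_with_lock() uses.
# Host and credentials are placeholders; verify=False mirrors the --insecure curl calls.
HOST = "https://10.139.243.53:9200"

resp = requests.get(
    f"{HOST}/.charm_node_lock/_source/0",
    auth=("admin", "password"),
    verify=False,
)
if resp.status_code == 503:
    # With the primary shard unassigned, OpenSearch cannot route the get request.
    print("lock document unreadable:", resp.json()["error"]["root_cause"][0]["type"])
else:
    resp.raise_for_status()
    print("unit holding the lock:", resp.json().get("unit-name"))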
Theory: the root cause of the issue could be that, when scaling from 2 -> 1 units, a new cluster manager is not elected, since the solution for #230 (described here: https://chat.canonical.com/canonical/pl/ahuxubh4u7dbprufehp5s81s5r) is not implemented
more info: https://chat.canonical.com/canonical/pl/jzu7nqu5n7gdmq3joufsgft5oh
I did some digging, and this is essentially related to the node allocation exclusion on the departing node that holds the primary shard of the locking index. The charm moves through the stop procedure too quickly for OpenSearch to take the allocation exclusion into account and relocate the primary shard of the locking index ==> primary shard lost ==> cluster health red.
This is addressed and fixed in https://github.com/canonical/opensearch-operator/pull/175, where we block until the allocation exclusion fully takes effect.
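As an illustration of that approach, here is a minimal sketch of blocking until the exclusion has taken effect, i.e. polling _cat/shards until the departing node no longer holds any shards. Endpoint choice, function and parameter names are mine for illustration and are not the code from #175:

import time
import requests

def wait_for_shards_drained(host: str, node_name: str, timeout: int = 300) -> None:
    # Wait until OpenSearch has relocated all shards off the departing node.
    deadline = time.time() + timeout
    while time.time() < deadline:
        shards = requests.get(
            f"{host}/_cat/shards?format=json",
            auth=("admin", "password"),
            verify=False,
        ).json()
        # Keep waiting while any shard (including .charm_node_lock's primary)
        # is still assigned to the node excluded from allocation.
        if not any(shard.get("node") == node_name for shard in shards):
            return
        time.sleep(5)
    raise TimeoutError(f"shards were not relocated off {node_name} in time")

Only once such a wait returns would it be safe to actually stop the OpenSearch service on that unit.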
Able to reliably reproduce by:
curl --insecure -XGET https://admin:password@10.139.243.53:9200/_cat/shards/.charm_node_lock
juju remove-application opensearch
all units except the debug-hooks unit will successfully shut down & release the OpenSearch lock
curl --insecure -XGET https://admin:password@10.139.243.53:9200/_cat/shards/.charm_node_lock
will show that the primary shard is unassigned
exit
if the event was successful, repeat; if not successful, run ./dispatch again (to simulate a juju retry) and you'll see an error:
2024-04-29 12:32:14,242 ERROR Error checking which unit has OpenSearch lock
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 272, in request
resp = call(urls[0])
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 224, in call
for attempt in Retrying(
File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/tenacity/__init__.py", line 347, in __iter__
do = self.iter(retry_state=retry_state)
File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/tenacity/__init__.py", line 325, in iter
raise retry_exc.reraise()
File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/tenacity/__init__.py", line 158, in reraise
raise self.last_attempt.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 251, in call
response.raise_for_status()
File "/var/lib/juju/agents/unit-opensearch-0/charm/venv/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://10.139.243.53:9200/.charm_node_lock/_source/0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 297, in acquired
unit = self._unit_with_lock(host)
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 199, in _unit_with_lock
document_data = self._opensearch.request(
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 284, in request
raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
reproduced using #263 PR branch
also worth noting that the cluster manager switchover worked fine from 2 -> 1 units in this test (confirmed with _cat/cluster_manager)
Before remove-application
$ curl --insecure -XGET https://admin:password@10.139.243.53:9200/_cat/shards/.charm_node_lock
.charm_node_lock 0 p STARTED 0 15.6kb 10.139.243.193 opensearch-1
.charm_node_lock 0 r STARTED 0 15.6kb 10.139.243.53 opensearch-0
.charm_node_lock 0 r STARTED 0 14.6kb 10.139.243.205 opensearch-2
$ curl --insecure -XGET https://admin:password@10.139.243.53:9200/_cat/cluster_manager
4yCZHb-tSeacR8TLRpTNmA 10.139.243.193 10.139.243.193 opensearch-1
With two units left
$ curl --insecure -XGET https://admin:password@10.139.243.53:9200/_cat/shards/.charm_node_lock
.charm_node_lock 0 p STARTED 0 23.7kb 10.139.243.193 opensearch-1
.charm_node_lock 0 r UNASSIGNED
$ curl --insecure -XGET https://admin:password@10.139.243.53:9200/_cat/cluster_manager
4yCZHb-tSeacR8TLRpTNmA 10.139.243.193 10.139.243.193 opensearch-1
After remove-application, with one unit remaining
$ curl --insecure -XGET https://admin:password@10.139.243.53:9200/_cat/shards/.charm_node_lock
.charm_node_lock 0 p UNASSIGNED
$ curl --insecure -XGET https://admin:password@10.139.243.53:9200/_cat/cluster_manager
Yx7fjlKoQh-BNZyL1juprw 10.139.243.53 10.139.243.53 opensearch-0
(debug-hooks ran on unit 0)
Trying to reproduce the issue with the current main (revision 95 in channel 2/edge) no longer shows the error. The assumption is that it was resolved by one of the recent fixes, presumably https://github.com/canonical/opensearch-operator/pull/312.
So, #312 will affect the lock release. Here, the issue was reaching out to the cluster at all, so we can acquire the lock or check its state.
I think what may have changed between now and then is the error handling when issuing a request.
Hi @phvalguima, could you please explain what you mean by that?
I've further investigated here, especially the error seen in juju debug-log:
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
This error happens here: https://github.com/canonical/opensearch-operator/blob/main/lib/charms/opensearch/v0/opensearch_locking.py#L241-L243
Instead of returning False when it can't check the node lock (in OpenSearch), shouldn't it instead try to acquire the peer databag lock as a fallback?
I've tested this locally now, and it seems to be working fine. What do you think, @phvalguima?
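Roughly, the change I tested looks like the following sketch (illustrative names only; the real method lives in opensearch_locking.py and the peer-databag attribute may be named differently there):

from charms.opensearch.v0.opensearch_exceptions import OpenSearchHttpError

def acquired(self, host=None) -> bool:
    try:
        unit = self._unit_with_lock(host)
    except OpenSearchHttpError:
        # The lock index is unreachable (503 / no_shard_available_action_exception),
        # so we cannot tell who holds the OpenSearch-side lock. Fall back to the
        # peer relation databag lock instead of returning False.
        return self._peer_databag_lock.acquired  # hypothetical attribute name
    return unit == self._charm.unit.name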
With https://github.com/canonical/opensearch-operator/pull/272 we added a workaround to avoid this deadlock situation when removing an application. The observed behaviour no longer keeps the application from being removed.
Nevertheless, the root cause will need to be investigated further. This will be done in https://github.com/canonical/opensearch-operator/issues/327.
Seems it is not possible to remove the opensearch application without --force anymore. In the end, it ends with 2x opensearch units, both in error with:
Full logs: https://pastebin.ubuntu.com/p/vHxJX9rWdr/
Core error being: