canonical / opensearch-operator

OpenSearch operator

Shards don't get assigned when the primary gets removed and only two units are left #327

Open · reneradoi opened this issue 3 weeks ago

reneradoi commented 3 weeks ago

Steps to reproduce

Expected behavior

If there are only two units left, one of them should take over as primary for the unassigned shards.
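
The shard state can be checked directly against the cluster. Below is a minimal sketch (not charm code; host and credentials are placeholders) that lists the shards of .charm_node_lock so the state of the primary is visible:

import requests

HOST = "https://192.168.235.252:9200"   # any remaining unit
AUTH = ("admin", "<password>")          # placeholder credentials

resp = requests.get(
    f"{HOST}/_cat/shards/.charm_node_lock?format=json",
    auth=AUTH,
    verify=False,  # test deployments typically use self-signed certificates
)
resp.raise_for_status()
for shard in resp.json():
    # Expected: the primary ("prirep" == "p") is STARTED on one of the two
    # remaining units; in this bug it stays UNASSIGNED instead.
    print(shard["shard"], shard["prirep"], shard["state"], shard.get("node"))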

Actual behavior

No unit takes over the primary role for the unassigned shards. In particular, the .charm_node_lock index may stay unavailable, so no further locks for scaling down (or up) can be acquired.
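
To see why none of the remaining units is picked as the new primary, the cluster allocation explain API can be queried for the affected shard. Again a rough sketch with placeholder connection details, not charm code:

import json
import requests

HOST = "https://192.168.235.252:9200"
AUTH = ("admin", "<password>")

resp = requests.post(
    f"{HOST}/_cluster/allocation/explain",
    json={"index": ".charm_node_lock", "shard": 0, "primary": True},
    auth=AUTH,
    verify=False,
)
# "unassigned_info" and the per-node "deciders" in the response explain why
# the primary is not allocated to one of the remaining units.
print(json.dumps(resp.json(), indent=2))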

Log output

Juju debug log:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 272, in request
    resp = call(urls[0])
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 224, in call
    for attempt in Retrying(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 251, in call
    response.raise_for_status()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://192.168.235.252:9200/.charm_node_lock/_source/0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 467, in _on_opensearch_data_storage_detaching
    self.node_lock.release()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 327, in release
    if self._unit_with_lock(host) == self._charm.unit.name:
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 199, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 284, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}

Additional context

Also see this issue, where some investigations and workarounds are discussed.
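
For completeness, a typical manual workaround for an unassigned primary of a small system index like .charm_node_lock (an assumption here, not necessarily the workaround discussed in the linked issue) is to force-allocate it with the cluster reroute API. Note that allocate_empty_primary recreates the shard empty, so the lock document is lost and would have to be re-created:

import requests

HOST = "https://192.168.235.252:9200"
AUTH = ("admin", "<password>")

requests.post(
    f"{HOST}/_cluster/reroute",
    json={
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": ".charm_node_lock",
                    "shard": 0,
                    "node": "opensearch-1",  # placeholder: any remaining node
                    "accept_data_loss": True,
                }
            }
        ]
    },
    auth=AUTH,
    verify=False,
).raise_for_status()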

github-actions[bot] commented 3 weeks ago

https://warthogs.atlassian.net/browse/DPE-4603

phvalguima commented 6 days ago

This could be linked to #324

reneradoi commented 1 day ago

Suggested resolution: when nodes are removed from the peer relation and only two nodes remain, one of them should be added to the voting exclusions in order to avoid split-brain situations.
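
A rough sketch of what that could look like against the cluster API (an assumption about the proposed fix, not existing charm code; the charm would go through its own request wrapper, and the node names and connection details below are placeholders):

import requests

HOST = "https://192.168.235.252:9200"
AUTH = ("admin", "<password>")
EXCLUDED = "opensearch-1"  # placeholder: the unit to exclude from voting

# Exclude one of the two remaining nodes from the voting configuration so the
# other can safely keep acting as cluster manager.
requests.post(
    f"{HOST}/_cluster/voting_config_exclusions?node_names={EXCLUDED}",
    auth=AUTH,
    verify=False,
).raise_for_status()

# Once the cluster is back to a safe size, the exclusion should be cleared again:
# requests.delete(f"{HOST}/_cluster/voting_config_exclusions?wait_for_removal=false",
#                 auth=AUTH, verify=False)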