canonical / opensearch-operator

OpenSearch operator

One or more replica shards... #324

Closed: juditnovak closed this issue 1 month ago

juditnovak commented 3 months ago

Steps to reproduce

  1. Particularly loaded host system
  2. Check out opensearch-dashboards-operator and run the pipeline:
    tox run -e integration -- tests/integration/test_upgrade.py --model testing --keep-models

Expected behavior

No errors

Actual behavior

See the attached screenshots. The problem was permanent: the system did not recover (as the timestamps at the top indicate).

Screenshot from 2024-06-07 15-31-41

Screenshot from 2024-06-07 15-45-11

Versions

Operating system: jammy

Juju CLI: 3.1.8-genericlinux-amd64

Juju agent: 3.1.8

Charm revision: most likely 90 or 99 (98 is also possible if Charmhub caching was involved)

LXD: 5.0.3 (?)

Log output

(Log output attached as eight screenshots, timestamped 2024-06-07 15:21 through 15:47.)

Additional context

github-actions[bot] commented 3 months ago

https://warthogs.atlassian.net/browse/DPE-4575

phvalguima commented 3 months ago

I am seeing the same problem with upgrades. I believe this is caused by GH runner disk usage crossing OpenSearch's disk watermark threshold, which prevents unassigned shards from being allocated. Check this comment: https://github.com/canonical/opensearch-operator/pull/319#issuecomment-2156177690
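
For anyone hitting the same symptom: a quick way to check this theory is to compare per-node disk usage against the cluster's watermark settings. A minimal sketch, reusing the <PWD> and <IP> placeholders used elsewhere in this thread:

    # Per-node disk usage and shard counts
    curl -sk -u admin:<PWD> "https://<IP>:9200/_cat/allocation?v"

    # Current disk watermark settings, defaults included
    curl -sk -u admin:<PWD> "https://<IP>:9200/_cluster/settings?include_defaults=true&flat_settings=true" | grep disk.watermark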

phvalguima commented 3 months ago

Sorry, the merge above should not have closed this issue. I want to investigate it further.

phvalguima commented 3 months ago

Hi @juditnovak, I tried this test scenario twice and cannot reproduce it on my own machine. If you are able to reproduce it, can you provide two pieces of information:

  1. Shard status: curl -sk -u admin:<PWD> https://<IP>:9200/_cat/shards
  2. Cluster allocation explain, especially for any unassigned shards seen above: curl -XGET -H 'Content-Type: application/json' -sk -u admin:<PWD> https://<IP>:9200/_cluster/allocation/explain -d '{ "index": "TARGET_INDEX" }' (a copy-pasteable version of both calls is sketched below)
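
A copy-pasteable version of both calls; the shard and primary fields in the explain body are an assumption about which shard to inspect (adjust them to whatever _cat/shards reports as unassigned):

    # 1. Shard status, including why a shard is unassigned
    curl -sk -u admin:<PWD> "https://<IP>:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node"

    # 2. Allocation explanation for a specific shard of the target index
    curl -XGET -H 'Content-Type: application/json' -sk -u admin:<PWD> \
      "https://<IP>:9200/_cluster/allocation/explain" \
      -d '{ "index": "TARGET_INDEX", "shard": 0, "primary": false }'
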
juditnovak commented 3 months ago

Sure, I'll definitely do that. I foresee running similar pipelines locally quite a bit, so we can confirm whether the issue occurs again.

phvalguima commented 3 months ago

Thanks @juditnovak. Let's leave this issue open for now, so we can come back here if we ever see the same issue happening somewhere else.

juditnovak commented 2 months ago

This issue is still occurring as of today (rev 120). It has actually gotten worse :-(

juditnovak commented 2 months ago

Even worse... it's now happening for 3-unit installations :-( (latest revision still 120)

https://github.com/canonical/opensearch-dashboards-operator/actions/runs/10212627790/job/28256884240#step:26:112

reneradoi commented 1 month ago

This was fixed with https://github.com/canonical/opensearch-operator/pull/387: the operator now waits for all shards to be moved to other nodes before shutting down OpenSearch.
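
For context, the general drain-before-shutdown technique looks roughly like the sketch below. This is an illustration only, not the actual implementation in the PR; NODE_TO_STOP, <PWD>, and <IP> are placeholders:

    # Exclude the node that is about to stop from shard allocation
    curl -sk -u admin:<PWD> -XPUT -H 'Content-Type: application/json' \
      "https://<IP>:9200/_cluster/settings" \
      -d '{ "transient": { "cluster.routing.allocation.exclude._name": "NODE_TO_STOP" } }'

    # Poll until that node no longer holds any shards, then it is safe to stop OpenSearch
    while curl -sk -u admin:<PWD> "https://<IP>:9200/_cat/shards?h=node" | grep -q "NODE_TO_STOP"; do
      sleep 5
    done

    # Once the node is back, clear the exclusion so shards can rebalance onto it again
    curl -sk -u admin:<PWD> -XPUT -H 'Content-Type: application/json' \
      "https://<IP>:9200/_cluster/settings" \
      -d '{ "transient": { "cluster.routing.allocation.exclude._name": null } }'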