canonical / opensearch-operator

OpenSearch operator

[DPE-2119] fix issues and tests when reusing storage #272

Closed. reneradoi closed this pull request 3 months ago.

reneradoi commented 5 months ago

Issue

When attaching existing storage to a new unit, two issues occur:

Solution

Integration Testing

Tests for attaching existing storage can be found in integration/ha/test_storage.py. There are now three test cases:

  1. test_storage_reuse_after_scale_down: remove one unit from the deployment, then add a new one that reuses the storage of the removed unit. Check that the continuous writes are ok and that a test file created initially is still there (a minimal sketch of this scenario follows the list).
  2. test_storage_reuse_after_scale_to_zero: remove both units from the deployment while keeping the application, then add two new units that use the storage again. Check the continuous writes.
  3. test_storage_reuse_in_new_cluster_after_app_removal: starting from a cluster of three units, remove all of them and remove the application. Deploy a new application (with one unit) to the same model, attach the storage, then add two more units with the other storage volumes. Check the continuous writes.
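
A minimal sketch of the first scenario, assuming a pytest-operator setup (ops_test fixture); the application name and the helper functions are illustrative, not the actual test code:

import pytest
from pytest_operator.plugin import OpsTest

APP_NAME = "opensearch"  # illustrative application name


@pytest.mark.abort_on_fail
async def test_storage_reuse_after_scale_down(ops_test: OpsTest) -> None:
    unit = ops_test.model.applications[APP_NAME].units[-1]

    # Remember which Juju storage volume the unit was using (hypothetical helper).
    storage_id = await get_storage_id(ops_test, unit.name)

    # Remove the unit; by default the storage volume is detached, not destroyed.
    await ops_test.juju("remove-unit", unit.name)
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active")

    # Add a new unit that re-attaches the previously used storage.
    await ops_test.juju("add-unit", APP_NAME, "--attach-storage", storage_id)
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active")

    # Continuous writes must be consistent and the initially created test file
    # must still be there (both assertions use hypothetical helpers).
    assert await continuous_writes_consistent(ops_test)
    assert await initial_test_file_exists(ops_test)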

Other Issues

As part of this PR, another issue is addressed: https://github.com/canonical/opensearch-operator/issues/306. It is resolved with this commit: https://github.com/canonical/opensearch-operator/pull/272/commits/19f843c201504f89025cc3442cbdc8e8e69bec9f

reneradoi commented 5 months ago

This PR updates the installed snap to the fixed version, which can be installed on top of existing storage and does not overwrite the data in snap_common.
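
As an illustration of the intended behaviour (a sketch only; the path and function name are assumptions, not the operator's actual code), the charm would only seed the data directory when it is empty:

from pathlib import Path

# Illustrative snap_common data path; the real layout may differ.
SNAP_COMMON_DATA = Path("/var/snap/opensearch/common/var/lib/opensearch")


def should_initialize_data_dir() -> bool:
    """Seed the data directory only if it is empty, so reused storage is never overwritten."""
    return not SNAP_COMMON_DATA.exists() or not any(SNAP_COMMON_DATA.iterdir())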

Other content of this PR:

The integration test test_storage_reuse_after_scale_down is now working. I've:

The test case test_storage_reuse_in_new_cluster_after_app_removal is being skipped. Currently this test fails every time; I first need to understand how this is supposed to work before fixing it.
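
A sketch of how such a skip typically looks (the reason text is illustrative, not the exact marker in the branch):

import pytest


@pytest.mark.skip(reason="fails consistently; storage reuse after app removal not yet understood")
async def test_storage_reuse_in_new_cluster_after_app_removal(ops_test) -> None:
    ...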

reneradoi commented 4 months ago

The PR is now ready for review again.

reneradoi commented 3 months ago

LGTM. My only remark is that we need to dig more into what is causing the 503.

In a separate PR, we should log 503 errors and their allocation explanation, for example:

GET _cluster/allocation/explain
{
  "index": ".charm-node-lock",
  "shard": 0,
  "primary": true
}
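
A minimal sketch of that logging, assuming a generic request helper on the charm's OpenSearch abstraction (the helper name and signature are assumptions, not the charm's actual API):

import logging

logger = logging.getLogger(__name__)


def log_allocation_explanation(opensearch, index: str) -> None:
    """On a 503, fetch and log why shards of the given index are unassigned."""
    explanation = opensearch.request(  # assumed helper wrapping the REST API
        "GET",
        "/_cluster/allocation/explain",
        payload={"index": index, "shard": 0, "primary": True},
    )
    logger.error("503 from OpenSearch; allocation explanation: %s", explanation)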

Yes, that's true. I'll add another issue or a draft PR; I will think about it.