canonical / opensearch-operator

OpenSearch operator

[DPE-2119] fix issues and tests when reusing storage #272

Closed. reneradoi closed this pull request 3 months ago.

reneradoi commented 5 months ago

Issue

When attaching existing storage to a new unit, two issues occur:

Solution

Integration Testing

Tests for attaching existing storage can be found in integration/ha/test_storage.py. There are now three test cases:

  1. test_storage_reuse_after_scale_down: remove one unit from the deployment, then add a new one that reuses the storage of the removed unit. Check that the continuous writes are ok and that a test file created initially is still there (a minimal sketch of this scenario follows the list).
  2. test_storage_reuse_after_scale_to_zero: remove both units from the deployment while keeping the application, then add two new units that use the storage again. Check the continuous writes.
  3. test_storage_reuse_in_new_cluster_after_app_removal: starting from a cluster of three units, remove all of them and remove the application. Deploy a new application (with one unit) to the same model, attach the storage, then add two more units with the other storage volumes. Check the continuous writes.
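
A minimal sketch of the first scenario, assuming a pytest-operator setup (ops_test fixture); the application name and the helper functions are illustrative, not the actual test code:

import pytest
from pytest_operator.plugin import OpsTest

APP_NAME = "opensearch"  # illustrative application name


@pytest.mark.abort_on_fail
async def test_storage_reuse_after_scale_down(ops_test: OpsTest) -> None:
    unit = ops_test.model.applications[APP_NAME].units[-1]

    # Remember which Juju storage volume the unit was using (hypothetical helper).
    storage_id = await get_storage_id(ops_test, unit.name)

    # Remove the unit; by default the storage volume is detached, not destroyed.
    await ops_test.juju("remove-unit", unit.name)
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active")

    # Add a new unit that re-attaches the previously used storage.
    await ops_test.juju("add-unit", APP_NAME, "--attach-storage", storage_id)
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active")

    # Continuous writes must be consistent and the initially created test file
    # must still be there (both assertions use hypothetical helpers).
    assert await continuous_writes_consistent(ops_test)
    assert await initial_test_file_exists(ops_test)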

Other Issues

As part of this PR, another issue is addressed: https://github.com/canonical/opensearch-operator/issues/306. It is resolved with this commit: https://github.com/canonical/opensearch-operator/pull/272/commits/19f843c201504f89025cc3442cbdc8e8e69bec9f

reneradoi commented 5 months ago

This PR updates the installed snap to the fixed version, which can be installed on top of existing storage and does not overwrite the data in snap_common.
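
As an illustration of the intended behaviour (a sketch only; the path and function name are assumptions, not the operator's actual code), the charm would only seed the data directory when it is empty:

from pathlib import Path

# Illustrative snap_common data path; the real layout may differ.
SNAP_COMMON_DATA = Path("/var/snap/opensearch/common/var/lib/opensearch")


def should_initialize_data_dir() -> bool:
    """Seed the data directory only if it is empty, so reused storage is never overwritten."""
    return not SNAP_COMMON_DATA.exists() or not any(SNAP_COMMON_DATA.iterdir())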

Other content of this PR:

The integration test test_storage_reuse_after_scale_down is now working. I've:

The test case test_storage_reuse_in_new_cluster_after_app_removal is being skipped. Currently this test fails every time; I first need to understand how this is supposed to work before fixing it.
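
A sketch of how such a skip typically looks (the reason text is illustrative, not the exact marker in the branch):

import pytest


@pytest.mark.skip(reason="fails consistently; storage reuse after app removal not yet understood")
async def test_storage_reuse_in_new_cluster_after_app_removal(ops_test) -> None:
    ...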

reneradoi commented 4 months ago

The PR is now ready for review again.

reneradoi commented 3 months ago

LGTM. My only remark is that we need to dig more into what is causing the 503.

In a separate PR, we should log 503 errors and their allocation explanation, for example:

GET _cluster/allocation/explain
{
  "index": ".charm-node-lock",
  "shard": 0,
  "primary": true
}
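
A minimal sketch of that logging, assuming a generic request helper on the charm's OpenSearch abstraction (the helper name and signature are assumptions, not the charm's actual API):

import logging

logger = logging.getLogger(__name__)


def log_allocation_explanation(opensearch, index: str) -> None:
    """On a 503, fetch and log why shards of the given index are unassigned."""
    explanation = opensearch.request(  # assumed helper wrapping the REST API
        "GET",
        "/_cluster/allocation/explain",
        payload={"index": index, "shard": 0, "primary": True},
    )
    logger.error("503 from OpenSearch; allocation explanation: %s", explanation)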

Yes, that's true. I'll add another issue or a draft PR; I will think about it.