canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

[DPE-3661] Add support for large deployments backup #248

Closed phvalguima closed 2 months ago

phvalguima commented 2 months ago

Implements DPE-3661: extend backup feature to support large deployment scenarios.

Currently, there are four deployment scenarios:

  1. Small deployments: a single cluster performs all the different node roles
  2. Large deployments - orchestrator: the app is in charge not only of its own application units but also of coordinating the other clusters
  3. Large deployments - failover orchestrator: very similar to (2); this app must also publish its information in the peer relation, although all the clusters listen only to the active manager
  4. Large deployments - data only: these clusters perform no management tasks and receive any relevant information via the peer relation

For backups, clusters of types (3) and (4) behave differently: they receive the backup data via the peer-cluster relation and should refuse (i) to execute backup-related actions and (ii) to handle the s3-relation events themselves. The latter avoids confusion, e.g. when a user inadvertently relates the cluster to a different s3-integrator.

The implementations of (1) and (2) are very similar.
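As a rough illustration of the rules above, the four scenarios can be written as a small lookup table. This is only a sketch: the scenario labels, the `BackupBehavior` class and `check_action_allowed` are hypothetical names chosen for this example and are not taken from the charm's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupBehavior:
    """Hypothetical summary of how one deployment scenario treats backups."""

    runs_backup_actions: bool  # may execute create-backup, restore, etc.
    handles_s3_events: bool    # reacts to the s3 relation itself


# Scenario labels are illustrative; they mirror items (1) to (4) above.
BEHAVIOR = {
    "small": BackupBehavior(True, True),
    "main-orchestrator": BackupBehavior(True, True),
    "failover-orchestrator": BackupBehavior(False, False),
    "data-only": BackupBehavior(False, False),
}


def check_action_allowed(deployment_type: str) -> None:
    """Raise if this scenario must refuse backup-related actions."""
    if not BEHAVIOR[deployment_type].runs_backup_actions:
        raise PermissionError(
            f"{deployment_type}: backup actions must run on the main orchestrator"
        )
```

Scenarios (3) and (4) share one row shape, which matches the statement that both receive backup data only through the peer-cluster relation.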

It contains the same fix as: https://github.com/canonical/opensearch-operator/pull/253

Adds the following fixes related to testing in general:

  1. ContinuousWrites is updated to hold the right count of documents in writes_value internally
  2. Adds an is_burst option to ContinuousWrites: a test may choose to send documents in bursts of 100 instead of one at a time; is_burst defaults to True
  3. ContinuousWrites now terminates its process as part of stop, so no stranded process keeps generating docs in ContinuousWrites.INDEX_NAME after a given test
  4. start_and_check_continuous_writes is renamed to assert_start_and_check_continuous_writes
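The difference between burst and doc-by-doc writing, and why the internal count must follow the number of documents rather than the number of requests, can be sketched as follows. This is not the ContinuousWrites implementation: `doc_batches` and `simulate_writes` are hypothetical helpers, and nothing is sent to OpenSearch.

```python
from typing import Iterator, List

BURST_SIZE = 100  # burst size mentioned in the list above


def doc_batches(total: int, is_burst: bool = True) -> Iterator[List[int]]:
    """Yield lists of document ids, 100 at a time or one at a time."""
    step = BURST_SIZE if is_burst else 1
    for first in range(0, total, step):
        yield list(range(first, min(first + step, total)))


def simulate_writes(total: int, is_burst: bool = True) -> int:
    """Return the running document count a writer would keep."""
    writes_value = 0
    for batch in doc_batches(total, is_burst):
        # A real test would send `batch` to the cluster at this point.
        writes_value += len(batch)  # count documents, not requests
    return writes_value
```

With 250 documents, burst mode produces three batches and doc-by-doc mode produces 250, yet both end with the same document count.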

How To

Set up a large-scale deployment

Deploy a large-scale environment with:

juju deploy tls-certificates-operator --channel stable --show-log --verbose
juju config tls-certificates-operator generate-self-signed-certificates=true ca-common-name="CN_CA"

# deploy main-orchestrator cluster 
juju deploy -n 3 ./opensearch.charm \
    main \
    --config cluster_name="log-app" --config init_hold=false --config roles="cluster_manager"

# deploy failover-orchestrator cluster
juju deploy -n 2 ./opensearch.charm \
    failover \
    --config cluster_name="log-app" --config init_hold=true --config roles="cluster_manager"

# deploy data-hot cluster
juju deploy -n 2 ./opensearch.charm \
    data-hot \
    --config cluster_name="log-app" --config init_hold=true --config roles="data.hot"

# integrate TLS
juju integrate tls-certificates-operator main
juju integrate tls-certificates-operator failover
juju integrate tls-certificates-operator data-hot

# integrate the "main"-orchestrator with all clusters:
juju integrate main:peer-cluster-orchestrator failover:peer-cluster
juju integrate main:peer-cluster-orchestrator data-hot:peer-cluster

# integrate the "failover"-orchestrator with rest of clusters:
juju integrate failover:peer-cluster-orchestrator data-hot:peer-cluster

Connect with S3

This step assumes an s3-integrator charm has been successfully deployed and configured. In large deployments, backup is set up with the s3-integrator related solely to the main orchestrator application:

juju integrate s3-integrator main

Wait until the deployment settles. Backup and restore actions can then be executed by running them against the main orchestrator's leader:

juju run main/leader create-backup

In case of failover

If a failover must be triggered, follow the process described for failover of large deployments and also move the s3-integrator relation from the main cluster to the failover cluster:

# If the cluster still exists
juju remove-relation main s3-integrator

# Then, connect with the new cluster manager
juju integrate s3-integrator failover

Implementation Details

For developers, there is no meaningful difference between small and large deployments. They both use the same backup_factory() to return the correct object for their case.

Large deployments expand the original concept of OpenSearchBackup to include other juju applications that are not cluster_manager. This means a cluster may be data-only, or even a failover cluster-manager, and still interact with s3-integrator at a certain level.

The baseline is that every unit in the cluster must import the S3 credentials. The main orchestrator will share these credentials via the peer-cluster relation. Failover and data clusters will import that information from the peer-cluster relation.
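The credential flow described above can be pictured with plain dictionaries standing in for the peer-cluster relation data. This is a sketch under assumptions: the `s3_credentials` key, the field names and both helper functions are hypothetical, and a real charm may well transport these values through Juju secrets rather than plain relation data.

```python
import json
from typing import Dict, Optional

S3_KEY = "s3_credentials"  # hypothetical relation-data key


def publish_s3_credentials(
    databag: Dict[str, str], access_key: str, secret_key: str
) -> None:
    """Main orchestrator side: share the credentials with related clusters."""
    databag[S3_KEY] = json.dumps(
        {"access-key": access_key, "secret-key": secret_key}
    )


def import_s3_credentials(databag: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Failover / data side: read the credentials, if already published."""
    raw = databag.get(S3_KEY)
    return json.loads(raw) if raw else None
```

Returning `None` while nothing has been published lets a failover or data unit simply wait for a later relation-changed event instead of failing.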

To implement the points above without causing too much disruption to the existing code, a factory pattern has been adopted, where the main charm receives an OpenSearchBackupBase object that corresponds to its own case (cluster-manager, failover, data, etc).
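A minimal sketch of such a factory is shown below. The class names, the `deployment_type` strings and the `backup_factory_sketch` signature are assumptions made for illustration; the actual signature of backup_factory() and the real subclasses of OpenSearchBackupBase are not shown in this description.

```python
class SketchBackupBase:
    """Shared interface; by default a unit only imports S3 credentials."""

    def accepts_backup_actions(self) -> bool:
        return False


class SketchManagerBackup(SketchBackupBase):
    """Small deployments and the main orchestrator: owns the S3 relation."""

    def accepts_backup_actions(self) -> bool:
        return True


class SketchPeerClusterBackup(SketchBackupBase):
    """Failover and data clusters: credentials arrive via peer-cluster."""


def backup_factory_sketch(deployment_type: str) -> SketchBackupBase:
    """Return the backup object matching the unit's deployment scenario."""
    if deployment_type in ("small", "main-orchestrator"):
        return SketchManagerBackup()
    if deployment_type in ("failover-orchestrator", "data-only"):
        return SketchPeerClusterBackup()
    raise ValueError(f"unknown deployment type: {deployment_type}")
```

The charm code can then call the same methods on whatever object it receives, which keeps scenario-specific branching out of the main charm class.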