canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

Enable cluster manager only nodes #424

Closed reneradoi closed 2 months ago

reneradoi commented 2 months ago

Issue

Currently we always add the data role to a node if it is cluster-manager. This is required because otherwise the security index could not be initialized directly after startup of the first node.
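
For illustration, a minimal sketch of that workaround (not the charm's actual code; the function name is hypothetical and the role strings are assumed to follow OpenSearch's cluster_manager/data naming):

    # Minimal sketch (not the charm's actual code) of the workaround described
    # above: the "data" role is forced onto cluster-manager nodes so that the
    # very first node can initialize the security index on its own.
    def node_roles(requested_roles: list[str]) -> list[str]:
        roles = list(requested_roles)
        if "cluster_manager" in roles and "data" not in roles:
            roles.append("data")  # this forced default is what the PR removes
        return roles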

Solution

This PR provides a solution for enabling "cluster-manager-only" nodes in large deployments. The workaround for adding the data role by default is removed.

The solution is implemented according to this workflow:

For this, the data model of the PeerClusterApp is adjusted: the roles of the application are added, so that it is possible to check whether any app in the entire cluster fleet has the data role (this can be queried with ClusterTopology.data_role_in_cluster_fleet_apps()).
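
For illustration, a rough sketch of such a fleet-wide check (the actual data model and the signature of ClusterTopology.data_role_in_cluster_fleet_apps() in the charm may differ; the shapes shown here are assumptions):

    # Rough sketch only; the real PeerClusterApp model and ClusterTopology
    # method in the charm may look different.
    from dataclasses import dataclass, field

    @dataclass
    class PeerClusterApp:
        app_name: str
        roles: list[str] = field(default_factory=list)

    class ClusterTopology:
        @staticmethod
        def data_role_in_cluster_fleet_apps(fleet_apps: dict[str, PeerClusterApp]) -> bool:
            """Return True if any app in the cluster fleet carries the data role."""
            return any("data" in app.roles for app in fleet_apps.values())

With the forced data role gone, a cluster-manager-only app can use a check like this to wait until a related app actually contributes the data role before initializing the security index.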

reneradoi commented 2 months ago

Hey @Mehdi-Bendriss, thank you for the feedback! This is very valuable, as the current design differs from your expectations given below.

Currently, when there is no data node, the cluster-manager does get started, but without running the post_start_init. That means the node is running, but not initialized. The start hook gets deferred until the first data node is up and initialized (or endlessly, if that never happens).

When a data node joins, it currently starts independently, initializes the security index and, as part of that startup, joins the cluster (by contacting the already running cluster-manager). On the next re-emitted start hook, the cluster-manager then fully comes up.

I chose this design because I found that the current code has many situations on startup where, if the node is not fully up, the start event gets deferred and the update to the large-deployment relation never happens. That's why I wanted to let the nodes start independently and let the data node find the cluster once it is up.
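
Roughly, the design described above looks like this (a sketch only, not the charm's actual handler; the helper names are hypothetical and the method is assumed to live on the charm class):

    # Sketch of the current behaviour described above, not the charm's code.
    def _on_start(self, event):
        self._start_service()  # the cluster-manager process itself comes up
        if not self._security_index_initialized():
            # No data node has come up and initialized the security index yet,
            # so post-start initialization cannot run. Defer and retry on the
            # re-emitted start hook (possibly forever, if no data node joins).
            event.defer()
            return
        self._post_start_init()  # the node fully comes up and reports active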

I will try to rework that according to your given specification below.

Thanks!

Thanks René. I mainly have questions about the synchronisation of the start sequence of all apps / nodes.

Can you confirm the following flow when there is no data node in the cluster:

  1. when no data node in the application:

    1. no opensearch starts
    2. the start hook gets deferred endlessly
  2. when a data node joins - through large deployment relations:

    1. we become aware of it through the fleet_apps object
    2. we then start the leader unit of the main orchestrator
    3. which then notifies the large deployment relations
    4. the leader unit of the data cluster starts too
    5. start flow of the rest of the fleet resumes as usual
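
A minimal sketch of this flow (not the charm's actual code; an ops-style start handler is assumed, and the helper names and fleet_apps shape are hypothetical):

    # Sketch of the requested flow: do not start OpenSearch at all until a
    # data role is known somewhere in the cluster fleet.
    def _on_start(self, event):
        if not ClusterTopology.data_role_in_cluster_fleet_apps(self.fleet_apps):
            # 1.1 / 1.2: no opensearch starts, the hook is deferred endlessly
            event.defer()
            return
        # 2.1: a data app became known through the fleet_apps object
        if self.unit.is_leader():
            # 2.2 / 2.3: start the main orchestrator's leader unit and notify
            # the large deployment relations, so that the leader unit of the
            # data cluster (2.4) and the rest of the fleet (2.5) start too
            self._start_service()
            self._post_start_init()
            self._notify_peer_cluster_relations()
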
reneradoi commented 2 months ago

Hey @juditnovak! Thanks for the comment. I agree, these are not just "minor tweaks" or code adjustments. Sometimes this can get lost when one is deeply invested in a topic.

Could we provide at least a basic description of the logic? I.e. documentation within the code, as was done before (CA rotation).

[...]

If we could at least hold on to the earlier objective that each PR adds two unit tests to the test suite... (I volunteer to add them if there is no other way.)

I've added a comment explaining the workflow at the point where it starts, and also a few unit tests to document the changed behaviour. I hope this is fine for you.

skourta commented 2 months ago

Nice work @reneradoi. I tested the workflow and it follows exactly what @Mehdi-Bendriss described. I have a couple of notes:

  1. Once the main cluster manager node is up and running, we add a failover node. The failover node's status is set to "waiting" with the "requesting lock" message. This is misleading, as it is actually blocked waiting for data nodes to join and for the main cluster manager to be initialized.
  2. When you deploy the data nodes, they go into an active/idle state with no message. This is also misleading, as they are waiting to be integrated with the cluster manager. I think we should change the state and add a message clarifying what is happening.
reneradoi commented 2 months ago

Hey @skourta thank you for your review! It's good that you deployed it and especially watched the status and messages!

The currently expected statuses and messages are documented in the integration test here:

        apps_full_statuses={
            # app-level statuses: the main app is blocked until a data node
            # exists, failover and data apps until the peer-cluster relation is set up
            MAIN_APP: {"blocked": [PClusterNoDataNode]},
            FAILOVER_APP: {"blocked": [PClusterNoRelation]},
            DATA_APP: {"blocked": [PClusterNoRelation]},
        },
        units_full_statuses={
            # unit-level statuses: only the main app's units repeat the blocking
            MAIN_APP: {"units": {"blocked": [PClusterNoDataNode]}},
            FAILOVER_APP: {"units": {"active": []}},
            DATA_APP: {"units": {"active": []}},
        }

The blocking of the data and failover applications is shown on the app status (this was discussed with @Mehdi-Bendriss earlier). This status persists until the applications are related to one another via the peer-cluster relation. Having the same message on the unit status would be redundant, from my point of view.
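
For illustration, a rough sketch of how such app-level statuses could be reported (assuming the ops framework; the helper methods are hypothetical, while PClusterNoDataNode and PClusterNoRelation are the message constants referenced in the test excerpt above):

    # Rough sketch, not the charm's actual code. In Juju/ops only the leader
    # unit may set the application status.
    from ops.model import ActiveStatus, BlockedStatus

    def _update_app_status(self):
        if not self.unit.is_leader():
            return
        if self.is_main_orchestrator() and not self.data_role_in_fleet():
            self.app.status = BlockedStatus(PClusterNoDataNode)
        elif not self.has_peer_cluster_relation():
            self.app.status = BlockedStatus(PClusterNoRelation)
        else:
            self.app.status = ActiveStatus()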