canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

[RFE] Simplifying `_on_start` and removing deferrals #465

Open phvalguima opened 3 weeks ago

phvalguima commented 3 weeks ago

Currently, `_on_start` is a method called when the charm starts for the very first time, or whenever the underlying machine reboots.

I will focus on the first-time start only. The method has grown to contain a lot of logic and dependencies that stretch far beyond the actual start routine. The current `_on_start` only really finishes once TLS is available and configured, and the peer cluster manager details are present. These two pieces of information have their own routines and could live somewhere else.

Below is my review of the method's code, with comments for each section right underneath it.


    def _on_start(self, event: StartEvent):  # noqa C901   ## <<<<<<< this MUST go :(
        """Triggered when on start. Set the right node role."""

        def cleanup():
            if self.peers_data.get(Scope.APP, "security_index_initialised"):
                # in the case where it was on WaitingToStart status, event got deferred
                # and the service started in between, put status back to active
                self.status.clear(WaitingToStart)
                self.status.clear(PClusterNoDataNode)

So, the first part checks whether `security_index_initialised` is set and executes two status clean-ups. This logic can be moved down to the point where we set this flag in the peer databag.
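One way to sketch that move (a minimal, self-contained sketch: `PeersData`, `StatusHandler`, and the helper name `mark_security_index_initialised` are stand-ins for illustration, not the charm's real API):

```python
from enum import Enum


class Scope(Enum):
    APP = "app"
    UNIT = "unit"


class PeersData:
    """Stand-in for the charm's peer databag wrapper."""

    def __init__(self):
        self._data = {Scope.APP: {}, Scope.UNIT: {}}

    def put(self, scope, key, value):
        self._data[scope][key] = value

    def get(self, scope, key, default=None):
        return self._data[scope].get(key, default)


class StatusHandler:
    """Stand-in that records which statuses were cleared."""

    def __init__(self):
        self.cleared = []

    def clear(self, status):
        self.cleared.append(status)


def mark_security_index_initialised(peers_data, status):
    """Set the flag and clear the dependent statuses in one place,
    instead of re-checking the flag on every _on_start run."""
    peers_data.put(Scope.APP, "security_index_initialised", True)
    status.clear("WaitingToStart")
    status.clear("PClusterNoDataNode")
```

With this shape, `_on_start` no longer needs the `cleanup()` closure at all.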

The next part:

            # cleanup bootstrap conf in the node if existing
            if self.peers_data.get(Scope.UNIT, "bootstrap_contributor"):
                self._cleanup_bootstrap_conf_if_applies()

Although `bootstrap_contributor` is part of the initialization logic, and hence makes sense to be handled at start, I'd recommend a separate event dedicated to this config clean-up. The reason is that we can then have a simpler event handler that either cleans up the config and restarts, or defers itself.
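The "clean up and restart, or defer" decision that dedicated handler would make can be sketched as a pure function (names are assumptions; the real handler would defer the ops event and call the charm's actual cleanup/restart methods):

```python
def bootstrap_cleanup_action(is_bootstrap_contributor: bool, node_is_up: bool) -> str:
    """Decide what the dedicated bootstrap-cleanup event handler should do.

    Returns one of:
      "noop"                -- this unit never contributed bootstrap config
      "defer"               -- node not up yet, retry on a later event
      "cleanup_and_restart" -- safe to remove the bootstrap conf now
    """
    if not is_bootstrap_contributor:
        return "noop"
    if not node_is_up:
        return "defer"
    return "cleanup_and_restart"
```

Keeping the decision this small is the point: the handler has exactly one job, so a deferral of it cannot block any unrelated start logic.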

        if self.opensearch.is_node_up():
            cleanup()
            return

This should not exist in the start handler, as we can only start the service after TLS setup. The `is_node_up()` logic should be removed entirely: (1) the status cleanup happens at peer events; and (2) the bootstrap config clean-up becomes a separate event that only handles that specific task.

        elif (
            self.peers_data.get(Scope.UNIT, "started")
            and "cluster_manager" in self.opensearch.roles
            and not self.opensearch.is_service_started()
        ):
            # Routine executed only once at node reboot
            ...            

There is no defer here, and it is only executed after a reboot.

        # apply the directives computed and emitted by the peer cluster manager
        if not self._apply_peer_cm_directives_and_check_if_can_start():
            event.defer()
            return

This logic can only run once the deployment description is available. Hence, it should be executed from the peer-cluster or peer relation events instead.

        if not self.is_admin_user_configured() or not self.tls.is_fully_configured():
            if not self.model.get_relation("certificates"):
                status = BlockedStatus(TLSRelationMissing)
            else:
                status = MaintenanceStatus(
                    TLSNotFullyConfigured
                    if self.is_admin_user_configured()
                    else AdminUserNotConfigured
                )
            self.status.set(status)
            event.defer()
            return

I think we should keep this logic, but simplify it: (1) if `not self.is_admin_user_configured()`, set its status; and (2) only if check (1) passes, check `not self.tls.is_fully_configured()` and set the appropriate status.

        self.status.clear(AdminUserNotConfigured)
        self.status.clear(TLSNotFullyConfigured)
        self.status.clear(TLSRelationMissing)

These clean-ups should be executed by their own routines (i.e. whoever sets the admin user, configures TLS, and checks for the TLS relation, respectively). `_on_start` then only runs an if/elif/else check and sets the most pressing status at that moment.
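That if/elif/else check could look roughly like this (a sketch: the function name and string labels are hypothetical, the real code would return the charm's status constants and wrap them in `BlockedStatus`/`MaintenanceStatus`):

```python
from typing import Optional


def most_pressing_status(
    has_tls_relation: bool,
    admin_user_configured: bool,
    tls_fully_configured: bool,
) -> Optional[str]:
    """Return the single most pressing status label, or None if ready."""
    if not has_tls_relation:
        return "TLSRelationMissing"      # blocked: relation missing entirely
    if not admin_user_configured:
        return "AdminUserNotConfigured"  # maintenance: admin user pending
    if not tls_fully_configured:
        return "TLSNotFullyConfigured"   # maintenance: TLS still in progress
    return None
```

Because each prerequisite routine clears its own status when it completes, `_on_start` never needs to clear anything itself, only to report the current blocker.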

        # Since system users are initialized, we should take them to local internal_users.yml
        # Leader should be done already
        if not self.unit.is_leader():
            self._purge_users()
            for user in OpenSearchSystemUsers:
                self._put_or_update_internal_user_unit(user)

        # configure clients auth
        self.opensearch_config.set_client_auth()

This should be executed by the routines responsible for user handling, not here.

        deployment_desc = self.opensearch_peer_cm.deployment_desc()
        # only start the main orchestrator if a data node is available
        # this allows for "cluster-manager-only" nodes in large deployments
        # workflow documentation:
        # no "data" role in deployment desc -> start gets deferred
        # when "data" node joins -> start cluster-manager via _on_peer_cluster_relation_changed
        # cluster-manager notifies "data" node via refresh of peer cluster relation data
        # "data" node starts and initializes security index
        if (
            deployment_desc.typ == DeploymentType.MAIN_ORCHESTRATOR
            and not deployment_desc.start == StartMode.WITH_GENERATED_ROLES
            and "data" not in deployment_desc.config.roles
        ):
            self.status.set(BlockedStatus(PClusterNoDataNode))
            event.defer()
            return

        # request the start of OpenSearch
        self.status.set(WaitingStatus(RequestUnitServiceOps.format("start")))

        # if this is the first data node to join, start without getting the lock
        ignore_lock = (
            "data" in deployment_desc.config.roles
            and self.unit.is_leader()
            and deployment_desc.typ == DeploymentType.OTHER
            and not self.peers_data.get(Scope.APP, "security_index_initialised", False)
        )
        self._start_opensearch_event.emit(ignore_lock=ignore_lock)

So the entire logic above decides whether we will restart with or without the lock. We should move this logic to the points where we call the restart itself, and remove this restart call here, as it only makes sense once TLS is already available.
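The `ignore_lock` decision itself is a pure predicate, so it can travel with the restart call wherever that ends up. Sketched with plain values (parameter names are assumptions standing in for the charm's `deployment_desc` and peer-databag lookups):

```python
def should_ignore_lock(
    roles: list[str],
    is_leader: bool,
    deployment_type: str,
    security_index_initialised: bool,
) -> bool:
    """The first data node of a non-orchestrator app starts without the lock."""
    return (
        "data" in roles
        and is_leader
        and deployment_type == "OTHER"
        and not security_index_initialised
    )
```

Any call site that emits the start event can then compute `ignore_lock` locally, instead of `_on_start` carrying the whole condition.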

syncronize-issues-to-jira[bot] commented 3 weeks ago

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5583.

This message was autogenerated