Scaling scrape config results in incorrect status

balbirthomas commented 2 years ago

Issue

Deploy Prometheus, Scrape config and some client charm then relate them appropriately
Now scale scrape config charm to have more than 1 unit
Expected Behaviour : All the units of scrape config should have the same status
Actual Behaviour: The new unit does not attain active status

balbirthomas commented 2 years ago

Reviewing the code and commit logs now I see

    def _on_metrics_provider_relation_broken(self, event):
        """Block the charm when no charms contribute scrape jobs."""
        if not self.unit.is_leader():
            self.unit.status = WaitingStatus("inactive unit")
            return

and

    def _update_new_metrics_consumer(self, event):
        """Set Prometheus scrape configuration for all targets."""
        logger.debug("Updating new metrics consumer")

        if not self.unit.is_leader():
            self.unit.status = WaitingStatus("inactive unit")
            return

and so on. So it seems in the past a decision was made that if a unit is not a learder unit then it will do nothing and later on this behavior was extended to ensure non leader units set WaitingStatus with message "inactive unit". This does make sense since Prometheus scrape config is meant to be an intermediary charm and the value of scaling it can be questionable.

In view of the above

shall we consider closing this bug with perhaps an integration test to assert this behavior.
or alternatively shall we allow scaling this charm which also implies that the Metrics consumer will receive multiple identical Jobs

Abuelodelanada commented 2 years ago

What side effects will produce that the Metrics consumer will receive multiple identical Jobs??

balbirthomas commented 2 years ago

What side effects will produce that the Metrics consumer will receive multiple identical Jobs??

There is de-duplication of scrape jobs (at least names) that @dstathis implemented for the Metrics consumer (Prometheus charm). Also as it stands if the leader guards shown above are removed non-leader units will produce an error when they try to set application relation data (the scrape jobs).

However all these issues can be worked around if there is a desire to do so. The key question is do we want to do so, and what do we gain by doing so.

balbirthomas commented 2 years ago

A branch with integration test asserting that non leader units set waiting status has been created here. This branch depends on addition of an alert rule to Zinc k8s charm for which another PR has been created here.

canonical / prometheus-scrape-config-k8s-operator

Scaling scrape config results in incorrect status #9

Issue