Description

the admission-webhook pebble service does not start, with the error:

config=main.Config{CertFile:"/etc/webhook/certs/cert.pem", KeyFile:"/etc/webhook/certs/key.pem"} Error: open /etc/webhook/certs/cert.pem: no such file or directory
2024-02-13T11:53:36.729Z [container-agent] 2024-02-13T11:53:36Z ERROR cannot start service: exited quickly with code 255

The webhook service needs to find the certificate files (cert.pem and key.pem) in the workload container directory /etc/webhook/certs. When these cert files are not found, this puts the workload into an usable state, where the webhook service is not up. This is problematic because users cannot create Notebooks, as reported in the issue.

Case where the flow works correctly

In the charm, we upload the certificates to the container on pebble-ready event, so the event sequence for this to be successful is:

install -> leader_elected -> admission_webhook_pebble_ready -> config_changed the admission_webhook_pebble_ready handler is: https://github.com/canonical/admission-webhook-operator/blob/0707be37a97d2f52b785caf670ec98aa66d97ce9/src/charm.py#L305-L311 where on_event method is: https://github.com/canonical/admission-webhook-operator/blob/0707be37a97d2f52b785caf670ec98aa66d97ce9/src/charm.py#L325-L346

and the config_changed handler is set to on_event

the key thing to observe here is when the first call of update_layer is, because that is when the service is attempted to start. chisme's update_layer is where the Pebble service is getting started if the pebble service definition was updated, the method in chisme:

def update_layer(container_name: str, container: Container, new_layer: Layer, logger: Logger):
    """Updates the Pebble configuration layer if changed.

    Args:
        container_name (str): The name of the container to update layer.
        container (ops.model.Container): The container object to update layer.
        new_layer (ops.pebble.Layer): The layer object to be updated to the container.
        logger (logging.Logger): A logger to use for logging.
    """
    if not container.can_connect():
        raise ErrorWithStatus("Waiting for pod startup to complete", MaintenanceStatus)

    current_layer = container.get_plan()

    if current_layer.services != new_layer.services:          # HERE: this is the check referred to
        container.add_layer(container_name, new_layer, combine=True)
        try:
            logger.info("Pebble plan updated with new configuration, replanning")
            container.replan()
        except ChangeError:
            logger.error(traceback.format_exc())
            raise ErrorWithStatus("Failed to replan", BlockedStatus)

the code flow would be:

pebble_ready handler executes -> certificates are pushed to container -> on_event is called -> update_layer first call so the service is started correctly
config_changed handler executes on_event -> update_layer is called for the second time -> services equivalence check does not pass, so the service is not restarted

Cases where it can go wrong

Case 1

If update_layer gets called for the first time when the cert files are not yet uploaded, this can happen when the following conditions are true:

any event that calls update_layer gets fired before pebble_ready
container.can_connect() returns True, so the update_layer function does not raise an Error and proceeds to execute container.replan() which starts the service

We know that condition 1. is possible for charm execution, for example config_changed can happen before pebble_ready
For condition 2. to apply, this means that pebble_ready event was fired as well, BUT there is no guarantee that pebble_ready handler has executed, it is a possible race condition here.

Race condition

the error-prone scenario is like this:

config_changed is fired
config_changed handler execution starts
before config_changed handler execution reaches the container.can_connect() check in update_layer, the container becomes ready
pebble_ready event is queued, but the config_changed handler is in the middle of execution so it continues
config_changed handler container.can_connect() check in update_layer returns True
config_changed handler adds the layer and calls container.replan()
the error is hit because the cert files are not found
pebble_ready handler executes
the service equivalence check if current_layer.services != new_layer.services: returns False because the service is the same
pebble_ready handler skips adding the layer and container.replan() i.e. it does not restart the service

At this state, the charm is unable to recover from the error, because every handler will skip restarting the service.

Case 2

If the pebble_ready handler executes, but somehow the cert files are not uploaded. Initially, we thought this is possible if the pebble push API was non-blocking, meaning that the charm continues to the next handler before the files have completed copying. To eliminate this possibility, I tested with pushing files of different sizes to the container while calculating the time it took to push them. I observed that bigger files take more time to execute. Also, I added a check that pebble pull API does not raise an error right after the push call. so concluding that this case is unlikely.

syncronize-issues-to-jira[bot] commented 6 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5376.

This message was autogenerated

NohaIhab commented 6 months ago

Health checks

the admission-webhook charm has a pebble health check in place:

    def _admission_webhook_layer(self) -> Layer:
        """Create and return Pebble framework layer."""
        layer_config = {
            "summary": "admission-webhook layer",
            "description": "Pebble config layer for admission-webhook-operator",
            "services": {
                self._container_name: {
                    .....
                    # Service details
                    .....
                    "on-check-failure": {"admission-webhook-up": "restart"},
                },
            },
            "checks": {
                "admission-webhook-up": {
                    "override": "replace",
                    "period": "30s",
                    "timeout": "20s",
                    "threshold": 4,
                    "tcp": {"port": self._port},
                }
            },
        }

If functioning correctly, the pebble service should be restarted when the service port is not listening on self._port From what we see in the logs, the check does fail, but it is not restarting the service. This is unexpected behavior from Pebble.

Issues with the pebble check

we see in the logs Check "admission-webhook-up" failure threshold 4 hit, triggering action when the threshold is hit for the first time, and yet the service is not restarted.
we see this log Check "admission-webhook-up" failure threshold 4 hit, triggering action only the first time that the threshold is hit, then all the other reported failures do not log this.

NohaIhab commented 6 months ago

Proposed fix

to prevent Case 1 scenario discussed above from happening, a check that the cert files exist in the container can be added before going into update_layer. If the cert files don't exist, we can just log that and do nothing, knowing that pebble_ready handler will be triggered later and the certs will be uploaded then. fix now in PR #125

NohaIhab commented 5 months ago

closing since this was fixed in #125 and backported in #126

canonical / admission-webhook-operator

pebble service cannot start with `Error: open /etc/webhook/certs/cert.pem: no such file or directory` #124