canonical / operator

Pure Python framework for writing Juju charms
Apache License 2.0

How to test charms that rely on Container.can_connect? #691

Closed. mmanciop closed this issue 2 years ago.

mmanciop commented 2 years ago

It seems to me that Container.can_connect will always return True within the test harness. That requires cumbersome mocking in test code. I would rather expect that Container.can_connect returns False until one explicitly triggers the pebble-ready event for that container through the Harness, although we need to be careful to correctly handle the queue of deferred events that may be waiting precisely for Container.can_connect to return True.
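
For context, simulating the "cannot connect" case today roughly means patching the method; a minimal sketch, assuming a hypothetical MyCharm class and a config-changed handler:

```python
# Sketch of the mocking currently needed to simulate an unreachable container.
# MyCharm is a hypothetical charm class; only the patching pattern matters.
from unittest.mock import patch

from ops.testing import Harness

from charm import MyCharm


def test_handler_copes_when_pebble_not_ready():
    harness = Harness(MyCharm)
    harness.begin()
    # Patch can_connect to False, since the harness otherwise reports the
    # container as reachable.
    with patch("ops.model.Container.can_connect", return_value=False):
        harness.charm.on.config_changed.emit()
        # ...assert that the charm deferred / did not touch Pebble...
```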

pengale commented 2 years ago

Interesting. Doing some spelunking in the code, I don't see any place where we're explicitly mocking out Container.can_connect. I think that it Just Works based on the fake pebble implementation.

If you do mean things to the fake Pebble, do you get the False that you'd expect from .can_connect in your tests?

mmanciop commented 2 years ago

I think there would be value in:

  1. Have Container.can_connect return False by default.
  2. Have Container.can_connect return True after Harness.container_pebble_ready is invoked for the respective container (a test along the lines of points 1 and 2 is sketched below).
  3. Provide ergonomic means (that is: no @patch :P) to make Container.can_connect return True for a specific container via a helper method (maybe on Harness), so that one can programmatically test scenarios in which an event is deferred and the queue is triggered by the pebble_ready event. Although, for this last one, tbh I think the real issue behind my wish might lie in the very limited control we have over the event queue in Harness.
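
A hypothetical sketch of how points 1 and 2 could look in a test, assuming the proposed behaviour (not the current harness) and a placeholder MyCharm charm with a "workload" container:

```python
# Hypothetical sketch of the proposed behaviour: can_connect starts out False
# and flips to True once pebble-ready is triggered. MyCharm and the "workload"
# container name are placeholders.
from ops.testing import Harness

from charm import MyCharm


def test_can_connect_follows_pebble_ready():
    harness = Harness(MyCharm)
    harness.begin()
    container = harness.charm.unit.get_container("workload")

    # Point 1: the container is not reachable by default.
    assert not container.can_connect()

    # Point 2: triggering pebble-ready makes it reachable.
    harness.container_pebble_ready("workload")
    assert container.can_connect()
```
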
pengale commented 2 years ago

After some discussion, I think that the best approach is something like the following:

1. By default, Container.can_connect returns True. That's in keeping with how the rest of the harness works, where everything is in a "good" state by default.
2. We add something like a "set_container_state" routine to the relevant fixtures, so that you can do something like "set_container_state(ready=False)", then run the test that is meant to handle the case where the container is not ready.

Why do things this way, rather than the other way around? Well, the harness is meant to save you writing boilerplate. Most of your tests will probably rely on the case where .can_connect returns True, and you'd probably prefer to write no code in those cases, explicitly setting the pebble_ready state only when you want to.

As a bonus, this discourages charm authors from assuming that Container.can_connect will always return True so long as some set of steps have completed. In production, Kubernetes might reschedule your pod, flipping .can_connect back to False while your charm is busy trying to do other things. A test suite that only takes into account the initial setup will not guard against this, and your charm will be more brittle as a result.

mmanciop commented 2 years ago

I could live with this proposal, although I then expect most charms to simply continue not testing the “unhappy path” of pebble not being ready yet. There is a lot of complexity to handle in the non-deterministic ordering of events that occur to a pod that is scheduled to replace one that crashed and got churned. Admittedly, however, this is a much larger discussion than the matter at hand.

> In production, Kubernetes might reschedule your pod, flipping .can_connect back to False while your charm is busy trying to do other things.

I am not sure this can happen: the charm code lives in the pod itself. If K8s churns the pod, that entire charm unit is gone, together with the code running any hook. If a workload container (i.e., a container listed in metadata.yaml) crashes, the whole pod is gone, so its behavior is more like Harness.begin_with_initial_hooks() than it is flipping the can_connect state. Moreover, I have seen situations in the past where pebble simply bricked, but that is not something a unit actually recovers from.

jnsgruk commented 2 years ago

Well, in the case of a Pod being rescheduled then you would indeed lose the charm and can_connect would cease to be your issue 😉

However: it is possible for a workload container to crash halfway through a hook execution, or somehow become unavailable, and that is the case the can_connect method was designed to catch.

I think having a set_container_reachable('container', True) or similar in the harness, as we do for set_planned_units, should be fine. Personally I have just been mocking can_connect, but providing a handrail seems reasonable.

mmanciop commented 2 years ago

I think we need to be more specific about what “container crash” means. If the pebble process dies in a workload container, the entire pod, including the charm container, is gone immediately (iirc one can set a termination grace period, but Juju does not, so…). This is because the workload containers are not “optional”, and pebble is the command the container runs and the process the container runtime predicates container liveness on. As an experiment, run microk8s.kubectl exec <unit> -n <namespace> -c <workload_container> -- pkill pebble and see just how fast the entire pod kicks the bucket :-)

I find can_connect super useful in the following situation: very early in the life of the pod, my charm unit receives an event whose hook needs to do something through pebble in one or more workload containers. We do not know whether the pebble-ready event has fired yet, but can_connect acts as a surrogate for it. This is especially true when we decide to defer events because pebble is not ready yet. When pebble does indeed become ready, the deferred hooks will run first, and being able to check the readiness of pebble through can_connect at that point is just the thing we need.
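
For illustration, that defer-until-ready pattern looks roughly like this in charm code (a minimal sketch; MyCharm, the "workload" container name, and the choice of config-changed as the triggering event are placeholders):

```python
# Minimal sketch of the defer-until-pebble-is-ready pattern described above.
# The container name and event wiring are illustrative, not from a real charm.
from ops.charm import CharmBase


class MyCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.config_changed, self._reconfigure)
        # Re-run the same logic once Pebble in the workload container is up.
        self.framework.observe(self.on.workload_pebble_ready, self._reconfigure)

    def _reconfigure(self, event):
        container = self.unit.get_container("workload")
        if not container.can_connect():
            # Pebble may not be up yet this early in the pod's life;
            # defer and let pebble-ready re-trigger the deferred hook.
            event.defer()
            return
        # ...interact with Pebble here: push files, add layers, restart services...
```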

mmanciop commented 2 years ago

P.S. When I spoke about a container being “optional”, I was referring to the equivalent of a Fargate Task being marked as not essential. I was sure K8s would allow marking containers as “if this crashes, don’t churn the entire pod”, but I must have dreamed that up.

pengale commented 2 years ago

Just to be clear: your pod does go away pretty quickly, and then comes back pretty quickly. But the more you scale an app, the more you increase your chances that the teardown or setup will happen in the middle of a hook execution. I think that charms should always be written to robustly handle a service disappearing, or otherwise being unexpectedly out of commission for a little while, and this is true even of internal services like Pebble.

While the Operator Framework does go to some lengths to protect authors from having to think about things like this, I think that it's appropriate for the tests to push folks towards testing not just the happy path, with its somewhat well defined unhappy moments during the initial hooks run, but also all the unhappy paths that a large and busy cloud may run into.

There has been some discussion about how to actually define some common unhappy path scenarios, and provide fixtures in the test harness to highlight them, and encourage testing them. @rwcarlsen started a "big brain" issue here: https://github.com/canonical/operator/issues/696, and your comments would be welcome!

mmanciop commented 2 years ago

> Just to be clear: your pod does go away pretty quickly, and then comes back pretty quickly.

That, it totally does. But in the process of coming back, it invalidates a bunch of (overly optimistic or simplistic) assumptions by the charm author. It is very seldom that I cannot break a charm the first time I play with it using kubectl delete pod (another great source of bugs is upgrade hooks :-) ). Although, to be fair (and admittedly going off on a tangent here), I am under the impression that the biggest source of bugs in upgrade and pod-churn scenarios is the non-deterministic, non-trivial set of events that can be fired depending on which relations are already established with the Juju app.

rwcarlsen commented 2 years ago

Well, if we expect the harness to operate on the "happy path" by default: in that scenario in the real Juju world, can_connect is generally false until the pebble-ready hook runs. So perhaps we could consider adding that sort of behavior.

mmanciop commented 2 years ago

> Well, if we expect the harness to operate on the "happy path" by default: in that scenario in the real Juju world, can_connect is generally false until the pebble-ready hook runs. So perhaps we could consider adding that sort of behavior.

I have very complex feelings about the "happy path" in relation to a test harness for event-driven state machines with implicit state. Based on months of finding serious bugs just a hair off the happy path, I can state in no uncertain terms that it lulls people into a false sense of security.

This being said, making can_connect return False until pebble-ready has run is in my eyes both an excellent idea and something that may break people's tests. Minimal due diligence via a cursory search yields a handful of mocks in tests (good, those devs know that something's up :D) but, overall, a conspicuous general absence of can_connect in test code. So I am +100 on this.