canonical / pebble

Take control of your internal daemons!
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0
143 stars 54 forks source link

feat(overlord): reversed order of manager stop #447

Closed Ardelean-Calin closed 1 month ago

Ardelean-Calin commented 1 month ago

Currently managers are started and stopped in a FIFO (first in, first out) order, but I would expect managers to behave more like a stack, in that they are started and stopped in a LIFO (last in first out) order. So the stopping of the managers should be done in reverse.

See this image: image

Why? Because what if manager M2 started after manager M1 depends on some functionality from M1? Stopping M1 before stopping M2 might lead to unexpected and undefined behavior.

I chose to implement this behavior with the defer keyword as it behaves in a LIFO manner, but this could also be done by reverse iterating over the slice. I found the defer approach to be more concise, but feel free to correct me.

@benhoyt could you please take a look at this and advise me if I'm missing something?

flotter commented 1 month ago

@Ardelean-Calin Just a note here that the state engine is cross team maintained between pebble, snapd and some other teams (workspace engineering @dmitry-lyfar), so we should make sure the changes we make is double checked with them as well, otherwise a future cross-repo sync of the state code could slip in some changes that break existing critical functioning systems.

Note that the TaskRunner state-engine entry is explicitly done last for startup, and stop today. The task runner is special in that stopping it will kill and wait for all task tombs in the system to exit (block). Inverting this order will place task kill & wait first, which means state engine managers interacting with tasks will possibly observe a slightly different pattern, where a task can die not by the manager's own intent, but by the Task runner shutting down. This could be no issue, but I am pointing this out just to say that this very trivial change could expose a different execution path for snapd / pebble on shutdown, for interaction between managers and the task runner. Probably just needs some thorough testing on existing code bases.

dmitry-lyfar commented 1 month ago

@Ardelean-Calin could you give an example of such a M1->M2 dependency that justifies the order change? My understanding is that managers interact solely via the state but do not depend on each other. Except for the mentioned case of the taskrunner that must be added last.

benhoyt commented 1 month ago

This does seem like a reasonable idea to me. FWIW, I've tried making this change on the snapd codebase, and the go test tests still seem to pass (though there were a lot of failing tests due to me not having installed a bunch of deps, but I don't think that's related). I've asked the snapd team what they think and I'll report back.

thp-canonical commented 1 month ago

My understanding is that managers interact solely via the state but do not depend on each other.

The linked "solely via the state" quote seems to relate to .Ensure() calls at runtime, not necessarily for shutdown time (this quote is from Pebble to get nice inline source snippets in this comment, but snapd has the same text, as the two codebases are related):

https://github.com/canonical/pebble/blob/35e2b7cf6ce5be1c3f8b803b281ad8697e8c14a5/internals/overlord/stateengine.go#L58-L61

Of course, there could be some hidden assumptions and code relying on the current Stop() call order, so we have to be careful here, but in general, teardown is usually in reverse order of setup calls (compare also: C++ inheritance and member destruction; defer in Go; RAII and destruction order of local (stack) variables in C++; getting onto a bus and sitting down, followed by standing up first before getting off the bus).

dmitry-lyfar commented 1 month ago

@thp-canonical I'm not opposing the order's correctness but rather changing it and introducing such coupling that you have to depend on order in addition to already existing dependency on the taskrunner's order.

pedronis commented 1 month ago

I'm not against this. snapd is quite deliberate in the order in which it initializes managers, I don't think we do anything that requires precise ordering of Stops, but as I said it shouldn't be a problem either. You'll have to write precise tests though to make sure your managers are initialized in the order you need, as otherwise it seems this can get fragile. As people have mentioned the original idea was that managers would be orthogonal to each other, in practice some managers depend on others and that is setup by passing one to the other at initialization but then as you have seen the StateEngine is not aware of these except indirectly by seeing the AddManager order.

benhoyt commented 1 month ago

Okay, with Samuele being "not against this", I'd like to merge this. Managers shouldn't generally be dependent on each other, but stopping them in reverse order definitely seems least surprising (I've added a comment to that effect).

I've also modified the code to directly iterate through the slice in reverse, rather than using defer in a loop to do the reversal (that is cute but seems less obvious to me).

As a drive-by I tweaked the (existing) debug log message to use our usual formatting: start with a capital letter, full phrases/sentences.

Ideally after we merge we can put up a similar PR against snapd.

Any final thoughts before I merge, @Ardelean-Calin or others?

Ardelean-Calin commented 1 month ago

Sorry for missing this thread! I don't have any more comments, will take care of the snapd tests. 😀 Thanks @benhoyt for the additional changes and feedback!