canonical / pebble

Pebble is a lightweight Linux service manager with layered configuration and an HTTP API.
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0

Could it make sense to add official support for changes/tasks with shorter lifetimes (process lifetime / boot lifetime) #432

Open flotter opened 3 months ago

flotter commented 3 months ago

Today, once a change/task set is added to the state engine in Pebble, the state engine keeps trying to complete it indefinitely. It does not give up until the status satisfies Ready(), even after the Pebble process restarts or the machine reboots (as long as the state file is persisted and not deleted).

Any status not listed below means the task will continue, I believe.

// Ready returns whether a task or change with this status needs further
// work or has completed its attempt to perform the current goal.
func (s Status) Ready() bool {
    switch s {
    case DoneStatus, UndoneStatus, HoldStatus, ErrorStatus:
        return true
    }
    return false
}

While reviewing Checks, and while working on another overlord manager, I have seen cases where the job a change performs relates to a resource that may no longer exist at some point in the future (e.g. after a restart).

Examples:

  1. Imagine that, as a result of an HTTP API request or Pebble client command, some data is sent to the Pebble daemon, which processes this data stream through a change. If the machine is powered down mid-operation and powered up a day later, the original context of the change is no longer relevant.

  2. Imagine a state machine performing hardware-related actions using changes. A hardware device such as a USB device is connected, which triggers work to be performed as a change. If an interruption occurs (power cut / crash) and the machine restarts with the hardware no longer attached, the original context of the change is no longer relevant.

I am wondering whether it could make sense to better support this in the state code.

For example, this could be a type of Change/Task, or a Change/Task attribute, where the state engine understands both the boot context (to distinguish between reboots) and the process context (to distinguish between process restarts), and automatically cancels changes/tasks whose context no longer applies (without triggering undo, which also no longer makes sense).
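To make the idea concrete, here is a minimal sketch of what a lifetime attribute could look like. Everything in it is hypothetical: the Lifetime, Context and Task types and the AbortExpired function are stand-ins invented for illustration and are not part of Pebble's state package. Statuses are plain strings here for the same reason.

```go
package main

import "fmt"

// Lifetime is a hypothetical attribute describing how long a task stays relevant.
type Lifetime int

const (
	Persistent    Lifetime = iota // default: survives process restarts and reboots
	BootScoped                    // only relevant within the boot that created it
	ProcessScoped                 // only relevant within the process that created it
)

// Context identifies the boot and the process in which a task was created.
// ProcessID is assumed to be a token generated at daemon start, not a PID.
type Context struct {
	BootID    string
	ProcessID string
}

// Task is a simplified stand-in for a state-engine task.
type Task struct {
	ID       string
	Lifetime Lifetime
	Created  Context
	Status   string // "doing", "done", "aborted", ...
}

// Expired reports whether the context the task was created in no longer applies.
func (t *Task) Expired(current Context) bool {
	switch t.Lifetime {
	case BootScoped:
		return t.Created.BootID != current.BootID
	case ProcessScoped:
		return t.Created.BootID != current.BootID ||
			t.Created.ProcessID != current.ProcessID
	}
	return false
}

// AbortExpired marks unfinished, expired tasks as aborted without running any
// undo step, and returns the IDs of the tasks it aborted.
func AbortExpired(tasks []*Task, current Context) []string {
	var aborted []string
	for _, t := range tasks {
		if t.Status == "doing" && t.Expired(current) {
			t.Status = "aborted"
			aborted = append(aborted, t.ID)
		}
	}
	return aborted
}

func main() {
	old := Context{BootID: "boot-a", ProcessID: "proc-1"}
	tasks := []*Task{
		{ID: "1", Lifetime: Persistent, Created: old, Status: "doing"},
		{ID: "2", Lifetime: BootScoped, Created: old, Status: "doing"},
		{ID: "3", Lifetime: ProcessScoped, Created: old, Status: "doing"},
	}
	// Same boot, new process: only the process-scoped task is aborted.
	now := Context{BootID: "boot-a", ProcessID: "proc-2"}
	fmt.Println(AbortExpired(tasks, now))
}
```

A check like this would presumably run once at startup, before any task handlers are scheduled, so that expired tasks never resume.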

flotter commented 3 months ago

Another related concept that currently is difficult to deal with is ownership of changes by a manager.

Currently, if you want to find all the changes that belong to your manager, you have to search by Kind. However, any one manager may have multiple kinds of changes, so this becomes a list of kinds that has to be maintained throughout the manager code.

Perhaps this could be an opportunity to look at a change group ID (owner ID) concept. Grouping is currently achieved by setting a change or task attribute with Set, so that a lookup can be done using a string key: change.Has() or task.Has().

This logic is already used to prevent pruning: see RegisterPendingChangeByAttr.

Perhaps something like Register<limited lifetime>ChangeByAttr
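As a rough illustration of the owner ID idea, the sketch below groups changes by an attribute instead of by Kind. The Change type here is a simplified local stand-in, not Pebble's state.Change, and the "owner" key and the OwnedBy helper are assumptions made up for this example.

```go
package main

import "fmt"

// Change is a simplified stand-in for a state-engine change with attributes.
type Change struct {
	ID    string
	Kind  string
	attrs map[string]string
}

// NewChange returns a change with an empty attribute set.
func NewChange(id, kind string) *Change {
	return &Change{ID: id, Kind: kind, attrs: make(map[string]string)}
}

// Set stores an attribute on the change.
func (c *Change) Set(key, value string) { c.attrs[key] = value }

// Has reports whether the attribute is present.
func (c *Change) Has(key string) bool {
	_, ok := c.attrs[key]
	return ok
}

// Get returns the attribute value and whether it was present.
func (c *Change) Get(key string) (string, bool) {
	v, ok := c.attrs[key]
	return v, ok
}

// ownerKey is a hypothetical attribute name used to record the owning manager.
const ownerKey = "owner"

// OwnedBy returns the IDs of all changes owned by the given manager,
// regardless of their Kind, in the order they appear in changes.
func OwnedBy(changes []*Change, owner string) []string {
	var ids []string
	for _, c := range changes {
		if v, ok := c.Get(ownerKey); ok && v == owner {
			ids = append(ids, c.ID)
		}
	}
	return ids
}

func main() {
	a := NewChange("1", "perform-check")
	a.Set(ownerKey, "checkstate")
	b := NewChange("2", "recover-check")
	b.Set(ownerKey, "checkstate")
	c := NewChange("3", "start")
	c.Set(ownerKey, "servstate")

	// One lookup covers both kinds that belong to the same manager.
	fmt.Println(OwnedBy([]*Change{a, b, c}, "checkstate"))
}
```

With a single owner attribute, a manager would not need to maintain a list of its own kinds just to find its changes.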

benhoyt commented 1 month ago

Just for the record from a conversation with @flotter, after I asked him whether this is something we should discuss further:

I think the problem persists, with only the workaround currently used in pebble->checks: https://github.com/canonical/pebble/blob/5c8a9abac64f96182e0ff70f71b7801440e7abdb/internals/overlord/checkstate/manager.go#L124

In my own project, I did something slightly different (we cannot cause an undo). All my tasks have an execution context, and at StartUp I cancel running tasks that carry this property and do not match the current running context.

If we could introduce a concept similar to /proc/sys/kernel/random/boot_id in the state, and have the ability to limit the lifetime of a task to the boot duration, the state package could abort related tasks automatically at StartUp (I think it starts first). I think the mechanisms are there, but having a feature that deals with this for you would be nice.

Kind of related was task ownership. If we have to filter on task change events, it would have been nice to have a first-class concept of "show me events for my tasks". Again, the primitives required to implement this may already exist, but it could be nice to support this with some API function of state.

Happy to brainstorm if I could be of further help.