canonical / pebble

Take control of your internal daemons!
GNU General Public License v3.0
136 stars 51 forks source link

Cached state will not exist for persisted check after a reboot #414

Open flotter opened 2 months ago

flotter commented 2 months ago
[unrecovered-panic] runtime.fatalpanic() .snap/go/10584/src/runtime/panic.go:1188 (hits goroutine(1):1 total:1) (PC: 0x43df84)
Warning: debugging optimized function
    runtime.curg._panic.arg: interface {}(string) "interface conversion: interface {} is nil, not *plan.Check"
(dlv) bt
 0  0x000000000043df84 in runtime.fatalpanic
    at .snap/go/10584/src/runtime/panic.go:1188
 1  0x000000000043d759 in runtime.gopanic
    at .snap/go/10584/src/runtime/panic.go:1017
 2  0x0000000000ac2d7c in github.com/canonical/pebble/internals/cli.Run.func1
    at .root/go/pkg/mod/github.com/canonical/pebble@v1.10.1-0.20240424033813-4e93e7b13965/internals/cli/cli.go:298
 3  0x000000000043d610 in runtime.gopanic
    at .snap/go/10584/src/runtime/panic.go:920
 4  0x000000000040c685 in runtime.panicdottypeE
    at .snap/go/10584/src/runtime/iface.go:263
 5  0x0000000000994ce5 in github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged
    at .root/go/pkg/mod/github.com/canonical/pebble@v1.10.1-0.20240424033813-4e93e7b13965/internals/overlord/checkstate/manager.go:125
 6  0x00000000009e980f in github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged-fm
    at <autogenerated>:1
 7  0x00000000009ce9fc in github.com/canonical/pebble/internals/overlord/planstate.(*PlanManager).planChanged
    at .root/go/pkg/mod/github.com/canonical/pebble@v1.10.1-0.20240424033813-4e93e7b13965/internals/overlord/planstate/manager.go:87
 8  0x00000000009ce690 in github.com/canonical/pebble/internals/overlord/planstate.(*PlanManager).Load
    at .root/go/pkg/mod/github.com/canonical/pebble@v1.10.1-0.20240424033813-4e93e7b13965/internals/overlord/planstate/manager.go:67
 9  0x00000000009e5949 in github.com/canonical/pebble/internals/overlord.New
    at .root/go/pkg/mod/github.com/canonical/pebble@v1.10.1-0.20240424033813-4e93e7b13965/internals/overlord/overlord.go:203
10  0x0000000000a0b4c8 in github.com/canonical/pebble/internals/daemon.New
    at .root/go/pkg/mod/github.com/canonical/pebble@v1.10.1-0.20240424033813-4e93e7b13965/internals/daemon/daemon.go:786
:
13  0x00000000005866d2 in github.com/canonical/go-flags.(*Parser).ParseArgs
    at .root/go/pkg/mod/github.com/canonical/go-flags@v0.0.0-20230403090104-105d09a091b8/parser.go:335
14  0x00000000005854c5 in github.com/canonical/go-flags.(*Parser).Parse
    at .root/go/pkg/mod/github.com/canonical/go-flags@v0.0.0-20230403090104-105d09a091b8/parser.go:191
15  0x0000000000a9f78b in github.com/canonical/pebble/internals/cli.Run
    at .root/go/pkg/mod/github.com/canonical/pebble@v1.10.1-0.20240424033813-4e93e7b13965/internals/cli/cli.go:327
16  0x0000000000acd97f in main.main
    at .home/flotter/XXXXX/cmd/XXXXX/main.go:90
17  0x0000000000440327 in runtime.main
    at .snap/go/10584/src/runtime/proc.go:267
18  0x00000000004717a1 in runtime.goexit
    at .snap/go/10584/src/runtime/asm_amd64.s:1650
   120:             if change.Kind() == performCheckKind {
   121:                 configKey = performConfigKey{change.ID()}
   122:             } else {
   123:                 configKey = recoverConfigKey{change.ID()}
   124:             }
=> 125:             oldConfig := m.state.Cached(configKey).(*plan.Check) // panic if key not present (always should be)
   126:             existingChecks[oldConfig.Name] = true
   127: 
   128:             newConfig, inNew := newPlan.Checks[details.Name]
   129:             if inNew {
   130:                 merged := mergeServiceContext(newPlan, newConfig)
flotter commented 2 months ago

Discussed with @benhoyt and will propose a couple of PRs soon.

  1. Plan Manager is notifying all managers of a plan change before the state engine has started (overlord loop). This means managers have no opportunity to remove leftover persisted tasks during manager StartUp before the manager's change handler starts processing the new plan.

  2. Gracefully handle missing Cached lookups, as this almost certainly means a leftover check from the state engine started running after a reboot. In this case we likely do not have any undo tasks associated with the handlers in question, so we could simply call Abort on them.

flotter commented 2 months ago

This could be fix for this issue identified above, but I would still like to look at improving the Plan Manager so that it does not propagate a plan change before managers can really cope with it (before the state engine is ready).

https://github.com/canonical/pebble/pull/415