canonical / pebble

Take control of your internal daemons!
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0

Pebble is unresponsive #314

Closed: troyanov closed this issue 6 months ago

troyanov commented 11 months ago

Some odd behaviour of Pebble has been observed several times (versions 1.2.0 and 1.4.0):

The Pebble daemon starts normally, spins up a couple of services, and processes several API calls. Then, quite quickly, it falls into a limbo state (that's how I understand it from the logs).

After 05 Oct 04:14:41 there are no new log entries, and any CLI command that interacts with the daemon (e.g. pebble changes) simply hangs.

$ sudo journalctl -feu snap.maas.pebble.service -t maas.pebble
...
Oct 05 04:14:38 maas.pebble[754145]: 2023-10-05T04:14:38.235Z [pebble] Started daemon.
Oct 05 04:14:38 maas.pebble[754145]: 2023-10-05T04:14:38.470Z [pebble] POST /v1/services 225.462757ms 202
Oct 05 04:14:38 maas.pebble[754145]: 2023-10-05T04:14:38.470Z [pebble] Started default services with change 40183.
Oct 05 04:14:38 maas.pebble[754145]: 2023-10-05T04:14:38.538Z [pebble] Service "regiond" starting: sh -c "exec systemd-cat -t maas-regiond $SNAP/bin/run-regiond"
Oct 05 04:14:39 maas.pebble[754145]: 2023-10-05T04:14:39.647Z [pebble] Service "apiserver" starting: sh -c "exec systemd-cat -t maas-apiserver $SNAP/bin/run-apiserver"
Oct 05 04:14:40 maas.pebble[754145]: 2023-10-05T04:14:40.752Z [pebble] Service "rackd" starting: sh -c "exec systemd-cat -t maas-rackd $SNAP/bin/run-rackd"
Oct 05 04:14:41 maas.pebble[754145]: 2023-10-05T04:14:41.573Z [pebble] GET /v1/services?names=bind9 120.939µs 200
Oct 05 04:14:41 maas.pebble[754145]: 2023-10-05T04:14:41.574Z [pebble] GET /v1/services?names=ntp 86.887µs 200
Oct 05 04:14:41 maas.pebble[754145]: 2023-10-05T04:14:41.574Z [pebble] GET /v1/services?names=syslog 68.915µs 200
Oct 05 04:14:41 maas.pebble[754145]: 2023-10-05T04:14:41.575Z [pebble] GET /v1/services?names=http 65.915µs 200
Oct 05 04:14:41 maas.pebble[754145]: 2023-10-05T04:14:41.576Z [pebble] GET /v1/services?names=temporal 47.774µs 200
Oct 05 04:14:41 maas.pebble[754145]: 2023-10-05T04:14:41.576Z [pebble] GET /v1/services?names=temporal 47.667µs 200

Sending SIGTERM doesn't help much:

Oct 05 14:59:47 maas.pebble[754145]: 2023-10-05T14:59:47.010Z [pebble] Exiting on interrupt signal.

I was unable to strace the daemon; however, here is a strace of a pebble services call:

[pid 1831038] 13:26:08.444799 (+     0.000015) getpeername(6, {sa_family=AF_UNIX, sun_path="/var/snap/maas/31006/pebble/.pebble.socket"}, [112->45]) = 0                                      
[pid 1831038] 13:26:08.444849 (+     0.000049) futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1                                                                                                 
[pid 1831042] 13:26:08.444879 (+     0.000030) <... futex resumed>) = 0                                                                                                                       
[pid 1831042] 13:26:08.444907 (+     0.000027) futex(0xc00008c948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>                                                                                     
[pid 1831038] 13:26:08.444925 (+     0.000017) read(6,  <unfinished ...>                                                                                                                      
[pid 1831040] 13:26:08.444936 (+     0.000011) <... futex resumed>) = 0                                                                                                                       
[pid 1831042] 13:26:08.444947 (+     0.000010) <... futex resumed>) = 1                                                                                                                       
[pid 1831038] 13:26:08.444955 (+     0.000008) <... read resumed>0xc0001b7000, 4096) = -1 EAGAIN (Resource temporarily unavailable)                                                           
[pid 1831040] 13:26:08.444970 (+     0.000014) futex(0xc00008c948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>                                                                               
[pid 1831038] 13:26:08.444982 (+     0.000011) futex(0xb2f228, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>                                                                                   
[pid 1831042] 13:26:08.445113 (+     0.000131) write(6, "GET /v1/services?names= HTTP/1.1"..., 108) = 108                                                                                     
[pid 1831042] 13:26:08.445189 (+     0.000076) futex(0xb2f228, FUTEX_WAKE_PRIVATE, 1) = 1                                                                                                     
[pid 1831038] 13:26:08.445221 (+     0.000031) <... futex resumed>) = 0                                                                                                                       
[pid 1831041] 13:26:08.445233 (+     0.000012) <... epoll_pwait resumed>[{EPOLLOUT, {u32=2994289048, u64=139997453302168}}], 128, 249, NULL, 0) = 1                                           
[pid 1831042] 13:26:08.445253 (+     0.000019) futex(0xc000100148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>                                                                               
[pid 1831038] 13:26:08.445264 (+     0.000011) futex(0xb2f228, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>                                                                                   
[pid 1831041] 13:26:08.445276 (+     0.000012) epoll_pwait(3, [], 128, 0, NULL, 0) = 0                                                                                                        
[pid 1831041] 13:26:08.445321 (+     0.000044) epoll_pwait(3,  <unfinished ...>
[pid 1831039] 13:26:08.445393 (+     0.000071) futex(0xb2f5e0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=248775021} <unfinished ...>
[pid 1831041] 13:26:08.693712 (+     0.248341) <... epoll_pwait resumed>[], 128, 248, NULL, 0) = 0
[pid 1831041] 13:26:08.693775 (+     0.000053) epoll_pwait(3, [], 128, 0, NULL, 0) = 0
[pid 1831041] 13:26:08.693841 (+     0.000053) epoll_pwait(3,  <unfinished ...>
[pid 1831039] 13:26:08.694339 (+     0.000503) <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)

I don't have a stable reproducer yet; meanwhile, I will try to add pprof for the execution tracer. Maybe it is related to https://github.com/canonical/pebble/issues/163
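
For reference, a generic way to expose pprof and the execution tracer in a Go daemon is to import net/http/pprof and serve it on a local-only port; this is only a sketch, not how it is (or will be) wired into Pebble or the MAAS snap:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers, including /debug/pprof/trace
)

func main() {
	// Serve the profiling endpoints on localhost only, for debugging.
	// The address is an arbitrary example.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the daemon's normal work would go here ...
	select {}
}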

troyanov commented 11 months ago

Just in case, here is maas.pebble.stack.log, obtained by executing pkill -TRAP pebble.

sparkiegeek commented 10 months ago

How MAAS builds pebble (in a snap):

parts:
  pebble:
    plugin: go
    build-packages:
      - golang-go
    source-type: git
    source: https://github.com/canonical/pebble
    source-tag: v1.4.0

Running pebble:

environment:
  LD_PRELOAD: $SNAP/usr/lib/snap-preload.so
  PEBBLE: $SNAP_COMMON/pebble

apps:
  pebble:
    daemon: simple
    command: bin/run-pebble-daemon
    plugs:
      - avahi-observe
      - hardware-observe
      - kernel-module-observe
      - mount-observe
      - network
      - network-bind
      - network-control
      - network-observe
      - snap-refresh-control
      - system-observe
      - time-control

Where 'bin/run-pebble-daemon' is simply exec pebble "$@"

The layers are at https://github.com/maas/maas/tree/master/snap/local/tree/usr/share/maas/pebble/layers

The LD_PRELOAD is a build of https://github.com/maas/snap-preload

cjdcordeiro commented 10 months ago

I've experienced Pebble getting stuck whenever trying to run services that either 1) run with users that are more privileged than the user executing Pebble, or 2) access directories the snap doesn't have access to.

This is one of the reasons why the official pebble snap is classic.

benhoyt commented 10 months ago

@cjdcordeiro Thanks, that's a good data point. Do you know why that caused Pebble to get stuck, and do you remember what "getting stuck" meant or looked like in your case?

cjdcordeiro commented 10 months ago

Well, the "stuck" was misleading, cause it would eventually fail, after some time. I do remember this happening when we had the issue of running pebble as foo, and then asking it to run a service with the same user (user: foo)

sparkiegeek commented 9 months ago

We've forked pebble and are attempting to capture better traces - https://github.com/maas/pebble/tree/1.5.0-pprof

troyanov commented 7 months ago

We didn't observe this problem with pprof enabled, so we decided to switch back to 1.7.0. After several days we hit this issue again. maas.pebble.trace.txt

From what I see, there are many goroutines waiting to acquire this lock https://github.com/canonical/pebble/blob/3abbadccd2d771b8032e30b7193ec6b7d63f853f/internals/overlord/state/state.go#L129

benhoyt commented 7 months ago

Thanks, @troyanov, that sounds like good information. I'll aim to take a look at this today or tomorrow.

benhoyt commented 7 months ago

Just a quick update: I've combed over the goroutine dump, and it's definitely some kind of mutex deadlock. There are several mutexes involved (State.mu, ServiceManager.servicesLock, ServiceManager.planLock). Maybe they're mutually waiting on one another, or maybe we've forgotten an Unlock somewhere. Anyway, still digging into it. Will pick it back up on Monday and maybe rubber-duck it with another Gopher.

benhoyt commented 7 months ago

So yeah, this is almost certainly a deadlock, and it's likely happening due to the default behaviour of Go's net/http package, which is to recover panics and keep serving. Per the docs:

If ServeHTTP panics, the server (the caller of ServeHTTP) assumes that the effect of the panic was isolated to the active request. It recovers the panic, logs a stack trace to the server error log, and either closes the network connection or sends an HTTP/2 RST_STREAM, depending on the HTTP protocol.

So if a panic happens in a ServeHTTP goroutine while one of the mutex locks is held (for example, the main State.mu lock, which is effectively a singleton), the server will get stuck and be completely unresponsive after that. When I add a test panic between an st.Lock() and st.Unlock() to try that, sure enough, I get these exact symptoms, and sending SIGTRAP spits out a goroutine dump similar to the one you attached above.

We normally use defer st.Unlock(), which would unlock the lock properly even in case of a panic. However, we use plain Unlock() without defer in quite a number of places. So I suspect we have a possible panic hiding between a Lock() and an Unlock() somewhere. I'll do a more thorough comb-through of those later.

For now, I think we should probably change Pebble to not recover panics. There are two Go bugs open asking for a way to opt out of this net/http behaviour, but they decided not to change the default or add a way to opt out, for compatibility reasons. This bug shows a workaround to opt out, and the other bug describes exactly this case -- mutexes left locked + panic recovery causing issues:

As a concrete example of this, I'm looking into a bug where a handler crashed in code that tracks request statistics and left a mutex locked. Future requests blocked on this lock, piling up deadlocked goroutines. We'd have been much better off if the process had simply crashed and been restarted.

So I'll put up a PR to make that change, that is, opt out of recovering from panics. That will at least make it fail hard and we'll see what's causing the panic and deadlock in the first place.
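
For illustration, the usual shape of such an opt-out is middleware that recovers the panic before net/http's own recovery runs, logs the stack, and exits so the process fails hard; this is a generic sketch of the pattern, not necessarily what the PR will do:

package main

import (
	"log"
	"net/http"
	"os"
	"runtime"
)

// exitOnPanic recovers a handler panic before net/http's built-in recovery
// runs, logs the stack trace, and exits so the process fails hard and gets
// restarted by its supervisor instead of serving on with a mutex left locked.
func exitOnPanic(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if v := recover(); v != nil && v != http.ErrAbortHandler {
				buf := make([]byte, 64<<10)
				buf = buf[:runtime.Stack(buf, false)]
				log.Printf("http: panic serving %s: %v\n%s", r.URL.Path, v, buf)
				os.Exit(1)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
		panic("handler bug while a lock is held")
	})
	log.Fatal(http.ListenAndServe("localhost:8080", exitOnPanic(mux)))
}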

I'll also investigate the 45 places we use a non-deferred Unlock() and see if any of them look suspect. We should really change all of those to use defer (possibly inside a func() { ... }() block) -- maybe I'll put up a PR for that too.
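
For example, a non-deferred critical section can be scoped by a closure so the Unlock is deferred without holding the lock for the rest of the enclosing function; mu and doWork below are illustrative names, not a specific Pebble call site:

package example

import "sync"

// Before: a panic in doWork leaves mu locked forever, because the plain
// Unlock below it never runs.
//
//     mu.Lock()
//     doWork()
//     mu.Unlock()
//
// After: wrap the critical section in a closure so the deferred Unlock runs
// even if doWork panics.
func doWorkLocked(mu *sync.Mutex, doWork func()) {
	func() {
		mu.Lock()
		defer mu.Unlock()
		doWork()
	}()
	// ... code that must not hold mu continues here ...
}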

One other thing for @troyanov and team to try: If the above analysis is correct and a panic is happening, the original panic should be printed to the Pebble log (stderr). Are you storing the Pebble logs somewhere? Do you see anything like the following in them? ("http: panic serving" would be a good search term).

2024/02/05 15:39:37 http: panic serving pid=386038;uid=1000;socket=/home/ben/pebble/.pebble.socket;: runtime error: index out of range [10] with length 3
goroutine 25 [running]:
net/http.(*conn).serve.func1()
    /home/ben/sdk/go1.21.0/src/net/http/server.go:1868 +0xb9
panic({0x816280?, 0xc000334090?})
    /home/ben/sdk/go1.21.0/src/runtime/panic.go:920 +0x270
github.com/canonical/pebble/internals/daemon.doStuffThatMightPanic(...)
    /home/ben/w/pebble/internals/daemon/api_changes.go:187
github.com/canonical/pebble/internals/daemon.v1GetChangeWait(0xb49c20, 0x3?, 0xc00031a070?)
    /home/ben/w/pebble/internals/daemon/api_changes.go:195 +0x91
github.com/canonical/pebble/internals/daemon.(*Command).ServeHTTP(0xb49c20, {0x8ea518, 0xc00031c078}, 0xc00032e500)
    /home/ben/w/pebble/internals/daemon/daemon.go:272 +0x26e
github.com/gorilla/mux.(*Router).ServeHTTP(0xc00014e000, {0x8ea518, 0xc00031c078}, 0xc00032e300)
    /home/ben/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210 +0x1c5
github.com/canonical/pebble/internals/daemon.(*Daemon).Start.logit.func3({0x8ea758?, 0xc0003380e0}, 0xc00032e300)
    /home/ben/w/pebble/internals/daemon/daemon.go:343 +0xd9
net/http.HandlerFunc.ServeHTTP(0xb81280?, {0x8ea758?, 0xc0003380e0?}, 0xc0003bfb50?)
    /home/ben/sdk/go1.21.0/src/net/http/server.go:2136 +0x29
net/http.serverHandler.ServeHTTP({0xc000119f80?}, {0x8ea758?, 0xc0003380e0?}, 0x6?)
    /home/ben/sdk/go1.21.0/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc000162a20, {0x8eac70, 0xc0001fe570})
    /home/ben/sdk/go1.21.0/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 21
    /home/ben/sdk/go1.21.0/src/net/http/server.go:3086 +0x5cb
hpidcock commented 7 months ago

What we have here is the interplay of 3 locks.

goroutine 101 has A, goroutine 101 wants B
goroutine 100 has B, goroutine 100 wants C
goroutine 79 has C, goroutine 79 wants A


goroutine 101 [sync.Mutex.Lock, 194 minutes]:
- Has services lock
- Waiting for state lock
github.com/canonical/pebble/internals/overlord/state.(*State).Lock(...)
        /build/maas/parts/pebble/build/internals/overlord/state/state.go:129
github.com/canonical/pebble/internals/overlord/servstate.taskLogf(0xc000226978?, {0x84c785, 0x22}, {0xc0000f1cd8, 0x1, 0x1})
        /build/maas/parts/pebble/build/internals/overlord/servstate/handlers.go:208 +0x86 fp=0xc0000f1c78 sp=0xc0000f1bf8 pc=0x73a386
github.com/canonical/pebble/internals/overlord/servstate.(*ServiceManager).serviceForStop(0xc000228240?, 0xc00003e000?, {0xc000226978, 0x4})
        /build/maas/parts/pebble/build/internals/overlord/servstate/handlers.go:260 +0x379 fp=0xc0000f1d10 sp=0xc0000f1c78 pc=0x73aaf9
github.com/canonical/pebble/internals/overlord/servstate.(*ServiceManager).doStop(0xc0001b5360, 0xc000504de8?, 0xc0004d6500)
        /build/maas/parts/pebble/build/internals/overlord/servstate/handlers.go:222 +0xd2 fp=0xc0000f1dd0 sp=0xc0000f1d10 pc=0x73a5d2
github.com/canonical/pebble/internals/overlord/servstate.(*ServiceManager).doStop-fm(0xc00029c371?, 0x0?)
        <autogenerated>:1 +0x29 fp=0xc0000f1df8 sp=0xc0000f1dd0 pc=0x7449a9
github.com/canonical/pebble/internals/overlord/state.(*TaskRunner).run.func1()
        /build/maas/parts/pebble/build/internals/overlord/state/taskrunner.go:195 +0xa7 fp=0xc0000f1f68 sp=0xc0000f1df8 pc=0x7172a7

goroutine 100 [sync.Mutex.Lock, 194 minutes]:
- Has state lock
- Waiting for plan lock
github.com/canonical/pebble/internals/overlord/servstate.(*ServiceManager).acquirePlan(0xc0001b5360)
        /build/maas/parts/pebble/build/internals/overlord/servstate/manager.go:229 +0x3e fp=0xc00046b300 sp=0xc00046b2c0 pc=0x74065e
github.com/canonical/pebble/internals/overlord/servstate.(*ServiceManager).StartOrder(0xc0001b5360, {0xc0003380a0, 0x1, 0x1})
        /build/maas/parts/pebble/build/internals/overlord/servstate/manager.go:377 +0x52 fp=0xc00046b370 sp=0xc00046b300 pc=0x7415f2
github.com/canonical/pebble/internals/daemon.v1PostServices(0xb42d80, 0xc0002aeb00, 0xc000338060?)
        /build/maas/parts/pebble/build/internals/daemon/api_services.go:110 +0x906 fp=0xc00046b848 sp=0xc00046b370 pc=0x756186
github.com/canonical/pebble/internals/daemon.(*Command).ServeHTTP(0xb42d80, {0x8e51b8, 0xc00030a2a0}, 0xc0002aeb00)
        /build/maas/parts/pebble/build/internals/daemon/daemon.go:272 +0x26e fp=0xc00046b8e0 sp=0xc00046b848 pc=0x758d2e
github.com/gorilla/mux.(*Router).ServeHTTP(0xc000000180, {0x8e51b8, 0xc00030a2a0}, 0xc0002ae300)
        /root/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210 +0x1c5 fp=0xc00046ba08 sp=0xc00046b8e0 pc=0x7028a5
github.com/canonical/pebble/internals/daemon.(*Daemon).Start.logit.func3({0x8e53f8?, 0xc0002a8380}, 0xc0002ae300)
        /build/maas/parts/pebble/build/internals/daemon/daemon.go:343 +0xd9 fp=0xc00046bb20 sp=0xc00046ba08 pc=0x75a759

goroutine 79 [sync.Mutex.Lock, 194 minutes]:
- Has plan lock
- Waiting for services lock
github.com/canonical/pebble/internals/overlord/servstate.(*ServiceManager).Services(0xc0001b5360, {0xc000416050, 0x1, 0x5?})
        /build/maas/parts/pebble/build/internals/overlord/servstate/manager.go:284 +0xaf fp=0xc0001bf748 sp=0xc0001bf508 pc=0x74086f
github.com/canonical/pebble/internals/daemon.v1GetServices(0xb42d80, 0x3?, 0xc000416010?)
        /build/maas/parts/pebble/build/internals/daemon/api_services.go:42 +0x1a8 fp=0xc0001bf848 sp=0xc0001bf748 pc=0x755588
github.com/canonical/pebble/internals/daemon.(*Command).ServeHTTP(0xb42d80, {0x8e51b8, 0xc0000123d8}, 0xc00041ab00)
        /build/maas/parts/pebble/build/internals/daemon/daemon.go:272 +0x26e fp=0xc0001bf8e0 sp=0xc0001bf848 pc=0x758d2e
github.com/gorilla/mux.(*Router).ServeHTTP(0xc000000180, {0x8e51b8, 0xc0000123d8}, 0xc00041a900)
        /root/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210 +0x1c5 fp=0xc0001bfa08 sp=0xc0001bf8e0 pc=0x7028a5
github.com/canonical/pebble/internals/daemon.(*Daemon).Start.logit.func3({0x8e53f8?, 0xc0001e41c0}, 0xc00041a900)
        /build/maas/parts/pebble/build/internals/daemon/daemon.go:343 +0xd9 fp=0xc0001bfb20 sp=0xc0001bfa08 pc=0x75a759
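
To make the cycle above concrete, here is a minimal, self-contained sketch of the same shape of three-way deadlock; the goroutine labels mirror the stacks above, but the code is purely illustrative, not Pebble's:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var services, state, plan sync.Mutex

	// Each goroutine holds one lock and then tries to take the next one,
	// forming the cycle services -> state -> plan -> services.
	hold := func(name string, first, second *sync.Mutex) {
		first.Lock()
		time.Sleep(100 * time.Millisecond) // let every goroutine grab its first lock
		fmt.Println(name, "waiting for its second lock")
		second.Lock() // never returns once the cycle is complete
		second.Unlock()
		first.Unlock()
	}

	go hold("doStart/doStop (taskLogf)", &services, &state)
	go hold("POST /v1/services", &state, &plan)
	go hold("GET /v1/services", &plan, &services)

	time.Sleep(time.Second)
	fmt.Println("all three goroutines are now deadlocked")
}
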
troyanov commented 7 months ago

@benhoyt thank you for the update and the great investigation. net/http might indeed be the issue. I will take your branch from https://github.com/canonical/pebble/pull/350, try running MAAS with it, and check whether any panics show up in the logs.

benhoyt commented 7 months ago

@troyanov See Harry's comment above, though -- I haven't looked in detail yet, but it looks like he's found that it's a plain old deadlock / circular lock issue, so it may not actually be panic-related. You're still welcome to try #350, but it may not address this issue after all.

troyanov commented 7 months ago

@benhoyt I checked that we capture both stdout and stderr and there was no panic logged, so I guess it is indeed the thing that Harry discovered.

benhoyt commented 7 months ago

A brief update. @hpidcock and I talked through this yesterday, and we have a couple of ideas about how to resolve the deadlocks. A couple of things I'm going to do to help find other instances of this:

I aim to start on this tomorrow.

benhoyt commented 6 months ago

Another update -- I have a simple repro for this now:

1) Start pebble run with a single layer containing a single service:

services:
  test:
    override: replace
    command: sleep 3600

2) Run 3 instances of this little hitty tool (modes "start", "stop", and "service"), pointing PEBBLE_SOCKET at the socket above. The hitty tool hits POST /v1/services (start and stop) and GET /v1/services over and over rapidly, with randomized 0-50ms intervals between requests (a rough sketch of such a client follows the goroutine patterns below).

3) Wait a couple of seconds for it to deadlock and go unresponsive. It will be the same failure mode the MAAS team is seeing.

4) Send SIGTRAP to the Pebble daemon to get it to dump its goroutine stacks. Look for one of these patterns (start or stop versions are essentially the same deadlock):

# repro 1
goroutine 410: doStart: held services, waiting for state (taskLogf)
goroutine  25: POST /v1/services (start): held state, waiting for plan
goroutine 169: GET /v1/services: held plan, waiting for services

# repro 2
goroutine 275: doStop: held services, waiting for state (taskLogf)
goroutine 117: POST /v1/services (stop): held state, waiting for plan
goroutine 135: GET /v1/services: held plan, waiting for services
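
For reference, a rough stand-in for the "hitty" client described in step 2 might look like the following. The actual tool isn't shown in this thread, the mode names are illustrative, and the /v1/services request body is an assumption about Pebble's services API, so treat this purely as a sketch:

package main

import (
	"bytes"
	"context"
	"log"
	"math/rand"
	"net"
	"net/http"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: hitty start|stop|services")
	}
	mode := os.Args[1]
	socket := os.Getenv("PEBBLE_SOCKET") // e.g. /path/to/.pebble.socket

	// HTTP client that talks to Pebble's unix socket; the host in the URLs
	// below is ignored because DialContext always dials the socket.
	client := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", socket)
		},
	}}

	for {
		switch mode {
		case "services":
			if resp, err := client.Get("http://localhost/v1/services"); err == nil {
				resp.Body.Close()
			}
		default: // "start" or "stop"
			body := []byte(`{"action": "` + mode + `", "services": ["test"]}`)
			if resp, err := client.Post("http://localhost/v1/services",
				"application/json", bytes.NewReader(body)); err == nil {
				resp.Body.Close()
			}
		}
		time.Sleep(time.Duration(rand.Intn(51)) * time.Millisecond) // 0-50ms
	}
}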

I hope to break the deadlock in a couple of ways to fix this issue:

Now that I have a simple repro we'll be able to easily tell if the fixes work.

I also still hope to clean up the lock handling in other ways, as well as add diagnostics (lock wrapping) to see if we can catch any other instances of this problem. We'll see how far I get this week. Either way, I'll aim to have a fix ready to go by the end of this pulse.

benhoyt commented 6 months ago

For future reference, https://github.com/sasha-s/go-deadlock is a pretty neat library to help find issues like this: you replace your global sync.Mutex locks with deadlock.Mutex and it does the rest: when it finds lock ordering issues or deadlock timeouts (defaults to 30s) it prints a nice dump of just the goroutines holding locks and what they were doing, then exits hard with os.Exit(2).
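
A minimal sketch of how that swap might look, with illustrative names (not Pebble's actual State type):

package example

import (
	"time"

	"github.com/sasha-s/go-deadlock"
)

func init() {
	// Optional: shorten the default 30s timeout while hunting a deadlock.
	deadlock.Opts.DeadlockTimeout = 10 * time.Second
}

// State is illustrative; the only change from the normal code is that the
// sync.Mutex field becomes a deadlock.Mutex while debugging.
type State struct {
	mu deadlock.Mutex
}

func (s *State) Lock()   { s.mu.Lock() }
func (s *State) Unlock() { s.mu.Unlock() }

// If any goroutine waits on s.mu longer than the timeout, or locks are taken
// in inconsistent order, go-deadlock dumps the lock-holding goroutines and
// exits hard.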

It's tempting to include this library and enable it by default, but I'm not sure how robust and well-tested it is, and I don't want to accidentally make the problem worse (the deadlock.Mutex code is non-trivial, and involves another gnarly library that fetches goroutine IDs). So for now I think we should just keep go-deadlock in our toolbox.

benhoyt commented 6 months ago

I plan to add a test to https://github.com/canonical/pebble/pull/359 and merge that today, then release a new Pebble version (v1.8.0) with all of the recent changes. I'll also release a v1.4.1 patch release for Juju 3.3 (which uses Pebble v1.4.0).

benhoyt commented 6 months ago

@troyanov @sparkiegeek This is fixed now and I just tagged release v1.8.0 (the Snap Store snap build is playing up, but it looks like you're not using that to build Pebble for your use case anyway). So please go ahead and use v1.8.0! Thanks for your patience on this and for sending the helpful stack trace when it did occur again!

troyanov commented 6 months ago

@benhoyt thanks a lot for helping with this one. I will bump the Pebble version used in MAAS to 1.8.0.

benhoyt commented 6 months ago

Sounds good. At this point we've definitely addressed the issue at hand; I can't guarantee there are no other deadlocks lurking in Pebble -- I'm still planning to think about how we structure locking more generally in the future, to see if there are more general ways we can avoid getting into situations like this. In any case, fingers crossed, but just reach out if you see anything suspect!