canonical / pebble

Pebble is a lightweight Linux service manager with layered configuration and an HTTP API.
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0

fix(daemon): avoid acquiring state lock in health check endpoint #369

Closed · benhoyt closed this 7 months ago

benhoyt commented 7 months ago

This PR avoids acquiring state lock in health check endpoint, a localised fix for the issue described in https://bugs.launchpad.net/juju/+bug/2052517 and #366.

I don't love the addition of the (exported) State.LockCount method, but I couldn't think of another way to ensure that State.Lock was not being called, short of significant surgery to let us swap the state lock out for an interface.
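
A minimal sketch of the idea (assuming a State guarded by a plain mutex; the real Pebble State has much more going on):

```go
// Sketch only: LockCount is exported purely so tests can assert that a
// code path never took the state lock.
package state

import (
	"sync"
	"sync/atomic"
)

type State struct {
	mu        sync.Mutex
	lockCount atomic.Int64
}

func (s *State) Lock() {
	s.mu.Lock()
	s.lockCount.Add(1)
}

func (s *State) Unlock() {
	s.mu.Unlock() // checkpointing elided; see the sketch further down
}

// LockCount reports how many times Lock has been called.
func (s *State) LockCount() int64 {
	return s.lockCount.Load()
}
```

A test can then record LockCount before hitting GET /v1/health and check that it hasn't changed afterwards.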

In addition, I'm not sure we need the maintenance/warnings stuff in Pebble at all, as it's not used, but I'm leaving that for a separate discussion and PR.

I tested by saving the script https://go.dev/play/p/LJDLEaXeBpd to cmd/hithealth/main.go, running a Pebble server, and then running a task to heavily load my CPU cores (though the latter doesn't seem to make much of a difference!).
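
The script itself isn't reproduced here, but a hypothetical reconstruction along these lines shows the shape (the listen address and the /v1/services payload are assumptions, not taken from the actual script):

```go
// cmd/hithealth/main.go -- hypothetical reconstruction of the test
// client: hammer GET /v1/health, print any request slower than 20ms,
// and concurrently restart services to keep the state lock busy.
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

const base = "http://localhost:4000" // assumed Pebble HTTP listen address

func main() {
	// Background load: concurrently ask Pebble to restart services, so
	// state is modified and the state lock is held repeatedly.
	for w := 0; w < 10; w++ {
		go func() {
			for {
				resp, err := http.Post(base+"/v1/services", "application/json",
					strings.NewReader(`{"action":"restart","services":["svc1"]}`))
				if err == nil {
					resp.Body.Close()
				}
			}
		}()
	}

	for i := 0; ; i++ {
		start := time.Now()
		resp, err := http.Get(base + "/v1/health")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		resp.Body.Close()
		if d := time.Since(start); d > 20*time.Millisecond {
			fmt.Printf("%6d > 20ms: %fs\n", i, d.Seconds())
		}
	}
}
```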

Without the fix, GET /v1/health response times are consistently over 20ms, and commonly in the 100-300ms range. In other words, very long for a do-nothing health check endpoint. This is because the other clients (in the hithealth script) are starting services, which modifies state and holds the state lock for relatively long stretches.

$ go run ./cmd/hithealth/
     0 > 20ms: 0.028897s
     1 > 20ms: 0.192796s
     2 > 20ms: 0.268904s
     3 > 20ms: 0.082985s
     4 > 20ms: 0.030554s
     5 > 20ms: 0.156912s
     6 > 20ms: 0.245212s
     7 > 20ms: 0.047436s
     8 > 20ms: 0.099474s
     9 > 20ms: 0.070440s
    10 > 20ms: 0.060641s
    11 > 20ms: 0.202535s
    12 > 20ms: 0.206226s
    ...

With the fix, GET /v1/health doesn't hold the state lock at all. Response times are only over 20ms every 10,000 or so requests, and never above 100ms even under heavy load:

$ go run ./cmd/hithealth/
 13891 > 20ms: 0.023703s
 15923 > 20ms: 0.024769s
 21915 > 20ms: 0.076423s
 ...
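
For context, one way the endpoint can avoid the lock entirely is to keep per-check health in a small structure guarded by its own mutex, which the health handler reads directly. This is an illustrative sketch under assumed names (CheckManager, Healthy), not necessarily the exact change in this PR:

```go
// Illustrative sketch: per-check health lives behind its own small
// mutex, so GET /v1/health never needs the state lock.
package checkstate

import "sync"

type CheckManager struct {
	healthMu sync.Mutex
	health   map[string]bool // check name -> currently passing
}

// SetHealth is called by the check runner whenever a check's status changes.
func (m *CheckManager) SetHealth(name string, healthy bool) {
	m.healthMu.Lock()
	defer m.healthMu.Unlock()
	if m.health == nil {
		m.health = make(map[string]bool)
	}
	m.health[name] = healthy
}

// Healthy reports whether all checks are passing. It takes only the
// tiny healthMu, never the (potentially slow to release) state lock.
func (m *CheckManager) Healthy() bool {
	m.healthMu.Lock()
	defer m.healthMu.Unlock()
	for _, ok := range m.health {
		if !ok {
			return false
		}
	}
	return true
}
```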
flotter commented 7 months ago

@benhoyt I only have a general question. The worst-case latency (~200ms) for the health request to acquire the state lock seems quite bad. I wonder if the WaitChange() that is also part of the service-starting loop (10x concurrently) in your test script is also contributing to state lock contention. I am probably missing some data on why this happens, but I would be surprised if any one state lock user holds the lock for a very long time, so even if a lot of state lock operations are interleaved, I am still a bit surprised that a single state lock request has to wait that long (even the average feels very long).

benhoyt commented 7 months ago

@flotter Thanks for the review. Yes, the WaitChange() could well be part of the contention, but I think that's okay -- it just adds "load" to the system and to the state lock. Here it's mainly the health check endpoint we're concerned about making super-fast. It's the State.Unlock() method that is slow, but only when state has been modified and needs to be written.
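
To make that concrete, here is the assumed shape of the slow path (a sketch, not Pebble's actual code): Unlock only checkpoints when state was modified while the lock was held, so readers pay nothing and writers pay marshal + write + fsync.

```go
// Sketch of the assumed Unlock slow path (not Pebble's actual code).
package state

import (
	"encoding/json"
	"os"
	"sync"
)

type State struct {
	mu       sync.Mutex
	modified bool
	entries  map[string]any
	path     string
}

func (s *State) Unlock() {
	defer s.mu.Unlock()
	if !s.modified {
		return
	}
	s.modified = false
	data, err := json.Marshal(s.entries) // slow: state is several hundred KB
	if err != nil {
		return
	}
	// Pebble uses osutil.AtomicWrite (temp file, fsync, rename) here;
	// os.WriteFile stands in to keep this sketch self-contained. It's
	// the fsync step that dominates under load.
	_ = os.WriteFile(s.path, data, 0o600)
}
```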

I did some further timing tests of the different slow parts of State.Unlock, recording the mean and maximum times on my emulated "loaded system" for the JSON marshal and for the checkpoint (which uses osutil.AtomicWrite to write the bytes to disk and calls fsync):

with fsync: avg marshal=0.011758 checkpoint=0.015349
  no fsync: avg marshal=0.010221 checkpoint=0.000594
with fsync: max marshal=0.068619 checkpoint=0.032137
  no fsync: max marshal=0.067646 checkpoint=0.000805

Two things to note:

1. Just doing the JSON marshalling is quite slow (state has grown quite large, several hundred KB of JSON, with quite a few Changes being recorded rapidly).
2. The checkpoint/save is slow not due to the file writing, but due to the fsync.
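
A rough, self-contained way to see that split (illustrative; not the exact harness used for the numbers above) is to time the marshal and the write+fsync separately:

```go
// Time JSON marshalling and checkpointing separately; comment out the
// f.Sync() call to see the "no fsync" checkpoint time.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

func main() {
	// Stand-in for a few hundred KB of state with many Changes.
	data := make(map[string]string, 10000)
	for i := 0; i < 10000; i++ {
		data[fmt.Sprintf("change-%d", i)] = "some recorded task state"
	}

	start := time.Now()
	b, err := json.Marshal(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("marshal:    %fs (%d bytes)\n", time.Since(start).Seconds(), len(b))

	start = time.Now()
	f, err := os.Create("state.json.tmp")
	if err != nil {
		panic(err)
	}
	f.Write(b)
	f.Sync() // the fsync is what dominates checkpoint time
	f.Close()
	os.Rename("state.json.tmp", "state.json")
	fmt.Printf("checkpoint: %fs\n", time.Since(start).Seconds())
}
```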

I'd like to consider improving either or both of these separately, for example not doing the fsync (maybe a bad idea?) or only saving every (say) 100ms at most. Or moving to a real database like SQLite... :-)
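
The "save at most every (say) 100ms" idea could look something like this sketch (hypothetical; the trade-off is that up to 100ms of changes can be lost on a crash):

```go
// Hypothetical sketch of coalesced checkpointing: mark state dirty on
// every change and let a background goroutine save at a fixed cadence.
package state

import (
	"sync"
	"time"
)

type saver struct {
	mu    sync.Mutex
	dirty bool
	write func() error // performs the actual checkpoint
}

func (s *saver) markDirty() {
	s.mu.Lock()
	s.dirty = true
	s.mu.Unlock()
}

func (s *saver) run() {
	for range time.Tick(100 * time.Millisecond) {
		s.mu.Lock()
		d := s.dirty
		s.dirty = false
		s.mu.Unlock()
		if d {
			s.write() // at most one marshal + fsync per interval
		}
	}
}
```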

benhoyt commented 7 months ago

Going to merge this as is, seeing as this is a fairly high-priority issue -- we can always improve on this implementation detail later.