TritonDataCenter / manta-mlive

Manta liveness checker
Mozilla Public License 2.0

mlive could be less frenetic about slow requests #6

Closed: davepacheco closed this issue 7 years ago

davepacheco commented 7 years ago

It's not uncommon to see streaks like this:

2017-02-09T17:47:01.307Z: all okay (20/20 ok since 2017-02-09T17:47:00.307Z)
2017-02-09T17:47:10.752Z: corked (at 100 requests)
2017-02-09T17:47:10.804Z: uncorked (was corked for 52 ms)
2017-02-09T17:47:11.547Z: corked (at 100 requests)
2017-02-09T17:47:11.602Z: uncorked (was corked for 55 ms)
2017-02-09T17:47:13.332Z: corked (at 100 requests)
2017-02-09T17:47:13.385Z: uncorked (was corked for 53 ms)
2017-02-09T17:47:13.862Z: corked (at 100 requests)
2017-02-09T17:47:13.914Z: uncorked (was corked for 52 ms)
2017-02-09T17:47:14.377Z: corked (at 100 requests)
2017-02-09T17:47:14.429Z: uncorked (was corked for 52 ms)
2017-02-09T17:47:14.480Z: corked (at 100 requests)
2017-02-09T17:47:14.531Z: uncorked (was corked for 51 ms)
2017-02-09T17:47:14.684Z: corked (at 100 requests)
2017-02-09T17:47:14.787Z: uncorked (was corked for 103 ms)
2017-02-09T17:47:14.842Z: corked (at 100 requests)
2017-02-09T17:47:15.000Z: uncorked (was corked for 158 ms)
2017-02-09T17:47:15.460Z: corked (at 100 requests)
2017-02-09T17:47:15.512Z: uncorked (was corked for 52 ms)
2017-02-09T17:47:15.771Z: corked (at 100 requests)
2017-02-09T17:47:15.821Z: uncorked (was corked for 50 ms)
2017-02-09T17:47:16.328Z: all okay (266/266 ok since 2017-02-09T17:47:01.307Z)
2017-02-09T17:47:31.351Z: all okay (344/344 ok since 2017-02-09T17:47:16.328Z)
2017-02-09T17:47:46.369Z: all okay (327/327 ok since 2017-02-09T17:47:31.351Z)

We emit the "corked" message if our 20 Hz timer fires and we're already at our request limit. We emit "uncorked" when the timer fires, we were corked, and we're now back under the limit. As a result, it's easy for this to oscillate: dispatching at 20 requests per second with 5-second latency means about 20/sec * 5 sec = 100 requests outstanding at any time, which is exactly our limit, so we flip between corked and uncorked on nearly every tick, just like the log above. Five seconds is admittedly a lot of latency, but mlive (at least today) is about identifying downtime, not high latency. It would be great to add some sort of latency report as well, but rapid flapping like this isn't the way to do it.
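
To make that concrete, here's a minimal Node.js sketch of that kind of tick handler (hypothetical names; this is not mlive's actual source):

/*
 * Hypothetical sketch of the cork/uncork reporting, not mlive's real code.
 * The timer fires every 50 ms (20 Hz).  A tick that finds us at the
 * concurrency limit reports "corked" (once per corking); the next tick
 * that finds us back under the limit reports "uncorked".
 */
var maxOutstanding = 100;       /* concurrency limit */
var nOutstanding = 0;           /* requests currently in flight */
var corkedSince = null;         /* when we last corked, or null if uncorked */

/* Stand-in for a real Manta request; completes after about 5 seconds. */
function issueRequest(callback) {
    setTimeout(callback, 5000);
}

setInterval(function onTick() {
    var now = new Date();

    if (nOutstanding >= maxOutstanding) {
        if (corkedSince === null) {
            corkedSince = now;
            console.log('%s: corked (at %d requests)',
                now.toISOString(), nOutstanding);
        }
        return;
    }

    if (corkedSince !== null) {
        console.log('%s: uncorked (was corked for %d ms)',
            now.toISOString(), now.getTime() - corkedSince.getTime());
        corkedSince = null;
    }

    nOutstanding++;
    issueRequest(function onRequestDone() {
        nOutstanding--;
    });
}, 50);                         /* 50 ms between ticks = 20 Hz */

Run as-is, the stubbed 5-second requests keep it hovering right at the 100-request limit, so it should flap between corked and uncorked roughly the way the log above does.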

davepacheco commented 7 years ago

There's already code to deal with being "stuck", which triggers when we haven't dispatched any read or write requests for 1 second. On my test deployment, I verified that this still kicks in when we'd expect. Here's an annotated transcript (there's a rough sketch of the stuck check after the discussion below):

$ ./bin/mlive -s 2 10
will stomp over directory at "/poseidon/stor/mlive"
will test for liveness: reads, writes
assuming 2 metadata shards
using base paths: /poseidon/stor/mlive
time between requests: 50 ms
maximum outstanding requests: 100
environment:
    MANTA_USER = poseidon
    MANTA_KEY_ID = c6:83:1f:28:c2:c3:7f:11:13:c2:49:7a:eb:ca:96:9c
    MANTA_URL = https://manta.emy-10.joyent.us
    MANTA_TLS_INSECURE = 1
creating test directory tree ... done.
2017-02-09T18:28:29.605Z: reads okay, writes okay (17/17 ok since start)
2017-02-09T18:28:39.672Z: all okay (200/200 ok since 2017-02-09T18:28:29.605Z)
2017-02-09T18:28:49.780Z: all okay (200/200 ok since 2017-02-09T18:28:39.672Z)
2017-02-09T18:28:59.883Z: all okay (200/200 ok since 2017-02-09T18:28:49.780Z)
2017-02-09T18:29:09.968Z: all okay (200/200 ok since 2017-02-09T18:28:59.883Z) <-- time T1
2017-02-09T18:29:20.028Z: all okay (185/185 ok since 2017-02-09T18:29:09.968Z) 
2017-02-09T18:29:30.110Z: all okay (177/177 ok since 2017-02-09T18:29:20.028Z)
2017-02-09T18:29:40.181Z: all okay (177/177 ok since 2017-02-09T18:29:30.110Z) <-- time T2
2017-02-09T18:29:50.270Z: all okay (258/258 ok since 2017-02-09T18:29:40.181Z)
2017-02-09T18:30:00.314Z: all okay (199/199 ok since 2017-02-09T18:29:50.270Z)
2017-02-09T18:30:10.409Z: all okay (200/200 ok since 2017-02-09T18:30:00.314Z) <-- time T3
2017-02-09T18:30:15.441Z: all stuck (66/66 ok since 2017-02-09T18:30:10.409Z)
2017-02-09T18:30:25.489Z: all stuck (0/0 ok since 2017-02-09T18:30:15.441Z) <-- time T4
2017-02-09T18:30:30.489Z: all okay (108/108 ok since 2017-02-09T18:30:25.489Z)
2017-02-09T18:30:40.565Z: all okay (204/204 ok since 2017-02-09T18:30:30.489Z)
2017-02-09T18:30:50.651Z: all okay (200/200 ok since 2017-02-09T18:30:40.565Z)
2017-02-09T18:31:00.737Z: all okay (200/200 ok since 2017-02-09T18:30:50.651Z)
2017-02-09T18:31:10.827Z: all okay (200/200 ok since 2017-02-09T18:31:00.737Z)
2017-02-09T18:31:20.947Z: all okay (200/200 ok since 2017-02-09T18:31:10.827Z)

My deployment has two haproxy processes serving the front door. Around time T1, I pstop'd one of them. We see the per-interval request count drop, but no other problem is reported. (That's not ideal, but it's also not affected by this change.) Around time T2, I prun'd the same process, causing the paused requests to complete. Around time T3, I pstop'd both haproxy processes, causing mlive to quickly become "stuck". Importantly, note that most of mlive's reports are 10 seconds apart (as I configured on the command line), but the gap between T3 and the next report is only about 5 seconds, because mlive reports "stuck" after just one second without dispatching anything. Similarly, at T4 I prun'd both processes, and mlive quickly reported that requests were completing again.
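
For reference, here's a rough Node.js sketch of the kind of stuck check described above (hypothetical names; the real mlive code may be structured differently): a periodic summary on the configured interval, plus a separate 1-second check that reports "stuck" as soon as nothing has been dispatched for a second.

/*
 * Hypothetical sketch of the stuck check, not mlive's real code.
 * The request-dispatch path would call recordDispatch() on every request.
 */
var reportIntervalMs = 10 * 1000;  /* summary interval from the command line */
var stuckThresholdMs = 1000;       /* "stuck" after 1 s with no dispatches */
var lastDispatch = Date.now();     /* time of the most recent dispatch */
var nOk = 0;                       /* requests completed since the last report */
var reportedStuck = false;

function recordDispatch() {
    lastDispatch = Date.now();
}

function report(label) {
    console.log('%s: %s (%d ok since last report)',
        new Date().toISOString(), label, nOk);
    nOk = 0;
}

/* Normal periodic summary, labeled according to the current state. */
setInterval(function onSummaryTick() {
    report(Date.now() - lastDispatch >= stuckThresholdMs ?
        'all stuck' : 'all okay');
}, reportIntervalMs);

/* Faster check: report the transition into "stuck" within about a second
 * rather than waiting for the next periodic summary. */
setInterval(function onStuckCheck() {
    if (Date.now() - lastDispatch >= stuckThresholdMs) {
        if (!reportedStuck) {
            reportedStuck = true;
            report('all stuck');
        }
    } else {
        reportedStuck = false;
    }
}, stuckThresholdMs);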

This looks good to me -- it's a lot less frenetic, but it still reports promptly when things have actually become stuck.

davepacheco commented 7 years ago

This change is "make prepush" clean, although there aren't any automated tests for this repo.