reloadUntilOperationStatusChanged: Backoff slower

krallin commented 7 years ago

Our backoff is a little fast in reloadUntilOperationStatusChanged, and usually results in quickly having to wait several minutes for the next reload to occur.

During my testing (https://github.com/aptible/aptible-integration/pull/109), this turned out to be pretty disruptive. When provisioning a DB, we'd often end up having to wait for several minutes after the DB was deprovisioned before it showed up as provisioned.

There are two changes to our polling interval here:

First, we don't poll as often on a new record. In my (albeit limited) testing I've observed that most meaningful operations need at least 10 seconds to complete (but will complete within a few seconds if they get e.g. rate limited), so it's not that useful to poll at 1, 3 and 7 seconds. In this patch, we start the polling at 4 seconds, which seems like a decent default. I'm happy to tweak if needed.
Second, we don't backoff as fast (I chose 1.2 as the factor, as opposed to 2 that we had before).

For 10 minutes, here is the new polling "schedule" (these are the durations in second we wait between each poll, rounded to the nearest integer - the actual implementation is a little more precise):

[4, 4, 5, 6, 8, 9, 11, 14, 17, 20, 24, 29, 35, 42, 51, 61, 73, 88, 106]

For comparison, here was the old one (but: see the caveat below, in practice it'd probably wait once more after 1024 seconds):

[1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

Unfortunately, it looks like the current implementation was tying the retry interval and the overall timeout together, so simply using a smaller backoff factor would have had the (presumably undesirable) side effect of causing reloadUntilOperationStatusChanged to wait a lot longer.

So, this patch uses another approach for reloadUntilOperationStatusChanged, which unties the retry interval and the timeout. This does mean that we'll timeout once the timeout set when calling reloadUntilOperationStatusChanged is met. I'm assuming this is desirable, but let me know if not.

This also fixes the call we were making to reloadUntilOperationStatusChanged when scaling a service: it turns out we were passing the operation when a timeout was expected. THis surprisingly didn't break anything, but most likely resulted in the timeout being ignored altogether.

I've tested this manually for now, since I'm merely changing the internals of this function, and I know we have tests for its observed behavior (off the top of my head, I believe we have some in the ACME specs).

cc @sandersonet @gib @fancyremarker @blakepettersson

krallin commented 7 years ago

had to adjust the ACME specs a little bit; confirms we do have indirect test coverage on this

blakepettersson commented 7 years ago

👍

aptible / dashboard.aptible.com

reloadUntilOperationStatusChanged: Backoff slower #773