Our backoff is a little fast in reloadUntilOperationStatusChanged, and
usually results in quickly having to wait several minutes for the next
reload to occur.
During my testing
(https://github.com/aptible/aptible-integration/pull/109), this turned
out to be pretty disruptive. When provisioning a DB, we'd often end up
having to wait for several minutes after the DB was deprovisioned before
it showed up as provisioned.
There are two changes to our polling interval here:
First, we don't poll as often on a new record. In my (albeit limited)
testing I've observed that most meaningful operations need at least 10
seconds to complete (but will complete within a few seconds if they
get e.g. rate limited), so it's not that useful to poll at 1, 3 and 7
seconds. In this patch, we start the polling at 4 seconds, which seems
like a decent default. I'm happy to tweak if needed.
Second, we don't backoff as fast (I chose 1.2 as the factor, as
opposed to 2 that we had before).
For 10 minutes, here is the new polling "schedule" (these are the
durations in second we wait between each poll, rounded to the nearest
integer - the actual implementation is a little more precise):
For comparison, here was the old one (but: see the caveat below, in
practice it'd probably wait once more after 1024 seconds):
[1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
Unfortunately, it looks like the current implementation was tying the
retry interval and the overall timeout together, so simply using a
smaller backoff factor would have had the (presumably undesirable) side
effect of causing reloadUntilOperationStatusChanged to wait a lot
longer.
So, this patch uses another approach for
reloadUntilOperationStatusChanged, which unties the retry interval and
the timeout. This does mean that we'll timeout once the timeout set when
calling reloadUntilOperationStatusChanged is met. I'm assuming this is
desirable, but let me know if not.
This also fixes the call we were making to
reloadUntilOperationStatusChanged when scaling a service: it turns out
we were passing the operation when a timeout was expected. THis
surprisingly didn't break anything, but most likely resulted in the
timeout being ignored altogether.
I've tested this manually for now, since I'm merely changing the
internals of this function, and I know we have tests for its observed
behavior (off the top of my head, I believe we have some in the ACME
specs).
cc @sandersonet @gib @fancyremarker @blakepettersson
Our backoff is a little fast in reloadUntilOperationStatusChanged, and usually results in quickly having to wait several minutes for the next reload to occur.
During my testing (https://github.com/aptible/aptible-integration/pull/109), this turned out to be pretty disruptive. When provisioning a DB, we'd often end up having to wait for several minutes after the DB was deprovisioned before it showed up as provisioned.
There are two changes to our polling interval here:
For 10 minutes, here is the new polling "schedule" (these are the durations in second we wait between each poll, rounded to the nearest integer - the actual implementation is a little more precise):
For comparison, here was the old one (but: see the caveat below, in practice it'd probably wait once more after 1024 seconds):
Unfortunately, it looks like the current implementation was tying the retry interval and the overall timeout together, so simply using a smaller backoff factor would have had the (presumably undesirable) side effect of causing reloadUntilOperationStatusChanged to wait a lot longer.
So, this patch uses another approach for reloadUntilOperationStatusChanged, which unties the retry interval and the timeout. This does mean that we'll timeout once the timeout set when calling reloadUntilOperationStatusChanged is met. I'm assuming this is desirable, but let me know if not.
This also fixes the call we were making to reloadUntilOperationStatusChanged when scaling a service: it turns out we were passing the operation when a timeout was expected. THis surprisingly didn't break anything, but most likely resulted in the timeout being ignored altogether.
I've tested this manually for now, since I'm merely changing the internals of this function, and I know we have tests for its observed behavior (off the top of my head, I believe we have some in the ACME specs).
cc @sandersonet @gib @fancyremarker @blakepettersson