Closed: cognifloyd closed this 2 years ago
@armab I think this should reduce the impact of slack rate limiting. wdyt?
Yes, that's a good use of retry with delays. Should definitely improve it :+1:
It looks like this actually made it worse.
K. I think this is ready now. I updated it to use a more exponential-ish backoff, and randomized the delay between retries slightly.
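The schedule described above can be sketched roughly like this (a hypothetical sketch with illustrative numbers, not the exact values in the commit): each retry waits longer than the last, plus a small random jitter so parallel builds don't all hit Slack again at the same moment.

```python
import random

def backoff_delays(retries=5, base=5, factor=2, jitter_fraction=0.5):
    """Exponential-ish backoff: base * factor**attempt, plus random jitter.
    All numbers here are illustrative, not the ones used in the PR."""
    delays = []
    for attempt in range(retries):
        delay = base * factor ** attempt          # 5, 10, 20, 40, ...
        # jitter spreads out parallel builds so they don't retry in lockstep
        delays.append(delay + random.uniform(0, jitter_fraction * delay))
    return delays
```

With a jitter fraction below 1.0 the schedule is still strictly increasing, so later retries always back off further even with the randomness.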
Hmm. I finally tripped the retry logic: https://travis-ci.org/StackStorm/ansible-st2/builds/635053348

- 1 build succeeded on the first try
- 7 builds triggered the retry logic
  - 3/7 succeeded
  - 4/7 failed
So it's not quite right yet. The command is only templated once, so using the retry attempt count to simulate an exponential backoff doesn't work.
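The pitfall can be sketched outside of Ansible (hypothetical names; this is not the actual task from the PR): if the delay expression is rendered once before the retry loop starts, every attempt sleeps the same amount, so there is no exponential growth.

```python
import time

def retry_templated_once(command, retries=5,
                         delay_expr=lambda attempt: 5 * 2 ** attempt):
    """Demonstrates the bug, not the fix: the delay expression is
    evaluated ONCE, mimicking Ansible rendering the task template a
    single time before any retries run."""
    delay = delay_expr(0)            # always 5: the attempt count never advances here
    for attempt in range(retries):
        if command():
            return True
        time.sleep(delay)            # same delay every retry: flat, not exponential
    return False
```

This is why the later commits move the backoff arithmetic into something that actually re-evaluates on each attempt.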
With the version in 661b037: https://travis-ci.org/StackStorm/ansible-st2/builds/635132037

- 1 build succeeded on the first try
- 7 builds triggered the retry logic
  - 5/7 succeeded
  - 2/7 failed

https://travis-ci.org/StackStorm/ansible-st2/builds/635132304

- 8 builds triggered the retry logic
  - 7/8 succeeded
  - 1/8 failed
So, this is much more reliable. The latest commit b0f9695 extends the sleep (only during retries) to hopefully catch the last few builds hitting the rate limit. This is already much better, and with this extra bit of sleep, I think it's good enough. Hopefully we won't have to kick travis as much.
Note that if a build succeeds on the first try, then it should be just as it is now (timeout 10 and sleep 5). Those times are only increased for retries.
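That split can be sketched like this (the 5-second and 10-second values come from the comment above; the function name and growth factor are hypothetical): the first attempt keeps the original sleep, and only retries get longer waits.

```python
def sleep_seconds(attempt, first_sleep=5, retry_base=10, factor=2):
    """First try keeps the original 5-second sleep; retries start at 10s
    and grow from there. The growth factor is illustrative only."""
    if attempt == 0:
        return first_sleep           # unchanged fast path for clean builds
    return retry_base * factor ** (attempt - 1)   # 10, 20, 40, ...
```

So a build that never rate-limits pays no extra cost; only the retried builds slow down.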
That is very disappointing. I seem to trigger these errors every time I try to rebase my PRs, and then they sit around with failed CI again.
The travis auto-cancel feature has been problematic for me on other projects. It will only cancel builds that have NOT started building. So if nothing is running, you update a PR, it starts building, and then you update the PR again a few minutes later, that first build will still have to complete before the next one starts. So, auto-cancel might reduce some pressure, but I doubt it'll make much difference.
Note that the Build History (triggered on a schedule or by merges to master) will only show failures if a PR gets pushed at the same time one of those builds happens to run. Because those builds are infrequent, you will rarely see errors there; the parallel build pressure is only a significant issue for PR authors. To see that history you need to look under "Pull Requests". Almost all of the failures in this screenshot are due to Slack rate limiting errors, not actual errors with the PRs:
It looks like you've lowered the max concurrent jobs from 4 to 2. :+1: Thanks! Hopefully jobs won't error on me so often.
@cognifloyd did you mean to commit the file `asdf`?
LOL. no.
I just rebased, removed the `asdf` file (oops), and added comments to explain how the backoff algorithm works.
This is an attempt to make the Travis tests more resilient when running the hubot tests in parallel. ~This will wait for 300 seconds for the port to close (no longer be in use) to avoid the EADDRINUSE error.~ As @armab pointed out, the EADDRINUSE error is a red herring; the real issue is Slack rate limiting. So this adds a retry with `until` to hopefully make more CI runs pass without intervention.