StackStorm / ansible-st2

Ansible Roles and Playbooks to deploy StackStorm
https://galaxy.ansible.com/StackStorm/stackstorm/
Apache License 2.0

Improve chatops smoketest resiliency on travis #255

Closed cognifloyd closed 2 years ago

cognifloyd commented 4 years ago

This is an attempt to make the Travis tests more resilient when running the hubot tests in parallel. ~This will wait for 300 seconds for the port to close (no longer be in use) to avoid the EADDRINUSE error.~ As @armab pointed out, the EADDRINUSE error is a red herring; the real issue is Slack rate limiting. So this adds a retry/until loop to hopefully let more CI runs pass without intervention.
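
For reference, the "retry/until" mentioned above is Ansible's `until`/`retries`/`delay` task loop. A minimal, hypothetical sketch of that pattern (the task name and smoke-test command below are placeholders, not the actual ones in this repo):

```yaml
- name: Verify hubot responds to a chatops command
  shell: ./run-chatops-smoketest.sh   # placeholder for the real smoke-test command
  register: chatops_result
  until: chatops_result.rc == 0       # keep retrying until the command exits 0
  retries: 5
  delay: 30                           # seconds to wait between attempts
```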

cognifloyd commented 4 years ago

@armab I think this should reduce the impact of Slack rate limiting. WDYT?

arm4b commented 4 years ago

Yes, that's a good use of retry with delays. Should definitely improve it :+1:

cognifloyd commented 4 years ago

It looks like this actually made it worse.

cognifloyd commented 4 years ago

K. I think this is ready now. I updated it to use a more exponential-ish backoff, and randomized the delay between retries slightly.
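
One way to make the wait between attempts actually vary is to compute the jitter inside the shell command itself, since the shell re-runs (and re-evaluates `$RANDOM`) on every retry. This is a hedged sketch of that idea, not the exact change in this PR; the command name is a placeholder:

```yaml
- name: Verify hubot responds to a chatops command
  args:
    executable: /bin/bash            # $RANDOM is a bash feature, not POSIX sh
  shell: |
    # placeholder for the real smoke-test command; on failure, wait a
    # slightly randomized amount before Ansible retries the task
    timeout 10 ./run-chatops-smoketest.sh || { sleep $(( 5 + RANDOM % 10 )); exit 1; }
  register: chatops_result
  until: chatops_result.rc == 0
  retries: 5
```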

cognifloyd commented 4 years ago

Hmm. I finally tripped the retry logic: https://travis-ci.org/StackStorm/ansible-st2/builds/635053348

- 1 build succeeded on the first try
- 7 builds triggered the retry logic
  - 3/7 succeeded
  - 4/7 failed

So it's not quite right yet. The command is only templated once, so using the attempt count to simulate an exponential backoff doesn't work.
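
Since the command (including any `{{ ... }}` in it) is rendered once before the first attempt, a growing backoff has to be computed at runtime inside the shell itself, where a loop counter is available. A rough, hypothetical sketch of that approach, reusing the placeholder command from above (not the literal commit):

```yaml
- name: Verify hubot responds to a chatops command
  args:
    executable: /bin/bash
  shell: |
    # The backoff loop lives inside the shell, where "attempt" is a runtime
    # variable; Jinja templating only happens once, before the first attempt.
    for attempt in 1 2 3 4 5; do
      timeout $(( 10 * attempt )) ./run-chatops-smoketest.sh && exit 0
      # double the wait on each retry, plus a little jitter
      sleep $(( 5 * 2 ** (attempt - 1) + RANDOM % 5 ))
    done
    exit 1
```

With attempt-scaled values like these, a run that passes on the first attempt behaves like the original values, and only retries wait longer.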

cognifloyd commented 4 years ago

With the version in 661b037:

https://travis-ci.org/StackStorm/ansible-st2/builds/635132037

- 1 build succeeded on the first try
- 7 builds triggered the retry logic
  - 5/7 succeeded
  - 2/7 failed

https://travis-ci.org/StackStorm/ansible-st2/builds/635132304

- 8 builds triggered the retry logic
  - 7/8 succeeded
  - 1/8 failed

So this is much more reliable. The latest commit b0f9695 extends the sleep (only during retries) to hopefully catch the last few builds that still hit the rate limit. With this extra bit of sleep, I think it's good enough. Hopefully we won't have to kick Travis as much.

Note that if a build succeeds on the first try, it behaves just as it does now (timeout 10 and sleep 5). Those times are only increased for retries.

cognifloyd commented 4 years ago

That is very disappointing. I seem to trigger these errors every time I try to rebase my PRs, and then they sit around with failed CI again.

The Travis auto-cancel feature has been problematic for me on other projects: it will only cancel builds that have NOT started building. So if nothing is running, you update a PR, it starts building, and then you update the PR again a few minutes later, that first build still has to complete before the next one starts. Auto-cancel might reduce some pressure, but I doubt it'll make much difference.

cognifloyd commented 4 years ago

Note that the Build History (triggered on a schedule or by merges to master) will only show failures if a PR gets pushed at the same time as one of those builds happens to run. Because those builds are infrequent, you will rarely see errors there; the parallel build pressure is really only an issue for PR authors. To see that history you need to look under "Pull Requests". Almost all of the failures in this screenshot are due to Slack rate limiting errors, not actual problems with the PRs:

*(screenshot: Travis "Pull Requests" build history)*

cognifloyd commented 4 years ago

It looks like you've lowered the max concurrent jobs from 4 to 2. :+1: Thanks! Hopefully jobs won't error on me so often.

nmaludy commented 4 years ago

@cognifloyd did you mean to commit the file `asdf`?

cognifloyd commented 4 years ago

> @cognifloyd did you mean to commit the file `asdf`?

LOL. no.

cognifloyd commented 4 years ago

I just rebased, removed the `asdf` file (oops), and added comments to explain how the backoff algorithm works.