coq / bot

A (Coq Development Team) bot written in OCaml
MIT License

Add backoff strategy to auto retry of failing CI jobs #289

Open erikmd opened 1 year ago

erikmd commented 1 year ago

Hi @Zimmi48,

FYI, despite the line at https://github.com/coq/bot/blob/master/src/actions.ml#L545, across the 49 jobs of docker-mathcomp (https://gitlab.inria.fr/math-comp/docker-mathcomp/-/pipelines/833107/) there is always a dozen or so jobs that fail and that we have to retry by hand; cf. the list of all jobs: https://gitlab.inria.fr/math-comp/docker-mathcomp/-/pipelines/833107/builds

Three questions:

  1. Can you verify that the rule `|| test "unexpected status code .*: 401 Unauthorized"` has indeed been used, according to the logs, e.g. after this job deploy_1_2.0.0-coq-8.16: https://gitlab.inria.fr/math-comp/docker-mathcomp/-/jobs/3279705
  2. If yes, what happens if coqbot has reached the retry threshold and we manually restart the job in GitLab CI? → I'd expect the retry count to be reinitialized, so that coqbot can retry anew as if no failure had occurred beforehand.
  3. In any case, I guess we could implement some exponential backoff strategy, maybe this algorithm?

With kind regards

Zimmi48 commented 1 year ago

I can indeed see a bunch of lines like this in the log:

```
Jul 11 21:28:30Z Failed job 3279753 of project 44938.
Jul 11 21:28:30Z Failure reason: runner_system_failure
Jul 11 21:28:30Z Runner failure reported by GitLab CI... Checking whether to retry the job.
Jul 11 21:28:30Z The job has been retried less than three times before (number of retries = 0). Retrying...
```

Then:

```
Jul 11 21:39:28Z Connectivity issue... Checking whether to retry the job.
Jul 11 21:39:29Z The job has been retried less than three times before (number of retries = 1). Retrying...
Jul 11 21:39:29Z Response code: 201.
```

Then:

```
Jul 11 22:13:39Z Connectivity issue... Checking whether to retry the job.
Jul 11 22:13:39Z The job has been retried less than three times before (number of retries = 2). Retrying...
Jul 11 22:13:40Z Response code: 201.
```

Then:

```
Jul 11 22:43:37Z Connectivity issue... Checking whether to retry the job.
Jul 11 22:43:38Z The job has been retried 3 times before. Not retrying.
```

> If yes, what happens if coqbot has reached the retry threshold and we manually restart the job in GitLab CI? → I'd expect the retry count to be reinitialized, so that coqbot can retry anew as if no failure had occurred beforehand.

The retry threshold is not reinitialized because it is a count that is provided by GitLab. It doesn't distinguish who did the retry. However, adopting an exponential backoff strategy based on the number of retries should be quite doable. In this case, we may be able to increase the maximum number of retries.

I don't think that it is necessary to write an algorithm as complex as the one proposed by Google though. We do have some exponential backoff strategies in place in other parts of the code already:

https://github.com/coq/bot/blob/77b3898f4b26cce477bbe41087a0095448fa9f23/src/actions.ml#L2594-L2621
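
For readers who do not want to chase the link, here is a minimal, self-contained sketch of that kind of retry loop (an illustration only, not the code behind the link above; the function name, the synchronous `Unix.sleepf`, and the parameters are assumptions made for the example):

```ocaml
(* Illustration only, not the linked coqbot code: retry [action], waiting
   base_delay * factor^attempt seconds between attempts, at most
   [max_retries] times. *)
let rec retry_with_backoff ?(attempt = 0) ~max_retries ~base_delay ~factor action =
  match action () with
  | Ok _ as ok -> ok
  | Error _ when attempt < max_retries ->
      Unix.sleepf (base_delay *. (factor ** float_of_int attempt));
      retry_with_backoff ~attempt:(attempt + 1) ~max_retries ~base_delay ~factor action
  | Error _ as err -> err
```

For instance, `retry_with_backoff ~max_retries:3 ~base_delay:5. ~factor:5. fetch` would wait 5 s, 25 s and 125 s between attempts before giving up.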

erikmd commented 1 year ago

Thanks @Zimmi48 for your feedback! 👍👍

> The retry threshold is not reinitialized because it is a count that is provided by GitLab.

OK, but then I don't see why it would be impossible to implement the feature I was suggesting:

assuming `max_number_of_retry = 3` (retry after 5s, 25s, 125s, then stop) and `count := 1` initially, one could have a predicate such that a manual retry relaunches a new series of exponential backoffs, e.g.:

```ocaml
let should_auto_retry count =
  (count - 1) mod (max_number_of_retry + 1) < max_number_of_retry
```

As a result:

| failed with count | should_auto_retry? |
| --- | --- |
| 1 | true |
| 2 | true |
| 3 | true |
| 4 | false (manual retry needed) |
| 5 | true |
| 6 | true |
| 7 | true |
| 8 | false |
| … | … |
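
For the record, a tiny standalone snippet (assuming `max_number_of_retry = 3`, as in the example above) that reproduces this table:

```ocaml
let max_number_of_retry = 3

let should_auto_retry count =
  (count - 1) mod (max_number_of_retry + 1) < max_number_of_retry

(* Prints: 1 true, 2 true, 3 true, 4 false, 5 true, 6 true, 7 true, 8 false *)
let () =
  List.iter
    (fun count -> Printf.printf "%d %b\n" count (should_auto_retry count))
    [1; 2; 3; 4; 5; 6; 7; 8]
```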

IMO, it would be great to be able to combine this modular-arithmetic idea with the (5s, 25s, 125s) backoff strategy you mention.

WDYT?

Zimmi48 commented 1 year ago

Relying on modular arithmetic for this purpose is really smart 😮 Yes, I think what you propose would work!

erikmd commented 1 year ago

Thanks @Zimmi48 !

BTW, while we are at it: even if we are sure there will be no "looping", maybe it would be safer to add another bound, to ensure that the total number of automatic retries by coqbot stays reasonable?

e.g. if `max_number_of_retry = 3` and `max_number_of_retry_series = 4`:

```ocaml
let should_auto_retry count =
  let c = count - 1 in
  let m = max_number_of_retry + 1 in
  let q, r = c / m, c mod m in
  r < max_number_of_retry && q < max_number_of_retry_series
```

but YMMV!
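
To make the combination with the (5s, 25s, 125s) delays concrete, here is one possible sketch (the constants and names are the ones assumed in this thread, not actual coqbot code): `should_auto_retry` above decides whether to retry at all, and this helper picks the delay within the current series:

```ocaml
let max_number_of_retry = 3

(* Delay (in seconds) before the [count]-th automatic retry:
   5 s, 25 s, 125 s within each series, restarting after a manual retry. *)
let backoff_delay count =
  let r = (count - 1) mod (max_number_of_retry + 1) in
  5. ** float_of_int (r + 1)
```

So `backoff_delay 1 = 5.`, `backoff_delay 2 = 25.`, `backoff_delay 3 = 125.`, and after a manual retry the next series starts again at 5 s.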

silene commented 11 months ago

Copied from Zulip:

It is great that you have exponential backoff, but you are missing an important piece. You need to make the sleep length random. For example, if your current backoff is 25 seconds, you should not sleep for 25 seconds; you should sleep between 25 and 25*2=50 seconds. Otherwise, your 49 failing jobs will all wake up at the exact same time, and thus they will fail again.
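
In code, this suggestion amounts to multiplying the computed delay by a random factor in [1, 2) before sleeping; a small sketch (the names are made up for the example):

```ocaml
(* Jittered backoff: instead of sleeping exactly [delay] seconds, sleep a
   random duration in [delay, 2 * delay) so that jobs failing at the same
   time do not all wake up at the same instant. *)
let jittered_delay delay = delay *. (1. +. Random.float 1.)

let () =
  Random.self_init ();
  (* e.g. with a 25 s backoff, this prints a value between 25 and 50. *)
  Printf.printf "sleeping for %.1f s\n" (jittered_delay 25.)
```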

erikmd commented 1 month ago

FTR, today (2024-07-02) in a 1-job pipeline, gitlab.inria.fr failed with:

https://gitlab.inria.fr/math-comp/docker-mathcomp/-/jobs/4509247

```
(...)
$ pwd
/builds/2mk6rsew/0/math-comp/docker-mathcomp
$ /usr/bin/env bash -e -c ' # collapsed multi-line command
bash
Error response from daemon: Get "https://registry-1.docker.io/v2/": received unexpected HTTP status: 502 Bad Gateway
```

I restarted the job manually.

erikmd commented 1 month ago

FYI @Zimmi48: @palmskog and I got new 502 errors at `docker pull` time with this message ↑

I believe this form of the error message is not covered in https://github.com/coq/bot/blob/53ef64c8e5930119d6fe885d5d4828ce68aceb2e/src/actions.ml#L510-L521

Do you think we should add `|| test "received unexpected HTTP status: 502"` or `|| test "received unexpected HTTP status: 502.*"` in coq/bot?

Edit: I opened a PR with a more general regexp.
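
For illustration only, here is a standalone guess at what a more general pattern could look like (using OCaml's Str library, compiled with `str`; this is not necessarily the regexp used in the PR, nor the matching mechanism used by coqbot):

```ocaml
(* Hypothetical broader pattern: match any unexpected 5xx HTTP status
   reported by the Docker daemon, not just 502 specifically. *)
let transient_registry_error =
  Str.regexp ".*received unexpected HTTP status: 5[0-9][0-9]"

let looks_transient line = Str.string_match transient_registry_error line 0

let () =
  assert
    (looks_transient
       "Error response from daemon: Get \"https://registry-1.docker.io/v2/\": \
        received unexpected HTTP status: 502 Bad Gateway")
```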