dlang / ci

dlang CI testing pipelines
https://buildkite.com/dlang
Boost Software License 1.0

Increase the number of retries (1->3) #444

Closed · Geod24 closed this pull request 3 years ago

Geod24 commented 3 years ago

CC @WalterBright

dlang-bot commented 3 years ago

Thanks for your pull request, @Geod24!

WalterBright commented 3 years ago

@Geod24 I hope this won't cause a retry if a test suite failure is actually caused by a bug and not a networking failure?

Geod24 commented 3 years ago

@WalterBright: That's the downside of it: it will.
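
For context, the change under discussion amounts to a blunt automatic-retry rule in the Buildkite pipeline configuration. A minimal sketch of what such a rule looks like in pipeline YAML (illustrative only; the step label and command are placeholders, not the exact diff in this PR):

```yaml
steps:
  - label: "run project test suite"
    command: "./run.sh"        # placeholder command
    retry:
      automatic:
        - exit_status: "*"     # retry on any failure, genuine bugs included
          limit: 3             # the 1 -> 3 bump discussed here
```

The downside mentioned above follows directly from `exit_status: "*"`: the rule cannot tell a flaky network apart from a real regression.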

WalterBright commented 3 years ago

The obvious question - can we get a proper fix?

PetarKirov commented 3 years ago

Given that even for a human it can be difficult to decide whether a bug is a Heisenbug or not, what would be the algorithm to determine that automatically? "Agent lost" and OOM should ideally be handled by BuildKite, but for other things it's hard to say.

MartinNowak commented 3 years ago

Fine by me; the bill for those runners is fairly small. I did lower it to 1 in the past since many PR problems were not intermittent, but some are, and human time is quite valuable.

Geod24 commented 3 years ago

@MartinNowak: Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well? I have a few servers that I would gladly use as permanent runners.
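
As a rough illustration of what contributor-run agents could look like, the official `buildkite/agent` Docker image only needs an agent token to start; a minimal sketch (the token and the `queue=community` tag are placeholders, and this is not the setup from the Dockerfile linked above):

```yaml
# docker-compose.yml (hypothetical sketch)
version: "3"
services:
  agent:
    image: buildkite/agent:3                       # official Buildkite agent image
    environment:
      BUILDKITE_AGENT_TOKEN: "insert-token-here"   # placeholder; issued per organization
      BUILDKITE_AGENT_TAGS: "queue=community"      # hypothetical tag to keep contributor runners on a separate queue
    restart: unless-stopped
```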

WalterBright commented 3 years ago

the algorithm

All networking errors would be a great first approximation.

PetarKirov commented 3 years ago

All networking errors would be a great first approximation.

Obviously, yes. The question is how to determine whether a failure is networking related. For example, IIRC some (all?) std.socket unit tests run on localhost, so internet access is not a prerequisite there. A build could fail because dub can't fetch a dependency, which could be caused either by code.dlang.org (and its mirrors) being down, or by the project being built looking for a non-existent version, etc. While a restart is likely to resolve the first cause, it is unlikely to help with the second one. The high-level idea is clear, but the implementation not so much, especially given that we're running the test suites of third-party projects. Also, IIRC, in the past several months, network-related problems were much rarer than, say, a mismatch between compiler and druntime versions occurring when the dlang/druntime project is built on BuildKite.

WalterBright commented 3 years ago

Over here https://github.com/dlang/dmd/pull/12409#issuecomment-818246599 the failure is:

CI agent stopped responding!

Surely that's detectable.
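
For what it's worth, Buildkite's automatic-retry rules can be scoped by exit status, and a lost or disconnected agent is reported as exit status -1, so a narrower rule that retries only that case is possible. A sketch, assuming the steps use the standard `retry.automatic` option (label and command are placeholders):

```yaml
steps:
  - label: "run project test suite"
    command: "./run.sh"      # placeholder command
    retry:
      automatic:
        - exit_status: -1    # Buildkite reports a lost/disconnected agent as -1
          limit: 3
```

What such a rule cannot do, as noted above, is classify ordinary non-zero exit codes (a network hiccup in dub versus a genuine test failure), which is where the blunt retry wins on simplicity.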

MartinNowak commented 3 years ago

There seems to be almost zero benefit to a smart retry over a blunt 3x retry; it won't even be noticeably faster. I'd suggest just sticking with the approach here instead of wasting time dealing with a huge error surface.

Over here dlang/dmd#12409 (comment) the failure is:

CI agent stopped responding!

IIRC there is a 5 min. wait-time for running jobs when downscaling agents. https://github.com/dlang/ci/blob/338cfde88e4aa162e5e8110dcc5f8c24e4539b01/ansible/roles/buildkite_agent/defaults/main.yml#L4 If the problem occurs often, we could bump that a bit if there are many long-running jobs.

@MartinNowak : Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well ? I have a few servers that I would gladly use as permanent runners.

What's the benefit of someone else running servers? Sounds nice in theory, but reliability on a heterogeneous infrastructure run by an uncoordinated group is likely to suffer.

Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well?

I guess a simpler dependency file might indeed help us to update the machines. Is this a real problem? Could try to find some time when possible, but cannot promise anything.

WalterBright commented 3 years ago

@MartinNowak thanks for the evaluation. I'll defer to your expertise in the matter!

MartinNowak commented 3 years ago

I guess a simpler dependency file might indeed help us to update the machines. Is this a real problem? Could try to find some time when possible, but cannot promise anything.

Any opinion on whether this is an actual problem @Geod24?

Geod24 commented 3 years ago

@MartinNowak: The lack of machines has definitely hit us in the past. Sometimes there are no agents running for a noticeable amount of time, although I don't recall it ever being more than an hour. I wasn't overly bothered by it because I just hit the retry button, but @WalterBright was.

Geod24 commented 3 years ago

Something that is a bit more lacking is the ability for projects to control their dependencies. With the changes we're seeing in the CI ecosystem (Travis disappearing, GitHub Actions rising), I was hoping we could leverage the GitHub runners to simplify our current pipeline. That could theoretically make it easier for core contributors to run agents, too. I know that the lack of control over dependencies has prevented me from adding our projects here.

MartinNowak commented 3 years ago

Something that is a bit more lacking is the ability for projects to control their dependencies. With the changes we're seeing in the CI ecosystem (Travis disappearing, GitHub Actions rising), I was hoping we could leverage the GitHub runners to simplify our current pipeline. That could theoretically make it easier for core contributors to run agents, too. I know that the lack of control over dependencies has prevented me from adding our projects here.

Indeed, we could rebuild the service in GitHub Actions :+1:. It might be more accessible for everyone, though it would require some additional setup time (hopefully fine). Not sure how long their free open-source CI will last; I'd guess a while with MSFT's current strategy.
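
For illustration, a minimal GitHub Actions workflow for a D project could look like the sketch below. It assumes the community `dlang-community/setup-dlang` action, and the commented-out `runs-on` line marks where contributor machines would plug in as self-hosted runners; none of this is an actual migration plan:

```yaml
# .github/workflows/ci.yml (illustrative sketch only)
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    # runs-on: [self-hosted, linux]   # contributor machines registered as self-hosted runners
    steps:
      - uses: actions/checkout@v2
      - uses: dlang-community/setup-dlang@v1
        with:
          compiler: dmd-latest        # or ldc-latest, etc.
      - run: dub test
```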