JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

CI: win32 worker exits with non-zero error code shortly after being started #54096

Closed by d-netto 2 months ago

d-netto commented 5 months ago

Saw this on https://buildkite.com/julialang/julia-master/builds/35570#018ee359-5c7a-43fc-8309-a24a571e8a38 and https://buildkite.com/julialang/julia-master/builds/35570#018ee395-c4e2-4153-9db3-9fe9822e3bff.

Not sure if it's transient.

giordano commented 5 months ago

We've had that for months now

d-netto commented 3 months ago

Any updates on this?

Saw it happening again on https://buildkite.com/julialang/julia-master/builds/37743#019056c0-be81-462a-8e83-bce634b93f28.

DilumAluthge commented 3 months ago

IIRC, @staticfloat and others have spent a lot of time looking into this, and so far we still don't know what the underlying problem is.

In the short term, the workaround is likely to be manually retrying that job when it fails.

d-netto commented 3 months ago

Thanks for the clarification.

DilumAluthge commented 3 months ago

Another workaround that I think would be nice to implement:

If a Windows job fails, and the runtime of the job was <= 60 seconds, automatically retry the job, up to a maximum of N total tries (for a reasonable value of N). However, if a Windows job fails, and the runtime of the job was > 60 seconds, then don't retry the job.

The hard part (the part that I don't know how to implement) is gating the auto-retry on the job duration, because we don't want to unconditionally retry all failed Windows jobs, just the short ones.
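To make the proposed rule concrete, here is a minimal sketch of the retry decision as a pure function. It is not tied to any CI system's actual retry API. The `JobResult` type, its field names, and the default of 3 total tries are assumptions made for illustration. The thread only specifies the 60-second cutoff and "a reasonable value of N".

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    """Hypothetical summary of one finished CI job attempt."""
    platform: str           # e.g. "windows" or "linux"
    exit_code: int          # 0 means success
    runtime_seconds: float  # wall-clock duration of this attempt
    attempt: int            # 1 for the first try, 2 for the first retry, ...


def should_retry(job: JobResult,
                 max_tries: int = 3,  # assumed value of N
                 short_run_threshold: float = 60.0) -> bool:
    """Retry only failed Windows jobs that died within the threshold."""
    if job.exit_code == 0:
        return False  # nothing to retry
    if job.platform != "windows":
        return False  # the workaround targets Windows workers only
    if job.runtime_seconds > short_run_threshold:
        return False  # long-running failures are likely real failures
    return job.attempt < max_tries  # cap the total number of tries
```

The open question from the comment above remains where such a check could run, since it needs the job's duration at the moment the retry decision is made.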

Keno commented 2 months ago

I don't know where this was written down, but the next step on this issue was to run peflags -v bash.exe in our Windows images and see whether high-entropy-va is set.
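For readers without peflags at hand, the same bit can be read directly from the PE header. The sketch below is an illustration, not the check that was actually run on the images. It relies on the documented PE layout: the PE header offset is stored at 0x3C, and DllCharacteristics sits 70 bytes into the optional header for both PE32 and PE32+. The function name is hypothetical.

```python
import struct

# Documented PE flag: image can handle a high-entropy 64-bit address space.
IMAGE_DLLCHARACTERISTICS_HIGH_ENTROPY_VA = 0x0020


def has_high_entropy_va(pe_bytes: bytes) -> bool:
    """Return True if the high-entropy-va flag is set in a PE image."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew: file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # Skip the 4-byte signature and the 20-byte COFF file header.
    optional_header = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", pe_bytes, optional_header)
    if magic not in (0x10B, 0x20B):  # PE32 or PE32+
        raise ValueError("unexpected optional header magic")
    # DllCharacteristics is at offset 70 in both PE32 and PE32+.
    (dll_characteristics,) = struct.unpack_from(
        "<H", pe_bytes, optional_header + 70)
    return bool(dll_characteristics & IMAGE_DLLCHARACTERISTICS_HIGH_ENTROPY_VA)
```

It would be called on the raw file contents, for example the bytes read from bash.exe opened in binary mode.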

Keno commented 2 months ago

Ah, we did look into it. Should have been fixed by https://github.com/JuliaCI/rootfs-images/pull/250.

Keno commented 2 months ago

We still have more intermittent Windows issues, but let's open new issues for those to segregate failure logs after that change.