Open tgross35 opened 3 months ago
Hi @tgross35, thank you for bringing this issue to us. We are looking into it and will update you after investigating.
Thank you for the response. If you need to watch active jobs, there is always one running at https://github.com/rust-lang-ci/rust/actions (mostly the auto branch). x86_64-msvc-ext, dist-x86_64-msvc-alt, and dist-x86_64-msvc all fail commonly, usually between 1 and 2 hours after job start. -ext seems to be the most common failure.
There are also ongoing experiments to run the jobs multiple times with different tweaks and see what fails, e.g. https://github.com/rust-lang/rust/pull/129504 and https://github.com/rust-lang/rust/pull/129522.
@vidyasagarnimmagaddi is there something we could do to debug this better? Our failure rate is currently over 50% due to this issue.
Somebody was able to confirm that we encounter this issue even running CI on an older state of our repo (from before this problem was noticed), which does seem to indicate it is caused by a change to the runner environment rather than changes to our code.
@tgross35 - Sure, we will update you shortly with a workaround or solution for the issue.
@ijunaidm Thanks! I'm one of the people working on this on the Rust side. Another data point: I've never been able to reproduce this on the windows-2022 runner. I've only been able to do it on the windows-2022-8core-32gb runner. I don't know if that is just because of performance (windows-2022 might be too slow to trip whatever race condition), or if it is fundamental to differences in the image. One thing I noticed is that windows-2022-8core-32gb uses the C: drive whereas windows-2022 uses the D: drive. I'm not sure if that is relevant at all.
@ijunaidm are there any updates here, or are you able to help us debug in some way (e.g. provide a way to SSH into active runners)? We were forced to switch to the small runners, which seems to make this issue less prevalent (though still very common), but we need to move back to the large runners at some point.
@tgross35 - Sorry, I will update you shortly on this issue.
Hi @tgross35 - I was going through the runner logs via the link provided and found that the workflow is attempting to delete "miri.exe", which might be busy when this attempt is being made.
If possible, exclude the deletion step from the workflow, since GitHub provisions new runners for every workflow run. Otherwise, make the pipeline wait to ensure that miri.exe is not busy, which can be another cause of access denied on deletion.
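The "wait until the file is not busy" suggestion could be sketched roughly as a retry loop with backoff around the delete call. This is only an illustration: `remove_file_with_retry`, `max_attempts`, and `initial_delay` are hypothetical names and parameters, not anything taken from the rust-lang/rust build system, and a retry like this would at best mask a transient lock rather than explain what is holding the file.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Try to remove `path`, retrying when the OS reports a permission error.
///
/// Hypothetical helper: `max_attempts` and `initial_delay` are illustrative
/// parameters. At least one attempt is always made.
pub fn remove_file_with_retry(
    path: &Path,
    max_attempts: u32,
    initial_delay: Duration,
) -> io::Result<()> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match fs::remove_file(path) {
            Ok(()) => return Ok(()),
            // A file that is already gone counts as success.
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
            // On Windows, "Access is denied (os error 5)" is reported as
            // PermissionDenied. Retry only while attempts remain.
            Err(e)
                if e.kind() == io::ErrorKind::PermissionDenied
                    && attempt < max_attempts =>
            {
                thread::sleep(delay);
                // Double the wait each time (exponential backoff).
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            // Any other error, or the final failed attempt, is returned as-is.
            Err(e) => return Err(e),
        }
    }
}
```

A call such as `remove_file_with_retry(Path::new("miri.exe"), 5, Duration::from_millis(100))` would wait 100 ms, 200 ms, 400 ms, and 800 ms between its five attempts before giving up with the last error.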
@subir0071 thank you for reaching out.
> If possible exclude the deletion step from the workflow as for every new workflow run Github provisions new runners.
We create this file during the run and later need to delete it, so eliminating this step is not possible. Several people from rust-lang have already tried workarounds, including:
Further, this issue doesn't always take the same form; sometimes it is failure to remove a different file, sometimes it is failure to open files.
> Otherwise, make the pipeline wait to ensure the miri.exe is not busy, which can be another cause of access denied on deletion.
This is our main hypothesis for what is going on, but it doesn't seem to be us holding the file (meaning maybe there is some indexing, antivirus, or monitoring that was added within the past few months causing the problem). We have exhausted everything possible to figure out why miri.exe would be busy. The PRs linked above have some attempts at this, see also e.g. https://github.com/rust-lang/rust/issues/127883#issuecomment-2336535344.
We cannot really do any further debugging here without GitHub's help. We need a way to do something like SSH into one of the runner images or create a kernel dump, or have somebody from GitHub do this for us while we run jobs. Is this possible?
@jasagredo, I think this will look familiar to you.
@geekosaur Not exactly, our issue was with permissions in read-only files that prevented Windows from deleting them. This seems to be related to Windows holding the lock for a file in use for too long.
@tgross35
> SSH into one of the runner images
Maybe it is not very useful, but there is https://github.com/mxschmitt/action-tmate, which allows you to SSH into a runner, although I'm not sure how elevated the permissions are there, so you might not be able to dump protected stuff.
Description
For the past few months, the rust-lang/rust project has had a lot of spurious failures on the Windows runners. These are typically either failure to open a file (mostly from link.exe) or failure to remove a file:
LINK : fatal error LNK1104: cannot open file ...
error: failed to remove file ..., Access is denied (os error 5)
Example run: https://github.com/rust-lang-ci/rust/actions/runs/10537107932/job/29198090275
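For triaging job logs, the two error signatures quoted above could be matched with something like the following minimal sketch. The names `SpuriousFailure`, `classify_line`, and `count_failures` are hypothetical, and the substrings are taken only from the two example messages in this description, so real logs may need looser or additional patterns.

```rust
/// The two failure signatures quoted in the description.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SpuriousFailure {
    /// `LINK : fatal error LNK1104: cannot open file ...`
    LinkCannotOpen,
    /// `error: failed to remove file ..., Access is denied (os error 5)`
    RemoveAccessDenied,
}

/// Classify a single log line, returning `None` for unrelated lines.
pub fn classify_line(line: &str) -> Option<SpuriousFailure> {
    if line.contains("fatal error LNK1104") {
        Some(SpuriousFailure::LinkCannotOpen)
    } else if line.contains("failed to remove file") && line.contains("(os error 5)") {
        Some(SpuriousFailure::RemoveAccessDenied)
    } else {
        None
    }
}

/// Count both kinds of failure in a whole log.
/// Returns `(link_failures, remove_failures)`.
pub fn count_failures(log: &str) -> (usize, usize) {
    let mut link = 0;
    let mut remove = 0;
    for line in log.lines() {
        match classify_line(line) {
            Some(SpuriousFailure::LinkCannotOpen) => link += 1,
            Some(SpuriousFailure::RemoveAccessDenied) => remove += 1,
            None => {}
        }
    }
    (link, remove)
}
```

Running `count_failures` over the downloaded logs of several jobs would give a rough split between the two failure forms, which may help when comparing runner types.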
Is it possible that something changed that would cause this? Even if not and this is a problem with our tooling, we could use assistance debugging.
Further context, links to failed jobs, and attempts to reproduce are at https://github.com/rust-lang/rust/issues/127883. Almost every PR showing up in the mentions list is from one of these failures. These errors are similar to what was reported in https://github.com/actions/runner-images/issues/4086.
Cc @ChrisDenton and @ehuss who have been working to reproduce this.
Platforms affected
Runner images affected
Image version and build link
Is it regression?
Yes, starting around 2024-06-27, but the exact start is unknown. It has seemingly gotten significantly worse in the past week or so; that job has had at least a 25% failure rate from this issue in the past couple of days (probably close to 50%).
Expected behavior
Accessing or removing the files should succeed.
Actual behavior
The file operations are encountering spurious failures, as linked above.
Repro steps
No known consistent reproduction.