actions / runner

The Runner for GitHub Actions :rocket:
https://github.com/features/actions
MIT License
4.89k stars 959 forks source link

GitHub self-hosted Windows runners randomly fail to clean up `_actions` directory #2687

Open huydhn opened 1 year ago

huydhn commented 1 year ago

Describe the bug GitHub self-hosted Windows runners randomly fail to clean up _actions directory in the Set up job step with the following error: The process cannot access the file 'C:\actions-runner\_work\_actions\...' because it is being used by another process

To Reproduce This error happens randomly and I couldn't reproduce it reliably. On the other hand, it happens daily on our https://github.com/pytorch/pytorch repo CI, for example https://github.com/pytorch/pytorch/actions/runs/5514008233/jobs/10052783742

Expected behavior Set up job step completes successfully. As far as I know, there is no external process locking the directories used by the runner and Windows Defender has been removed, so I want to check if it could be something coming from the runner itself as the issue only manifests in the past few months.

Runner Version and Platform

Version of your runner? 2.305.0

OS of the machine running the runner? Windows

Windows git version? git version 2.41.0.windows.1

What's not working?

Set up job step randomly fail to clean up _actions directory on Windows runners. The exception comes from this step in the worker https://github.com/actions/runner/blob/main/src/Runner.Worker/ActionManager.cs#L84-L87

Job Log Output

2023-07-11T17:38:25.9850443Z Requested labels: windows.4xlarge.nonephemeral
2023-07-11T17:38:25.9850651Z Job defined at: pytorch/pytorch/.github/workflows/_win-test.yml@refs/heads/main
2023-07-11T17:38:25.9850908Z Reusable workflow chain:
2023-07-11T17:38:25.9851090Z pytorch/pytorch/.github/workflows/trunk.yml@refs/heads/main (49a2b72927019ccd9b4ec5a4c77defc874373daf)
2023-07-11T17:38:25.9851289Z -> pytorch/pytorch/.github/workflows/_win-test.yml@refs/heads/main (49a2b72927019ccd9b4ec5a4c77defc874373daf)
2023-07-11T17:38:25.9851467Z Waiting for a runner to pick up this job...
2023-07-11T17:38:26.1887973Z Job is about to start running on the runner: i-058cf80bd552786a9 (organization)
2023-07-11T17:38:30.0613019Z Current runner version: '2.305.0'
2023-07-11T17:38:30.0618740Z Runner name: 'i-058cf80bd552786a9'
2023-07-11T17:38:30.0619289Z Runner group name: 'Default'
2023-07-11T17:38:30.0619819Z Machine name: 'EC2AMAZ-6LOQP3S'
2023-07-11T17:38:30.0621954Z ##[group]GITHUB_TOKEN Permissions
2023-07-11T17:38:30.0622458Z Actions: write
2023-07-11T17:38:30.0622680Z Checks: write
2023-07-11T17:38:30.0622984Z Contents: write
2023-07-11T17:38:30.0623255Z Deployments: write
2023-07-11T17:38:30.0623502Z Discussions: write
2023-07-11T17:38:30.0623923Z Issues: write
2023-07-11T17:38:30.0624207Z Metadata: read
2023-07-11T17:38:30.0624487Z Packages: write
2023-07-11T17:38:30.0624764Z Pages: write
2023-07-11T17:38:30.0625165Z PullRequests: write
2023-07-11T17:38:30.0625462Z RepositoryProjects: write
2023-07-11T17:38:30.0625721Z SecurityEvents: write
2023-07-11T17:38:30.0625994Z Statuses: write
2023-07-11T17:38:30.0626300Z ##[endgroup]
2023-07-11T17:38:30.0629377Z Secret source: Actions
2023-07-11T17:38:30.0629863Z Prepare workflow directory
2023-07-11T17:38:31.7674286Z ##[error]The process cannot access the file 'C:\actions-runner\_work\_actions\pytorch\pytorch\main\aten\src\ATen\native\quantized\cpu\qnnpack\wrappers\u8rmax' because it is being used by another process.

Runner and Worker's Diagnostic Logs

There are some related worker logs:

[2023-07-11 17:38:29Z INFO JobRunner] Job ID 8a40b056-4605-5299-54fb-5b28f1300fcc
[2023-07-11 17:38:29Z INFO JobRunner] Creating job server with URL: https://pipelines.actions.githubusercontent.com/mBh68xKhi8LyM7tp3vECvYXNFvuV4gyVGgmYCteuEZP9JH92QN/
[2023-07-11 17:38:29Z INFO ConfigurationStore] currentAssemblyLocation: C:\actions-runner\bin\Runner.Worker.dll
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO ConfigurationStore] binPath: C:\actions-runner\bin
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO ConfigurationStore] RootFolder: C:\actions-runner
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known config file 'Runner': 'C:\actions-runner\.runner'
[2023-07-11 17:38:29Z INFO ConfigurationStore] ConfigFilePath: C:\actions-runner\.runner
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known config file 'Credentials': 'C:\actions-runner\.credentials'
[2023-07-11 17:38:29Z INFO ConfigurationStore] CredFilePath: C:\actions-runner\.credentials
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known config file 'MigratedCredentials': 'C:\actions-runner\.credentials_migrated'
[2023-07-11 17:38:29Z INFO ConfigurationStore] MigratedCredFilePath: C:\actions-runner\.credentials_migrated
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known config file 'Service': 'C:\actions-runner\.service'
[2023-07-11 17:38:29Z INFO ConfigurationStore] ServiceConfigFilePath: C:\actions-runner\.service
[2023-07-11 17:38:29Z INFO ConfigurationStore] Read setting file: 297 chars
[2023-07-11 17:38:29Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2023-07-11 17:38:29Z INFO Worker] Listening for cancel message from the channel.
[2023-07-11 17:38:29Z INFO Worker] Waiting for the job to complete or for a cancel message from the channel.
[2023-07-11 17:38:29Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2023-07-11 17:38:29Z INFO JobServer] Creating websocket client ...wss://pipelines.actions.githubusercontent.com/mBh68xKhi8LyM7tp3vECvYXNFvuV4gyVGgmYCteuEZP9JH92QN/00000000-0000-0000-0000-000000000000/_apis/distributedtask/hubs/Actions/plans/46a5125b-d751-48e5-b5c3-aa974757c8e3/timelines/46a5125b-d751-48e5-b5c3-aa9
74757c8e3/records/8a40b056-4605-5299-54fb-5b28f1300fcc/feedstream?api-version=6.0-preview
[2023-07-11 17:38:29Z INFO JobServer] Attempting to start websocket client with delay 00:00:00.
[2023-07-11 17:38:29Z INFO JobServerQueue] Start process web console line queue.
[2023-07-11 17:38:29Z INFO JobServerQueue] Start process file upload queue.
[2023-07-11 17:38:29Z INFO JobServerQueue] Start results file upload queue.
[2023-07-11 17:38:29Z INFO JobServerQueue] Starting results-based upload queue...
[2023-07-11 17:38:29Z INFO JobServerQueue] Start process timeline update queue.
[2023-07-11 17:38:29Z INFO ExecutionContext] Initialize GitHub context
[2023-07-11 17:38:29Z INFO ExecutionContext] Initialize Env context
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Diag': 'C:\actions-runner\_diag'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Diag': 'C:\actions-runner\_diag'
[2023-07-11 17:38:29Z INFO JobRunner] Starting the job execution context.
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Work': 'C:\actions-runner\_work'
[2023-07-11 17:38:29Z INFO JobRunner] Validating directory permissions for: 'C:\actions-runner\_work'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Work': 'C:\actions-runner\_work'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Tools': 'C:\actions-runner\_work\_tool'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Work': 'C:\actions-runner\_work'
[2023-07-11 17:38:29Z INFO HostContext] Well known directory 'Temp': 'C:\actions-runner\_work\_temp'
[2023-07-11 17:38:30Z INFO JobServer] Successfully started websocket client.
[2023-07-11 17:38:30Z INFO JobRunner] Getting job extension.
[2023-07-11 17:38:30Z INFO JobRunner] Initialize job. Getting all job steps.
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Diag': 'C:\actions-runner\_diag'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Diag': 'C:\actions-runner\_diag'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known config file 'Credentials': 'C:\actions-runner\.credentials'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known config file 'SetupInfo': 'C:\actions-runner\.setup_info'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Work': 'C:\actions-runner\_work'
[2023-07-11 17:38:30Z INFO PipelineDirectoryManager] Loading tracking config if exists: C:\actions-runner\_work\_PipelineMapping\pytorch/pytorch\PipelineFolder.json
[2023-07-11 17:38:30Z INFO PipelineDirectoryManager] Updating job run properties.
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Work': 'C:\actions-runner\_work'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Work': 'C:\actions-runner\_work'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Work': 'C:\actions-runner\_work'
[2023-07-11 17:38:30Z INFO JobExtension] Downloading actions
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Work': 'C:\actions-runner\_work'
[2023-07-11 17:38:30Z INFO HostContext] Well known directory 'Actions': 'C:\actions-runner\_work\_actions'
[2023-07-11 17:38:31Z INFO JobServerQueue] Try to append 1 batches web console lines for record 'd706d174-da6f-49c9-8556-c033464977d9', success rate: 1/1.
[2023-07-11 17:38:31Z INFO JobServerQueue] Job timeline record has been updated for the first time.
[2023-07-11 17:38:31Z ERR  JobExtension] Caught exception from JobExtension Initialization: System.IO.IOException: The process cannot access the file 'C:\actions-runner\_work\_actions\pytorch\pytorch\main\aten\src\ATen\native\quantized\cpu\qnnpack\wrappers\u8rmax' because it is being used by another process.
   at System.IO.FileSystem.RemoveDirectoryInternal(String fullPath, Boolean topLevel, Boolean allowDirectoryNotEmpty)
   at System.IO.FileSystem.RemoveDirectory(String fullPath, Boolean recursive)
   at System.IO.DirectoryInfo.Delete()
   at GitHub.Runner.Sdk.IOUtil.DeleteDirectory(String path, Boolean contentsOnly, Boolean continueOnContentDeleteError, CancellationToken cancellationToken)
   at GitHub.Runner.Worker.ActionManager.PrepareActionsAsync(IExecutionContext executionContext, IEnumerable`1 steps, Guid rootStepId)
   at GitHub.Runner.Worker.JobExtension.InitializeJob(IExecutionContext jobContext, AgentJobRequestMessage message)
[2023-07-11 17:38:31Z INFO ExecutionContext] Publish step telemetry for current step {
  "action": "setup_job",
  "type": "runner",
  "stage": "Pre",
  "stepId": "d706d174-da6f-49c9-8556-c033464977d9",
  "result": "failed",
  "errorMessages": [
    "The process cannot access the file 'C:\\actions-runner\\_work\\_actions\\pytorch\\pytorch\\main\\aten\\src\\ATen\\native\\quantized\\cpu\\qnnpack\\wrappers\\u8rmax' because it is being used by another process."
  ],
  "executionTimeInSeconds": 2,
  "startTime": "2023-07-11T17:38:30.0590734Z",
  "finishTime": "2023-07-11T17:38:31.7718078Z"
}.
[2023-07-11 17:38:31Z ERR  JobRunner] Job initialize failed.
[2023-07-11 17:38:31Z ERR  JobRunner] Caught exception from InitializeJob: System.IO.IOException: The process cannot access the file 'C:\actions-runner\_work\_actions\pytorch\pytorch\main\aten\src\ATen\native\quantized\cpu\qnnpack\wrappers\u8rmax' because it is being used by another process.
   at System.IO.FileSystem.RemoveDirectoryInternal(String fullPath, Boolean topLevel, Boolean allowDirectoryNotEmpty)
   at System.IO.FileSystem.RemoveDirectory(String fullPath, Boolean recursive)
   at System.IO.DirectoryInfo.Delete()
   at GitHub.Runner.Sdk.IOUtil.DeleteDirectory(String path, Boolean contentsOnly, Boolean continueOnContentDeleteError, CancellationToken cancellationToken)
   at GitHub.Runner.Worker.ActionManager.PrepareActionsAsync(IExecutionContext executionContext, IEnumerable`1 steps, Guid rootStepId)
   at GitHub.Runner.Worker.JobExtension.InitializeJob(IExecutionContext jobContext, AgentJobRequestMessage message)
   at GitHub.Runner.Worker.JobRunner.RunAsync(AgentJobRequestMessage message, CancellationToken jobRequestCancellationToken)
[2023-07-11 17:38:31Z INFO JobRunner] Shutting down the job server queue.
[2023-07-11 17:38:31Z INFO JobServerQueue] Fire signal to shutdown all queues.
[2023-07-11 17:38:32Z INFO JobServerQueue] All queue process task stopped.
[2023-07-11 17:38:32Z INFO JobServerQueue] Try to append 1 batches web console lines for record 'd706d174-da6f-49c9-8556-c033464977d9', success rate: 1/1.
[2023-07-11 17:38:32Z INFO JobServerQueue] Web console line queue drained.
[2023-07-11 17:38:32Z INFO JobServerQueue] Uploading 2 files in one shot.
[2023-07-11 17:38:32Z INFO JobServerQueue] Try to upload 2 log files or attachments, success rate: 2/2.
[2023-07-11 17:38:32Z INFO JobServerQueue] File upload queue drained.
[2023-07-11 17:38:32Z INFO JobServerQueue] Starting results-based upload queue...
[2023-07-11 17:38:32Z INFO JobServerQueue] Results upload queue drained.
[2023-07-11 17:38:32Z INFO JobServerQueue] Timeline update queue drained.
[2023-07-11 17:38:32Z INFO JobServerQueue] Disposing job server ...
[2023-07-11 17:38:32Z INFO JobServerQueue] Disposing results server ...
[2023-07-11 17:38:32Z INFO JobServerQueue] All queue process tasks have been stopped, and all queues are drained.
[2023-07-11 17:38:32Z INFO TempDirectoryManager] Cleaning runner temp folder: C:\actions-runner\_work\_temp
[2023-07-11 17:38:32Z INFO HostContext] Well known directory 'Bin': 'C:\actions-runner\bin'
[2023-07-11 17:38:32Z INFO HostContext] Well known directory 'Root': 'C:\actions-runner'
[2023-07-11 17:38:32Z INFO HostContext] Well known directory 'Diag': 'C:\actions-runner\_diag'
[2023-07-11 17:38:32Z INFO HostContext] Well known config file 'Telemetry': 'C:\actions-runner\_diag\.telemetry'
[2023-07-11 17:38:32Z INFO JobRunner] Raising job completed event
[2023-07-11 17:38:33Z INFO Worker] Job completed.
nikola-jokic commented 1 year ago

Hey @huydhn,

Can you please inspect if something is still holding a lock on that file before you do a cleanup? Since this is a system exception, unless the runner is holding a lock and trying to delete the file in parallel (I'm almost certain that is not the case), there must be something not releasing the file on time causing this error. And since this is not happening always, there might be a subprocess releasing a lock on the file after the command returns.

huydhn commented 1 year ago

Thank you for looking into this. While we know for sure that a process is holding a lock to that folder, it's very tricky to know its identity. So I hope to be able to gather more information via the runner:

Do you know if there is a way to get more information from the runner log about the exception? Or may be adding retrying support for this step to make it more resilient? I'm also curious to know if other people encounter the same issue.

nikola-jokic commented 1 year ago

From the runner log, I don't think we can know which process is holding a lock. We would have to inspect the processes. However, you can try to write a powershell script, maybe to figure out which process is holding a lock on the file. Retries sounds reasonable for this situation, but then I don't know if it would be a right solution. At this point, everything should exit, and failure to delete a file can indicate that some process is being left running. Regardless, I will bring this question up in the meeting in case I'm wrong ☺️

Blackhex commented 1 year ago

Certainly it won't be a fix to the root case (the process holding the lock), but can you please confirm or decline whether the runners are running in an empemeral way?  I.e. in the complete isolation, namely, whether the workspace of one run is completely different folder than the workspace directory of an another run (even if it is started after the first finishes)? Running the runners in complete isolation could the mitigate impact of the issue on the other runs.

nikola-jokic commented 1 year ago

Hey @Blackhex,

It will re-use the same directory based on the repository path

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 15 days.