bazelbuild / bazel-buildfarm

Bazel remote caching and execution service
https://bazel.build
Apache License 2.0

buildfarm-worker on Windows Server 2022 fails to clean up operation files #1744

Closed mikalailapko closed 3 weeks ago

mikalailapko commented 1 month ago

buildfarm-worker with a simple config (below) throws Java exceptions when it fails to delete temporary operation files. In my example, I had a simple hello-world C++ project that queues 16 operations on a remote build. The first two run and clean up fine; the ones after them also run fine, but the worker is unable to delete their temporary files:

[SEVERE ] build.buildfarm.worker.ReportResultStage after - error destroying exec dir \tmp\worker\shard\operations\75362898-d9d2-4854-a91d-cc3b9cde09c2
java.nio.file.AccessDeniedException: \tmp\worker\shard\operations\75362898-d9d2-4854-a91d-cc3b9cde09c2\external\clang_toolchain-win64~~clang_toolchain_win64_files_ext~clang_toolchain-win64_files\win64\bin\clang++.exe
        at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89)
        at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
        at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108)
        at java.base/sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:275)
        at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
        at java.base/java.nio.file.Files.delete(Files.java:1152)
        at build.buildfarm.common.io.Directories$1.visitFile(Directories.java:126)
        at build.buildfarm.common.io.Directories$1.visitFile(Directories.java:114)
        at java.base/java.nio.file.Files.walkFileTree(Files.java:2811)
        at java.base/java.nio.file.Files.walkFileTree(Files.java:2882)
        at build.buildfarm.common.io.Directories.remove(Directories.java:112)
        at build.buildfarm.worker.shard.CFCExecFileSystem.destroyExecDir(CFCExecFileSystem.java:531)
        at build.buildfarm.worker.shard.ShardWorkerContext.destroyExecDir(ShardWorkerContext.java:803)
        at build.buildfarm.worker.ReportResultStage.after(ReportResultStage.java:203)
        at build.buildfarm.worker.PipelineStage.iterate(PipelineStage.java:160)
        at build.buildfarm.worker.PipelineStage.runInterruptible(PipelineStage.java:51)
        at build.buildfarm.worker.PipelineStage.run(PipelineStage.java:64)
        at java.base/java.lang.Thread.run(Thread.java:833)

As pointed out by @werkt, this might be because none of the links to an inode seem to be deletable while any process has the file open (for execute, in this case). Worker config, just in case:

backplane:
  redisUri: "reachableredis"
  queues:
    - name: "linux_x86_64"
      allowUnmatched: false
      properties:
        - name: "platform"
          value: "linux_x86_64"
    - name: "windows_x86_64"
      allowUnmatched: false
      properties:
        - name: "platform"
          value: "windows_x86_64"
    - name: "mac_arm64"
      allowUnmatched: false
      properties:
        - name: "platform"
          value: "mac_arm64"
server:
  publicName: "grpcs://reachablebuildfarmserver"
  port: 443
worker:
  linkInputDirectories: false
  execOwner: "Administrator"
  publicName: "reachableworker"
  port: 8981
  dequeueMatchSettings:
    allowUnmatched: false
    properties:
      - name: "platform"
        value: "windows_x86_64"

Setting linkInputDirectories to false was suggested by @werkt, but it didn't help. Pretty much the same config works fine on Linux workers for the same project.

Attached is a procmon logfile that shows Windows file operations filtered by clang++.exe, from start (before queuing the build) to finish (when the server's Windows task queue reaches 0). No other process is accessing the files, and the worker runs as Administrator. Logfile.CSV

werkt commented 1 month ago

Please give the branch https://github.com/werkt/bazel-buildfarm/tree/copy-exec-fs a try, with the following config:

worker:
  linkExecFileSystem: false

If this works, I'll flesh it out into a mergeable PR and make it available, with recommendations for Windows.

mikalailapko commented 1 month ago

Thanks a lot @werkt, this option did help! I'm not seeing any file deletion errors with it indeed.

werkt commented 3 weeks ago

Closing this, as we're leaving linkExecFileSystem: false as the mitigation on Windows (it should probably be the default there).
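
For reference, the mitigation can be sketched as the reporter's worker block from above with the one extra setting added. This is only a sketch: every other value is copied unchanged from the original report, and whether the option is available depends on running a build that includes the copy-exec-fs change. The comment on what the option does is an assumption inferred from the branch name, not something confirmed in this thread.

```yaml
worker:
  # Mitigation from this thread. Assumption (inferred from the branch name
  # copy-exec-fs): inputs are copied into each exec dir instead of linked.
  linkExecFileSystem: false
  # Kept from the original report; on its own it did not fix the errors.
  linkInputDirectories: false
  execOwner: "Administrator"
  publicName: "reachableworker"
  port: 8981
  dequeueMatchSettings:
    allowUnmatched: false
    properties:
      - name: "platform"
        value: "windows_x86_64"
```

The backplane and server sections are unaffected and stay as in the original config.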