actions/cache

Cache dependencies and build outputs in GitHub Actions

Cache restoration is orders of magnitude slower on Windows compared to Linux. #752

Open Adnn opened 2 years ago

Adnn commented 2 years ago

We are using this action to cache Conan packages.

Our current cache is ~300 MB on both Linux (ubuntu-20.04) and Windows (windows-2019). Sadly, where the cache step routinely takes ~10 seconds on Linux, it oftentimes takes ~5 minutes on Windows.

This makes iterations frustrating: since the rest of the workflow takes about 2 minutes to complete, we get a global 3x time penalty because of this.

As far as I can tell, the archive retrieval time is comparable, it really is the cache un-archiving which seems to take very long on Windows.
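
For context, here is a minimal sketch of the kind of cache step we use; the path and key below are illustrative, not our exact configuration:

- name: Cache Conan packages
  uses: actions/cache@v3
  with:
    # Illustrative values; point these at your Conan home and lockfiles.
    path: ~/.conan/data
    key: conan-${{ runner.os }}-${{ hashFiles('**/conanfile.txt') }}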


Our latest run below for illustration purposes.

Linux: [screenshot of the cache step timing]

Windows: [screenshot of the cache step timing]

rikhuijzer commented 2 years ago

"really is the cache un-archiving which seems to take very long on Windows."

That's because tar is extremely slow on Windows. Some related links:

Adnn commented 2 years ago

Thank you for your response!

(If the issue is with the specific un-archiving utility, maybe there is an alternative that the action could use on Windows to get better performance?)

mattjohnsonpint commented 2 years ago

I'm experiencing the same issue. Here is one of our latest runs after implementing dependency caching.

https://github.com/getsentry/sentry-dotnet/runs/6327391846?check_suite_focus=true

Time to restore dependencies from cache:

| Ubuntu | macOS | Windows |
| ------ | ----- | ------- |
| 18s    | 29s   | 4m 5s   |

With timings visible, it's clearly the tar operation that is the culprit:

[screenshot of step timings]

mattjohnsonpint commented 2 years ago

Looks like this has been going on for a while... See also #442 and #529. Hopefully @bishal-pdMSFT can make some improvements here? Maybe just provide an optional parameter to the action that would tell it to use .zip (or another format) instead of .tgz on Windows? 7-Zip is pre-installed on the virtual environments.

wrexbe commented 2 years ago

@vsvipul To restore the cache, the Ubuntu server takes 16 seconds and the Windows server takes 2 minutes and 29 seconds. https://github.com/space-wizards/space-station-14/runs/7017334343?check_suite_focus=true

lvpx commented 2 years ago

Hi @Adnn @wrexbe @mattjohnsonpint, we do understand. zstd is currently disabled on Windows due to issues with BSD tar, hence the slowness of restore. While we look into it, you could use the workaround suggested here to improve your hosted Windows runner performance. This would enable zstd while using GNU tar.

TL;DR: Add the following step to your workflow before the cache step. That's it.

- if: ${{ runner.os == 'Windows' }}
  name: Use GNU tar
  shell: cmd
  run: |
    echo "Adding GNU tar to PATH"
    echo C:\Program Files\Git\usr\bin>>"%GITHUB_PATH%"

Hope this helps.

Adnn commented 2 years ago

@pdotl Thank you for your response. It is encouraging to read that this issue will be looked into!

On the other hand, I tried adding the Use GNU tar step you provided just before the cache action in our pipeline. The step ran and correctly output "Adding GNU tar to PATH", yet I did not observe any noticeable speed-up (though, as I am not used to doing testing at the workflow level, I may be proven wrong).

mattjohnsonpint commented 2 years ago

@pdotl - Thanks. But like @Adnn, I have to report that changing to GNU tar gave no significant performance gain.

https://github.com/getsentry/sentry-dotnet/runs/7979994529?check_suite_focus=true#step:3:58

Tue, 23 Aug 2022 18:07:43 GMT  Cache Size: ~1133 MB (1187773080 B)
Tue, 23 Aug 2022 18:07:43 GMT  "C:\Program Files\Git\usr\bin\tar.exe" --use-compress-program "zstd -d" -xf D:/a/_temp/6caa1415-0002-48ab-a4c2-b1ac4df21961/cache.tzst -P -C D:/a/sentry-dotnet/sentry-dotnet --force-local
Tue, 23 Aug 2022 18:13:01 GMT  Cache restored successfully

As you can see from the timestamps, it took over 5 minutes to decompress. The same build on Linux and macOS took less than 30 seconds.

I realize that it would require some significant changes to actions/toolkit, but I think that to really get comparable perf we're going to need the ability to use a format other than .tar.gz. Perhaps .7z, since 7-Zip is pre-installed on the runners?
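
In the meantime, here's a rough sketch of how one could sidestep tar manually today: archive the directory yourself with the pre-installed 7-Zip and let actions/cache handle only that single file. This isn't a feature of the action; the paths and key below are hypothetical:

- name: Cache packages as a single archive
  uses: actions/cache@v3
  with:
    path: packages.7z
    key: packages-${{ runner.os }}-${{ hashFiles('**/packages.lock.json') }}

- name: Unpack archive on cache hit
  shell: pwsh
  # Recreates C:\packages; does nothing when the cache missed.
  run: if (Test-Path packages.7z) { 7z x packages.7z -oC:\ }

# ... build steps that populate and use C:\packages ...

- name: Repack archive for the post-job cache save
  shell: pwsh
  run: 7z a packages.7z C:\packages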

lvpx commented 1 year ago

Hi @Adnn @mattjohnsonpint, we are tracking this issue in #984. Please check the current proposal there and provide comments/feedback if any. Closing this as a duplicate.

AMoo-Miki commented 1 year ago

@pdotl since #984 was narrowed down to cross-OS (ref), should this issue be reopened?

hipstersmoothie commented 1 year ago

In my case, the thing we're caching takes ~10-15 minutes to run, and restoring the cache takes even longer.

ofek commented 1 year ago

Has there been any update on this?

LabhanshAgrawal commented 1 year ago

https://learn.microsoft.com/en-us/virtualization/community/team-blog/2017/20171219-tar-and-curl-come-to-windows
https://github.com/libarchive/libarchive seems to have good enough performance on Windows for tar; maybe it could be used here?

mschfh commented 1 year ago

Could this be a regression of https://github.com/microsoft/Windows-Dev-Performance/issues/27#issuecomment-677955496? (Updating the tar binary might have caused the Defender signatures to no longer match, disabling the optimization)

Safihre commented 1 year ago

@lvpx @Phantsure Can we expect this to be on the team's radar anytime soon or should we just accept it?

lvpx commented 12 months ago

@Safihre we are not part of GitHub anymore, unfortunately. Hopefully someone will pick up these pending issues and respond.

jezdez commented 10 months ago

Does anyone know who the product owner is who might shed light on the progress of this?

Phantsure commented 10 months ago

@bethanyj28 seems to be releasing the latest versions. They might help get this on someone's radar.

gaurish commented 10 months ago

We are facing this issue. Because of it, caching takes longer than the actual run, so we have disabled caching for now.

Kindly provide an update on when it will be fixed.

paco-sevilla commented 10 months ago

I just spent 3 days trying to find a workaround for this, so I hope this helps someone...

Background:

We use Bazel, and although the local Bazel cache is not necessarily big (~200 MB in our case), it contains a ton of small files (including symlinks). For us, it took ~19 minutes to untar those 200 MB on Windows 😖

Failed attempts:

Based on this Windows issue (also linked above), my initial attempts were related to Windows Defender:

BTW: The cache action already uses the tar.exe from Git Bash (C:\Program Files\Git\usr\bin\tar.exe), so this workaround (suggested by @lvpx >1 year ago) makes no difference anymore.

Our current workaround:

The underlying issue is the large number of files that need to be extracted, so let's reduce the cache to a single file: ➜ let's put the cache in a Virtual Hard Disk!

So this is our solution:

runs-on: windows-2022
steps:
  ...

  - name: Cache Bazel (VHDX)
    uses: actions/cache@v3
    with:
      path: C:/bazel_cache.vhdx
      key: cache-windows

  - name: Create, mount and format VHDX
    run: |
      $Volume = `
        If (Test-Path C:/bazel_cache.vhdx) { `
            Mount-VHD -Path C:/bazel_cache.vhdx -PassThru | `
            Get-Disk | `
            Get-Partition | `
            Get-Volume `
        } else { `
            New-VHD -Path C:/bazel_cache.vhdx -SizeBytes 10GB | `
            Mount-VHD -Passthru | `
            Initialize-Disk -Passthru | `
            New-Partition -AssignDriveLetter -UseMaximumSize | `
            Format-Volume -FileSystem NTFS -Confirm:$false -Force `
        }; `
      Write-Output $Volume; `
      Write-Output "CACHE_DRIVE=$($Volume.DriveLetter)`:/" >> $env:GITHUB_ENV

  - name: Build and test
    run: bazelisk --output_base=$env:CACHE_DRIVE test --config=windows //...

  - name: Dismount VHDX
    run: Dismount-VHD -Path C:/bazel_cache.vhdx

I know... it's long and ugly, but it works: Extracting the cache only takes 7 seconds and mounting the VHDX only takes 19 seconds! 🎉 This means that we reduced the cache restoration time by a factor of 44 🤓

This is based on Example 3 of the Mount-VHD docs and Example 5 of the New-VHD docs. I'm by no means proficient in PowerShell scripting, so there might be room for improvement...

A few details about the solution:

Safihre commented 10 months ago

Thanks @paco-sevilla. It's just a bit crazy that we have to resort to these kinds of solutions instead of a proper fix from GitHub. This has been going on for ages. And it can't just be the free users (like me) who experience this; the corporate customers that actually pay for each Actions minute must be hitting it too and want it fixed.

paco-sevilla commented 10 months ago

I totally agree! I actually work for one of those enterprise customers that pay for a certain amount of minutes. And it's not only about the 💸... The developer experience is (to put it nicely) poor if something that should take 2-3 minutes suddenly takes 10 times longer 😕

Anyway, I just wanted to share my findings. Maybe they inspire a proper solution. BTW: this is a public repo, so I guess anyone could contribute a proper solution without having to wait on GitHub staff.

malkia commented 8 months ago

@paco-sevilla - Thank you so much!!! I've just been experiencing this issue, where it takes 25 minutes to decompress the Bazel cache (granted, it's probably caching more than it needs to). Thank you! Thank you! Thank you!

ArkadiuszMichalski commented 6 months ago

Any plan to solve this? Even the official documentation doesn't mention anything about performance issues under Windows:
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows
I lost some time implementing the cache on Windows, and in the end it turned out to be much worse than a regular online update (saving or restoring the cache takes 6 min, while the online update takes 1 min).

gaurish commented 6 months ago

+1. Attempting to cache on Windows proves to be a significant waste of time; I spent hours on it. At the very least, the official documentation should be updated to document this "known issue."

samypr100 commented 5 months ago

@paco-sevilla This is awesome. I found myself recently needing to do this across multiple jobs and was surprised there were no GitHub Actions that could do this. I ended up creating one since I thought others might benefit: samypr100/setup-dev-drive.
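
A minimal usage sketch looks something like the following; check the repo README for the exact input names and defaults, which may differ from what's shown here:

- name: Create and mount a dev drive
  uses: samypr100/setup-dev-drive@v2
  with:
    drive-size: 2GB        # input name may differ; see the README
    drive-format: NTFS     # input name may differ; swap for ReFS if that works better for you

- name: Build using the dev drive
  shell: pwsh
  # DEV_DRIVE is assumed here to be exported by the action and to point at the mounted drive.
  run: bazelisk --output_base=$env:DEV_DRIVE test //...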

paco-sevilla commented 5 months ago

@samypr100 Your GitHub Action is awesome! And the examples are super useful! Thanks for creating it and also for the acknowledgements in the README 🙂

I think that if something similar were implemented in this action and the documentation were updated, this issue could finally be closed (after more than 2 years)...

malkia commented 5 months ago

Awesome job @samypr100 - I'll have to try it out! I was experimenting here with NTFS and also tried ReFS - https://github.com/malkia/opentelemetry-cpp/blob/main/.github/workflows/otel_sdk.yml#L28 - though the VHDX gets much bigger (in size, granted mostly zeroes). I wonder what else I can do there.

paco-sevilla commented 5 months ago

I also experimented with ReFS on my local Windows machine, and the performance (even when declared as a Dev Drive with antivirus turned off) is significantly worse than NTFS for my use case (the Bazel cache): slower and, as @malkia says, requiring more space.

samypr100 commented 5 months ago

That's unfortunate to hear; the action does give you the flexibility to change the format to NTFS if desired.

malkia commented 5 months ago

This would completely derail the conversation, but I'm evaluating ReFS / Dev Drive at work, and we use a lot of reparse points for our custom source file system. It turns out that in ReFS 3.10 (the one shipped today, I think, in Win11) each reparse point costs somewhere between 8-16 KB (I can't tell exactly; I derived this number empirically), but then I switched for a week to the Insiders edition, which has ReFS 3.14, and there it was down to less than a thousand bytes (per our special link).

Now, the Bazel cache (the disk cache) has none of that; it's just a CAS. But ReFS is improving on several fronts, so I guess it's worth re-evaluating later. For our use case, we are trying to limit the amount of AV scanning and other minifilters (though we had to explicitly add our own minifilter into the loop for things to work). It supports block-level (not file-level!) deduplication, so if you have a situation where a build copies .dll/.exe files over and over, this would be great to have!