Adnn opened 2 years ago
> …really is the cache un-archiving which seems to take very long on Windows.

That's because `tar` is extremely slow on Windows. Some related links:
Thank you for your response!
(If the issue is with the specific un-archiving utility, maybe there is an alternative that the action could use on Windows to get better performance?)
I'm experiencing the same issue. Here is one of our latest runs after implementing dependency caching.
https://github.com/getsentry/sentry-dotnet/runs/6327391846?check_suite_focus=true
Time to restore dependencies from cache:
| Ubuntu | macOS | Windows |
|---|---|---|
| 18s | 29s | 4m 5s |
With timings visible, it's clearly the `tar` operation that is the culprit:
Looks like this has been going on for a while... See also #442 and #529. Hopefully @bishal-pdMSFT can make some improvements here? Maybe just provide an optional parameter to the action that would tell it to use `.zip` (or another format) instead of `.tgz` on Windows? 7-Zip is pre-installed on the virtual environments.
@vsvipul To restore the cache, the ubuntu server takes 16 seconds, and the windows server takes 2 minutes, and 29 seconds. https://github.com/space-wizards/space-station-14/runs/7017334343?check_suite_focus=true
Hi @Adnn @wrexbe @mattjohnsonpint, we do understand. Actually, `zstd` is disabled on Windows due to issues with BSD tar, hence the slowness of restore. While we look into it, you could use the workaround suggested here to improve your hosted Windows runner performance. This enables `zstd` while using GNU tar.
TL;DR: Add the following step to your workflow before the cache step. That's it.
```yaml
- if: ${{ runner.os == 'Windows' }}
  name: Use GNU tar
  shell: cmd
  run: |
    echo "Adding GNU tar to PATH"
    echo C:\Program Files\Git\usr\bin>>"%GITHUB_PATH%"
```
Hope this helps.
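A quick way to verify that a PATH change like this actually took effect is to check which `tar` the runner resolves, since GNU tar and BSD tar identify themselves differently. This diagnostic step is my own sketch (the step name is made up), not part of the official workaround:

```yaml
- if: ${{ runner.os == 'Windows' }}
  name: Check tar flavor
  shell: cmd
  run: |
    where tar
    tar --version
```

If the PATH change worked, `tar --version` should report `tar (GNU tar) ...` rather than `bsdtar ...`.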
@pdotl Thank you for your response. It is encouraging to read this issue will be looked into!
On the other hand, I tried to add the `Use GNU tar` step you provided, just before the cache action in our pipeline. The step ran and correctly output "Adding GNU tar to PATH", yet I did not observe any noticeable speed-up (but as I am not used to doing testing at the workflow level, I may be proven wrong).
@pdotl - Thanks. But like @Adnn, I have to report that changing to gnu tar gave no significant performance gain.
https://github.com/getsentry/sentry-dotnet/runs/7979994529?check_suite_focus=true#step:3:58
```
Tue, 23 Aug 2022 18:07:43 GMT Cache Size: ~1133 MB (1187773080 B)
Tue, 23 Aug 2022 18:07:43 GMT "C:\Program Files\Git\usr\bin\tar.exe" --use-compress-program "zstd -d" -xf D:/a/_temp/6caa1415-0002-48ab-a4c2-b1ac4df21961/cache.tzst -P -C D:/a/sentry-dotnet/sentry-dotnet --force-local
Tue, 23 Aug 2022 18:13:01 GMT Cache restored successfully
```
As you can see from the timestamps, it took over 5 minutes to decompress. The same build on Linux and macOS took less than 30 seconds.
I realize that it would require some significant changes to actions/toolkit, but I think to really get comparable perf we're going to need the ability to use a format other than `.tar.gz`. Perhaps `.7z`, since 7-Zip is pre-installed on the runners?
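In the meantime, one can approximate this by hand: archive the dependency folder into a single `.7z` yourself and let actions/cache see only that one file, so its internal `tar` touches a single entry. The sketch below is hypothetical (step names, paths, and the cache key are all made up for illustration, and this only works if you manage the archive layout yourself):

```yaml
- name: Cache packages (single 7z archive)
  uses: actions/cache@v3
  with:
    path: deps.7z
    key: deps-${{ runner.os }}-${{ hashFiles('**/packages.lock.json') }}

- name: Unpack packages
  if: ${{ runner.os == 'Windows' }}
  shell: pwsh
  run: if (Test-Path deps.7z) { 7z x deps.7z "-oC:\deps" }

# Steps run in order, so this repack happens before the cache action's
# post step uploads deps.7z at the end of the job.
- name: Repack packages
  if: ${{ runner.os == 'Windows' }}
  shell: pwsh
  run: 7z a deps.7z C:\deps
```

The idea is the same as the "reduce the cache to a single file" approaches discussed in this thread: the per-file overhead disappears when the cache archive contains exactly one entry.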
Hi @Adnn @mattjohnsonpint we are tracking this issue in #984. Please check the current proposal there and provide comments/feedback if any. Closing this as duplicate.
@pdotl since #984 was narrowed down to cross-OS caching (ref), should this issue be reopened?
In my case, the things we're caching take ~10-15 minutes to run, and restoring the cache takes even longer.
Has there been any update on this?
https://learn.microsoft.com/en-us/virtualization/community/team-blog/2017/20171219-tar-and-curl-come-to-windows
https://github.com/libarchive/libarchive seems to have good enough performance on Windows for tar. Maybe it could be used here?
Could this be a regression of https://github.com/microsoft/Windows-Dev-Performance/issues/27#issuecomment-677955496? (Updating the tar binary might have caused the Defender signatures to no longer match, disabling the optimization)
@lvpx @Phantsure Can we expect this to be on the team's radar anytime soon or should we just accept it?
@Safihre we are not part of GitHub anymore, unfortunately. Hopefully someone will pick up these pending issues and respond.
Does anyone know who the product owner is who might shed light on the progress of this?
@bethanyj28 seems to be releasing the latest versions. They might help get this on someone's radar.
We are facing this issue too. Because of it, caching takes longer than the actual run, so we have disabled caching for now.
Kindly provide an update on when it will be fixed.
I just spent 3 days trying to find a workaround for this, so I hope this helps someone...
We use Bazel, and although the local Bazel cache is not necessarily big (~200 MB in our case), it contains a ton of small files (including symlinks). For us, it took ~19 minutes to untar those 200 MB on Windows 😖
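To see why the file count (not the size) dominates, here is a toy reproduction with throwaway file names: it builds and extracts an archive of 1000 tiny files, and it's that extraction step that pays per-file filesystem and antivirus overhead a thousand times on Windows.

```shell
# Create 1000 tiny files (a stand-in for a Bazel output tree).
mkdir -p bazel_cache_demo
for i in $(seq 1 1000); do echo "$i" > "bazel_cache_demo/f$i"; done

# Archive them, roughly as actions/cache would.
tar -czf cache_demo.tgz bazel_cache_demo
rm -rf bazel_cache_demo

# Extraction: fast on Linux, pathologically slow on Windows runners.
tar -xzf cache_demo.tgz
ls bazel_cache_demo | wc -l   # -> 1000
```

The archive is tiny, yet the extraction issues one file-create per entry; that per-entry cost is what blows up to ~19 minutes with a real Bazel cache.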
Based on this Windows issue (also linked above), my initial attempts were related to Windows Defender. However, the `D:/` drive (where the cache is extracted) is already excluded by default in the GitHub runners.

BTW: The cache action already uses the `tar.exe` from Git Bash (`C:\Program Files\Git\usr\bin\tar.exe`), so the workaround suggested by @lvpx >1 year ago makes no difference anymore.
The underlying issue is the large number of files that need to be extracted, so let's reduce the cache to a single file: ➜ Let's put the cache in a Virtual Hard Disk!
So this is our solution:
```yaml
runs-on: windows-2022
steps:
  ...
  - name: Cache Bazel (VHDX)
    uses: actions/cache@v3
    with:
      path: C:/bazel_cache.vhdx
      key: cache-windows
  - name: Create, mount and format VHDX
    run: |
      $Volume = `
        If (Test-Path C:/bazel_cache.vhdx) { `
          Mount-VHD -Path C:/bazel_cache.vhdx -PassThru | `
            Get-Disk | `
            Get-Partition | `
            Get-Volume `
        } else { `
          New-VHD -Path C:/bazel_cache.vhdx -SizeBytes 10GB | `
            Mount-VHD -Passthru | `
            Initialize-Disk -Passthru | `
            New-Partition -AssignDriveLetter -UseMaximumSize | `
            Format-Volume -FileSystem NTFS -Confirm:$false -Force `
        }; `
      Write-Output $Volume; `
      Write-Output "CACHE_DRIVE=$($Volume.DriveLetter)`:/" >> $env:GITHUB_ENV
  - name: Build and test
    run: bazelisk --output_base=$env:CACHE_DRIVE test --config=windows //...
  - name: Dismount VHDX
    run: Dismount-VHD -Path C:/bazel_cache.vhdx
```
I know... it's long and ugly, but it works: Extracting the cache only takes 7 seconds and mounting the VHDX only takes 19 seconds! 🎉 This means that we reduced the cache restoration time by a factor of 44 🤓
This is based on Example 3 of the Mount-VHD docs and Example 5 of the New-VHD docs. I'm by no means proficient in PowerShell scripting, so there might be room for improvement...
A few details about the solution:

- The mounted volume usually gets the drive letter `E:/` in the GitHub runners. However, this is not necessarily deterministic. I tried assigning a specific drive letter, but there are two issues with that: 1. the drive letter could already be occupied, and 2. the `Mount-VHD` command doesn't support it (only the `New-Partition` command does). So, we store the path of the drive in a new `CACHE_DRIVE` environment variable that we can use in later steps.
- There's no need for `always()` or `!cancelled()` in the `Dismount VHDX` step, because if something fails, the cache will be disregarded anyway. So, we don't care if the volume gets dismounted or not 🤷

Thanks @paco-sevilla. It's just a bit crazy we have to resort to this kind of solution instead of a proper fix from GitHub. This has been going on for ages. And it can't just be the free users (like me) that experience this; the corporate customers that actually pay for each Actions minute must experience this too and want it reduced.
I totally agree! I actually work for one of those enterprise customers that pay for a certain amount of minutes. And it's not only about the 💸... The developer experience is (to put it nicely) poor if something that should take 2-3 minutes suddenly takes 10 times longer 😕
Anyway, I just wanted to share my findings. Maybe they inspire a proper solution. BTW: This is a public repo, so I guess anyone could contribute a proper solution without having to wait on GitHub staff.
@paco-sevilla - Thank you so much!!! I've been experiencing this exact issue, where it took 25 minutes to decompress the Bazel cache (granted, it's probably caching more than it needs to). Thank you! Thank you! Thank you!
Any plan to solve this? Even the official documentation doesn't mention anything about performance issues under Windows:
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows
I lost some time implementing the cache on Windows, and in the end it turns out that it is much worse than a regular online update (saving or restoring the cache takes 6 min, while an online update takes 1 min).
+1 Attempting to cache on Windows proves to be a significant waste of time; I spent hours on it. At the very least, the official documentation should be updated to document this "known issue."
@paco-sevilla This is awesome. I found myself recently needing to do this across multiple jobs and was surprised there were no GitHub Actions that could do this. I ended up creating one, since I thought others might benefit: samypr100/setup-dev-drive.
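For anyone landing here, usage looks roughly like the following sketch. The input names and the `DEV_DRIVE` environment variable are assumptions on my part; check the samypr100/setup-dev-drive README for the action's actual interface before copying:

```yaml
# Hypothetical sketch: mount a formatted virtual drive in one step,
# then build inside it so the cacheable state lives in a single file.
- uses: samypr100/setup-dev-drive@v3
  with:
    drive-size: 2GB
    drive-format: NTFS

- name: Build inside the mounted drive
  run: bazelisk --output_base=${{ env.DEV_DRIVE }}/bazel test //...
```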
@samypr100 Your Github Action is awesome! And the examples are super useful! Thanks for creating it and also for the acknowledgements in the README 🙂
I think that, if something similar would be implemented in this Action and the documentation was updated, this issue could finally be closed (after more than 2 years)...
Awesome job @samypr100 - I'll have to try it out! - I was experimenting here with NTFS and tried ReFS - https://github.com/malkia/opentelemetry-cpp/blob/main/.github/workflows/otel_sdk.yml#L28 - though the VHDX gets to be much bigger (in size, granted mostly zeroes). Wonder what else I can do there.
I also experimented with ReFS on my local Windows machine and the performance (even when declared as a DevDrive with AntiVirus turned off) is significantly worse (slower and, as @malkia says, requires more space) than NTFS for my use-case (Bazel cache).
That's unfortunate to hear; the action does give you the flexibility to change the format to NTFS if desired.
This might completely derail the conversation, but I'm evaluating ReFS / DevDrive at work, and we use a lot of reparse points for our custom source file system. It turns out that in ReFS 3.10 (the one shipped today, I think, in Win11) such a reparse point takes somewhere between 8-16 KB (I can't tell exactly; I derived this number empirically). But then I switched for a week to the Insiders edition, which has ReFS 3.14, and there it was down to less than a thousand bytes (per our special link).
Now, the Bazel cache (the disk cache) has none of that; it's just a CAS. But ReFS is improving on multiple fronts, so I guess it's worth re-evaluating later. For our use case, we are trying to limit the amount of AV scanning and other minifilters (though we had to explicitly add our own minifilter to the loop for things to work). It supports block-level (not file-level!) deduplication, so if you have a situation where a build copies .dll/.exe files over and over, this would be great to have!
We are using this action to cache Conan packages.
Our current cache is ~300 MB on both Linux (`ubuntu-20.04`) and Windows (`windows-2019`). Sadly, where the cache step routinely takes ~10 seconds on Linux, it oftentimes takes ~5 minutes on Windows.

This makes iterations frustrating: as the rest of the workflow takes about 2 minutes to complete, we get a global x3 time penalty because of this.
As far as I can tell, the archive retrieval time is comparable, it really is the cache un-archiving which seems to take very long on Windows.
Our latest run below for illustration purposes.
Linux:
Windows: