actions / runner-images

GitHub Actions runner images
MIT License
9.17k stars 2.84k forks source link

extremely slow Network and Disk IO on Windows agent compared to Ubuntu/Mac #3577

Closed jetersen closed 2 years ago

jetersen commented 2 years ago

Description
Actually we are seeing the same behavior on GitHub actions running a shell command for dotnet restore takes a very long time on windows even when using actions/cache πŸ˜…

Originally posted by @jetersen in https://github.com/actions/virtual-environments/issues/1733#issuecomment-860123620

I cannot replicate locally where a restore with a full nuget cache it takes less than a second.

Area for Triage:
.NET Core

Question, Bug, or Feature?:
Bug

Virtual environments affected

Image version

Current runner version: '2.278.0'
Operating System
  Microsoft Windows Server 2019
  10.0.17763
  Datacenter
Virtual Environment
  Environment: windows-2019
  Version: 20210531.1
  Included Software: actions/virtual-environments@win19/20210531.1/images/win/Windows2019-Readme.md
  Image Release: actions/virtual-environments@win19%2F20210531.1 (release)

Expected behavior
actions/cache should speed up dotnet restore to take a few seconds

Actual behavior
Dotnet restore even with actions/cache takes well over 30 seconds.

Repro steps
https://github.com/jetersen/dotnet.restore.slow.github.action

al-cheb commented 2 years ago

Hello, @jetersen. Please provide a small repo how to reproduce the issue.

jetersen commented 2 years ago

Well repro was super easy actually.

https://github.com/jetersen/dotnet.restore.slow.github.action

No cache 41s Cached 33s

Locally on my machine the cache speeds up the restore to under 1 seconds. Starting dotnet restore also seem relatively slow when called through pwsh shell 😰

jetersen commented 2 years ago

The scary part is that the slowness is consistent at least according to the dotnet restore timings.

Third using cmd cached 37s

On bigger projects this time scale goes into minutes and we would expect that the cache would speed up the restore significantly.

jetersen commented 2 years ago

Well Linux and MacOS does not have this issue.

Ubuntu-20.04 no cache 14s Ubuntu-20.04 with cache 7s MacOS seems to also benefit from the cache with 6s before cache it was 57s.

This is more along the lines of what I would expect.

No improvement between each run on windows with cache...

jetersen commented 2 years ago

In log you can clearly see the timescale on windows not making sense but it is consistent.

  Determining projects to restore...
  Restored D:\a\dotnet.restore.slow.github.action\dotnet.restore.slow.github.action\tests\Library.UnitTests\Library.UnitTests.csproj (in 20.78 sec).
  Restored D:\a\dotnet.restore.slow.github.action\dotnet.restore.slow.github.action\tests\Api.UnitTests\Api.UnitTests.csproj (in 20.78 sec).
  Restored D:\a\dotnet.restore.slow.github.action\dotnet.restore.slow.github.action\src\Library\Library.csproj (in 2 ms).
  Restored D:\a\dotnet.restore.slow.github.action\dotnet.restore.slow.github.action\src\Api\Api.csproj (in 9 ms).
jetersen commented 2 years ago

@al-cheb Hope the repro is good enough, we were seeing the similar behavior in our private repo on our unit test project but also on some of our projects with more dependencies. We do restore our internal packages from an internal NuGet source but that would not affect caching.

al-cheb commented 2 years ago

@jetersen, Looks like it doesn't work only with .net core 5.0.301. It is recommended to use actions/setup-dotnet@v1 task to pick up version:

E.g. - https://github.com/vsafonkin-test-organization/dotnet.restore.slow.github.action/runs/2829395627?check_suite_focus=true

      - name: Setup dotnet
        uses: actions/setup-dotnet@v1
        with:
          dotnet-version: '5.0.300'

      - uses: actions/cache@v2
        with:
          path:
            ~/.nuget/packages
          key: ${{ runner.os }}-nuget-5.0.300-${{ hashFiles('**/packages.lock.json') }}
          restore-keys: |
            ${{ runner.os }}-nuget-

      - name: dotnet restore
        run: |
          dotnet restore --verbosity n

image

jetersen commented 2 years ago

Good thing is you now have an good repro case to fix it and hopefully it won't reappear in the next dotnet release πŸ˜…

vsafonkin commented 2 years ago

Hi @jetersen, dotnet 6.0 (preview version) also works properly. Looks like this bug affects only the latest patch version of 5.0

jetersen commented 2 years ago

@vsafonkin odd, I guess someone else already caught the regression πŸ˜…

jetersen commented 2 years ago

@vsafonkin @al-cheb I think we can close it as you mentioned it only affects the latest 5.0.301 patch. I would still like to know what went wrong in the latest patch but not sure it is something we can fix here 😊

Thank you for investigating the issue.

SteveDesmond-ca commented 2 years ago

This is still happening with the latest version of .NET (6.0.x) -- I have projects consistently taking multiple minutes to restore a minimal set of dependencies.

jetersen commented 2 years ago

I have noticed the same thing recently.

jetersen commented 2 years ago

Just saw a 1 minute restore for very basic test packages: https://github.com/specshell/specshell.software.ndde/runs/4481718755?check_suite_focus=true @al-cheb @vsafonkin should we open the issue elsewhere is this better suited for dotnet org?

vsafonkin commented 2 years ago

Hi @jetersen, I cannot reproduce it for 6.0.100 version: restore without caching about 2 min: run restore with caching about 3 sec: run

Looks like it works properly or I miss something?

SteveDesmond-ca commented 2 years ago

My latest test shows that it's not just dotnet restore, but almost the entire windows-latest environment is at least an order of magnitude slower than the exact same workflow running on ubuntu-latest (diff).

Step Ubuntu run Windows run
Checkout 1 s 12 s
Install .NET 13 s 64 s
Restore 8 s 120 s
Build 4 s 8 s
Test 6 s 7 s
Total 34 s 216 s

The build and test steps are close enough, but all of the actions that require some sort of network transfer are extremely slow. Is this a known issue with Windows environments, and if so, can something be added to documentation somewhere to indicate that the huge performance hit is expected?

jetersen commented 2 years ago

@SteveDesmond-ca nice comparison, I can definitely attest to this based on our private repos.

jetersen commented 2 years ago

Ya, okay now the issue is no longer dotnet restore but networking is a significant issue in the test case I built:

action: skip telemetry and other slow downs image image

action: update dependencies image image

Even in the update dependencies action restoring new dependencies is slow on windows. Ubuntu VM spends 5 seconds Windows VM spends 14 seconds.

So network seems to be a issue.

Checking the output of the cache as it includes download speeds something is off: zstd is definitely also a benefit as you can see by the cache size.

Ubuntu

Received 50331648 of 138545810 (36.3%), 48.0 MBs/sec
Received 134351506 of 138545810 (97.0%), 64.0 MBs/sec
Received 138545810 of 138545810 (100.0%), 59.0 MBs/sec
Cache Size: ~132 MB (138545810 B)

Windows

Received 0 of 190559914 (0.0%), 0.0 MBs/sec
Received 96468992 of 190559914 (50.6%), 45.6 MBs/sec
Received 186365610 of 190559914 (97.8%), 58.9 MBs/sec
Received 190559914 of 190559914 (100.0%), 44.9 MBs/sec
Cache Size: ~182 MB (190559914 B)

Sometimes ubuntu seems a lot faster at download:

Received 138545810 of 138545810 (100.0%), 120.4 MBs/sec
Cache Size: ~132 MB (138545810 B)

Could this be fixed tweaking by TcpAckFrequency and TcpNoDelay on the windows VM?

Perhaps update the issue title: slow network transfer on Windows agent

Perhaps worth reconsidering #4424 to have DotNet 6.0 installed as it is the latest LTS.

zcsizmadia commented 2 years ago

I had cases when restore took a long time when nuget.org was not available to check the signatures. Setting env var NUGET_CERT_REVOCATION_MODE to offline solved that particular problem. Maybe it is the same issue? It will force NuGet to check the revocation status of the certificate only against the cached certificate revocation list, and NuGet will not attempt to reach revocation servers.

https://docs.microsoft.com/en-us/nuget/reference/errors-and-warnings/nu3028

jetersen commented 2 years ago

@zcsizmadia well as we just discussed in the last two comments is that restore is not really the issue but rather the networking seems to be a order of magnitude slower.

jetersen commented 2 years ago

@vsafonkin @al-cheb the slowness seems to affect all the windows VM: https://github.com/jetersen/dotnet.restore.slow.github.action/actions/runs/1568508906

Although windows 2022 seems to have a little faster tar.exe which is good I guess πŸ˜„

I am not saying we should fix windows 2016 but at least look at 2019 and 2022. And figure out why they are SO much slower than ubuntu or even macos. Usually MacOS is only 10 seconds slower but I have seen the agents flake sometimes and become slow as you can see in the screenshot below.

image

jetersen commented 2 years ago

I think I found the culprit on windows it is the System.Net.HttpClient: https://github.com/jetersen/dotnet.restore.slow.github.action/blob/main/.github/workflows/ci.yml

Note I tried to switch from use of powershell to pwsh with no luck. It is still the System.Net.HttpClient being slow.

Here I tried using curl.exe which seems to be able to download it 2-3x faster. image image

SteveDesmond-ca commented 2 years ago

I'm starting to look at timestamps in various actions, which is leading me to believe this is actually a disk I/O issue, not a network I/O issue.

jetersen commented 2 years ago

@SteveDesmond-ca care to explain the speed difference in curl.exe and tar.exe than? πŸ˜…

SteveDesmond-ca commented 2 years ago

Turns out it's both! Here is a minimal repro that doesn't even use .NET at all: downloading dependencies via yarn (for a real-world medium-sized app) has the same discrepancies between ubuntu-latest and windows-latest

https://github.com/SteveDesmond-ca/gh-actions-windows-slow/actions/runs/1574970778 OS Shell Startup Download 500 tgz Create 1K dirs Copy 12K files
ubuntu-latest bash 0.4 s 8.8 s 2.1 s 3.5 s
ubuntu-latest pwsh 0.4 s 11.0 s 1.8 s 4.9 s
windows-latest bash 11.1 s 85.4 s 2.3 s 4.4 s
windows-latest pwsh 10.0 s 99.9 s 3.9 s 6.0 s

As you can see, there are also compounding performance issues when using pwsh (the default on windows-latest) vs bash -- this may also contribute to @jetersen's remark that windows-2022 is slightly faster, since pwsh on windows-2022 is probably running a newer (faster) version of .NET under the hood.

jetersen commented 2 years ago

I don't think it is specifically pwsh that is the issue since I have several .NET repos using pwsh and ubuntu-latest with no restore penalty.

Clearly something on the VMs are configured incorrectly which causes terrible Network and Disk I/O operation.

jetersen commented 2 years ago

misclick πŸ˜…

jetersen commented 2 years ago

Timings from yesterday around 14:00 CET. Shows that this is still an issue.

image

vsafonkin commented 2 years ago

@jetersen, we are still investigating it, actually I'm confused that downloading via curl has very similar performance. For example, powershell script (download ~200 Mb 200 times):

For ($i = 0; $i -lt 200; $i++) {
  $outfile = "dotnet-$i.zip"
  curl https://dotnetcli.azureedge.net/dotnet/Sdk/5.0.403/dotnet-sdk-5.0.403-win-x64.zip -o $outfile
  rm $outfile
}

and result:

Screenshot 2021-12-28 at 4 40 57 PM
al-cheb commented 2 years ago

@jetersen , @SteveDesmond-ca , Could you check a restore step with params?

- name: Restore deps
  run: |
        dotnet new nugetconfig
        dotnet restore -v n --packages D:\pkgs
jetersen commented 2 years ago

@al-cheb that does not resolve the fact that downloading .NET 6 is also slower. The issue title has changed to be generic: Network and Disk IO in general seems slower on the windows agent. Where curl on windows for some reason does not have this issue.

al-cheb commented 2 years ago

@al-cheb that does not resolve the fact that downloading .NET 6 is also slower. The issue title has changed to be generic: Network and Disk IO in general seems slower on the windows agent. Where curl on windows for some reason does not have this issue.

Ubuntu agents have slightly higher IOPS disk performance configuration. We use install-dotnet.ps1 script for installation provided by DotNet team . The DownloadFile and Extract-Dotnet-Package functions are being slow. We will investigate how to improve performance those functions if we replace DownloadFile -> WebClient and Extract-Dotnet-Package -> 7zip

image

jetersen commented 2 years ago

@al-cheb https://github.com/jetersen/dotnet.restore.slow.github.action/commit/f342429d66b14ce848b79cd59733e00bd9536cb1 does indeed help with NuGet restore. I used a different approach to what you mentioned.

vsafonkin commented 2 years ago

@jetersen, how we see this issue is not related to network configuration of the Windows runner. We have tested Windows image from Azure Marketplace and got the same network performance. Looks like the problem appears from .NET side because we see performance degradation with newer versions of .NET(for example restore on .NET 5.0 is slower then restore on .NET 3.1). Also there is an issue with precached Nuget packages placed on C:\Program Files (x86)\Microsoft SDKs\NuGetPackages\ directory on the runner: restoring packages from cache is more slowly then downloading from nuget.org, so default nuget config (command dotnet new nugetconfig) contains only nuget.org feed and has better performance. Probably it's related to IO disk operations and looks like there is nothing we can do from our side unfortunately.

As @al-cheb mentions above Ubuntu agents have slightly higher IOPS disk performance configuration.

jetersen commented 2 years ago

I think dotnet new nugetconfig is a fine solution perhaps some of these things should be documented in the setup-dotnet action or somewhere it makes sense.

Of course there is still the issue with setup-dotnet being slow to install .NET SDK

vsafonkin commented 2 years ago

Of course there is still the issue with setup-dotnet being slow to install .NET SDK

Yes, but to install the .NET SDK, the setup-dotnet action uses scripts maintained by DotNet team as the recommended approach and we also cannot to do anything with it.

https://docs.microsoft.com/en-us/dotnet/core/tools/dotnet-install-script

vsafonkin commented 2 years ago

Closed, because there is nothing we can do from our side.

jetersen commented 2 years ago

Created https://github.com/actions/setup-dotnet/issues/260 πŸ˜“

ItalyPaleAle commented 1 year ago

We have I/O perf issues as well, and it's not related to .NET.

For example, when using the actions/cache@v3 Action to restore something from cache. The cache contains a lot of small files that are then unpacked.

Windows:

image

It took Windows 428 seconds to restore 1337MB.

Linux:

image

Although Linux had to restore only 1074MB, it took only 34 seconds.

As you can see, Windows was actually downloading files from the cache faster than Windows. However, I suspect that what took the longest was extracting them: to us, it seems to be more of a disk I/O issue rather than a network one.

This is one big example, but we see Windows agents being much slower in other things that perform disk I/O, such as pulling/pushing Docker images.

miketimofeev commented 1 year ago

@ItalyPaleAle windows and ubuntu runners are different in terms of disk allocation and that may explain the issue.

ItalyPaleAle commented 1 year ago

@miketimofeev the performance different is very significant however. I don't know what disks are used, but Linux agents processed 1000MB in 32s and WIndows ones in 320s in our tests above, so 10x slower. (I am aware that it's not apples-to-apples as the files in the caches aren't exactly the same, but this difference should be significant regardless)

wrexbe commented 1 year ago

This is very frustrating, the nuget downloading was slow on windows, even timing out sometimes, so to make it faster, and more reliable I wanted to use the cache, only to find out that the cache action is also very slow on windows. https://github.com/space-wizards/space-station-14/runs/7017334343?check_suite_focus=true The windows server took 2 minutes, and 29 seconds to restore the compressed file, and the Ubuntu server took 16 seconds. Is there a way to work around this?

ericmutta commented 10 months ago

Unfortunately, this is still a problem as of July/2023. The dotnet restore command on ubuntu-latest runs in seconds, but for both windows-2019 and windows-latest it takes a minute or more. Is there a fix for this? The problem actually costs us money because of per-minute billing 😞

thomhurst commented 7 months ago

I've got a pipeline that I run against Linux, Mac and Windows to test cross platform-ness. The Linux and Mac ones take a few minutes. The windows one can take 7, 8 or 9 minutes.

wrexbe commented 7 months ago

Yeah, windows is still like 11 minutes slower then linux for us. https://github.com/space-wizards/space-station-14/actions/runs/6272151167/job/17033126765

Danielku15 commented 7 months ago

I also see similar timings: image https://github.com/CoderLine/alphaSkia/actions/runs/6441066466

My build times are overall significantly slower in Windows. A dotnet build (with restore) of my Nuke build project takes 15secs on ubuntu-latest and >1min on windows-latest. But also other compilations take significantly more time.