actions / runner-images

GitHub Actions runner images
MIT License

Workflow fails to download dependencies from NuGet.org #3038

Closed AaronVanGeffen closed 2 years ago

AaronVanGeffen commented 3 years ago

Description
For the OpenLoco project, we gratefully make use of GitHub Actions for our CI. To this end, we set up a workflow for our Windows CI: https://github.com/OpenLoco/OpenLoco/blob/master/.github/workflows/ci.yml#L18-L59

The Visual Studio project this workflow builds has been set up to depend on a package from NuGet.org: https://github.com/OpenLoco/OpenLoco/blob/master/src/OpenLoco/openloco.vcxproj#L352-L354 https://www.nuget.org/packages/openloco.dependencies

This workflow worked fine until two days ago. However, we have noticed the package no longer gets retrieved properly, causing most (but not all) runs to fail.

Area for Triage:
C/C++

Question, Bug, or Feature?:
Bug

Virtual environments affected

Image version 20210321.1 (broken) 20210316.1 (fine)

Expected behavior
Successful run on image version 20210316.1, 3 days ago https://github.com/OpenLoco/OpenLoco/runs/2181845951

Restore:
  Restoring packages for D:\a\OpenLoco\OpenLoco\src\OpenLoco\openloco.vcxproj...
    GET https://api.nuget.org/v3-flatcontainer/openloco.dependencies/index.json
    OK https://api.nuget.org/v3-flatcontainer/openloco.dependencies/index.json 173ms
    GET https://api.nuget.org/v3-flatcontainer/openloco.dependencies/1.3.0/openloco.dependencies.1.3.0.nupkg
    OK https://api.nuget.org/v3-flatcontainer/openloco.dependencies/1.3.0/openloco.dependencies.1.3.0.nupkg 105ms
  Installed OpenLoco.Dependencies 1.3.0 from https://api.nuget.org/v3/index.json with content hash 2kTdy9fJCCYiCDi5QPxbJQ8st/7AfT+gt+rjAz++yONLyxOIBQbdk2g5Oqv5VSU/UAPZ8BrNSanNdxcYDszkFA==.
  Committing restore...
  Generating MSBuild file D:\a\OpenLoco\OpenLoco\src\OpenLoco\obj\openloco.vcxproj.nuget.g.props.
  Generating MSBuild file D:\a\OpenLoco\OpenLoco\src\OpenLoco\obj\openloco.vcxproj.nuget.g.targets.
  Writing assets file to disk. Path: D:\a\OpenLoco\OpenLoco\src\OpenLoco\obj\project.assets.json
  Restored D:\a\OpenLoco\OpenLoco\src\OpenLoco\openloco.vcxproj (in 7.18 sec).

Actual behavior
Failed run on 20210321.1 today: https://github.com/OpenLoco/OpenLoco/runs/2204773635

Restore:
  Restoring packages for D:\a\OpenLoco\OpenLoco\src\OpenLoco\openloco.vcxproj...
D:\a\OpenLoco\OpenLoco\src\OpenLoco\openloco.vcxproj : error NU1101: Unable to find package OpenLoco.Dependencies. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [D:\a\OpenLoco\OpenLoco\openloco.sln]
  Committing restore...
  Generating MSBuild file D:\a\OpenLoco\OpenLoco\src\OpenLoco\obj\openloco.vcxproj.nuget.g.props.
  Generating MSBuild file D:\a\OpenLoco\OpenLoco\src\OpenLoco\obj\openloco.vcxproj.nuget.g.targets.
  Writing assets file to disk. Path: D:\a\OpenLoco\OpenLoco\src\OpenLoco\obj\project.assets.json
  Failed to restore D:\a\OpenLoco\OpenLoco\src\OpenLoco\openloco.vcxproj (in 942 ms).

Repro steps
Please look at https://github.com/OpenLoco/OpenLoco/actions

maxim-lobanov commented 3 years ago

Hello @AaronVanGeffen, do you have a nuget.config file in your repo? I think this is a known issue when you don't have a config file. Could you please try adding a simple nuget.config as described in this comment: https://github.com/actions/setup-dotnet/issues/155#issuecomment-761195782

AaronVanGeffen commented 3 years ago

Hello @maxim-lobanov, thank you for the suggestion. Indeed, there was no nuget.config file in our repository. Previously this worked fine, so I was not aware we would need any. However, adding the config file as you suggested appears to have helped: https://github.com/OpenLoco/OpenLoco/pull/855

maxim-lobanov commented 3 years ago

Cool, glad to hear that it helped.

It looks like dotnet sometimes doesn't resolve packages from the remote source when it doesn't find them in the local cache. I don't think this is a new issue: based on https://github.com/actions/setup-dotnet/issues/155#issuecomment-761195782, it fails pretty randomly without a nuget.config file.

The root cause is still unclear, but using a nuget.config is the NuGet team's recommendation for dealing with this issue: https://github.com/NuGet/Home/issues/10586#issuecomment-783689013

AaronVanGeffen commented 3 years ago

Thank you for the explanation. Just to confirm, CI has been working reliably tonight with the nuget.config file in.

miketimofeev commented 3 years ago

@AaronVanGeffen thanks for the confirmation! I'm going to close the issue but feel free to contact us if you have any concerns.

rokups commented 3 years ago

I bumped into the same issue. While the workaround does work, I noticed that the PATH printed by the setup-dotnet action differs between failing and succeeding builds. Failing builds have C:\hostedtoolcache\windows\Java_Adopt_jdk\8.0.282+8\x64\bin in the PATH and succeeding builds have C:\Program Files\Java\jdk8u282-b08\bin. I suspect not all build workers use the same image even if all of them use windows-latest.

miketimofeev commented 3 years ago

@rokups it takes 3-4 days to propagate the new image (with Java in the hostedtoolcache directory) to all the environments. We're going to finish the deployment on Monday

fabriciomurta commented 3 years ago

Hello, I believe this issue should be investigated further and (hopefully) solved without the need to ship our own nuget.config file. The NuGet config should be the same across different instances of Windows runners, and I would also expect it to match the Mac and Linux runners (which do not suffer from this problem).

It sounds reasonable to have the nuget.org source enabled by default for any packages not found in the local cache (a local cache makes sense, so that the runtime environments don't keep downloading packages from NuGet.org on every run).

Note: I have two different runs in the same 20210330.2 environment version, one failing and another succeeding. It really looks like runners are not being properly cleaned up after a previous run.

miketimofeev commented 3 years ago

@fabriciomurta no, it's not possible — every run is performed on a clean agent

asklar commented 3 years ago

@miketimofeev @maxim-lobanov Can you please reactivate this issue and help us get to a resolution? The core NuGet issue NuGet/Home#10586 was closed without a resolution other than "clean the nuget cache" or "have a nuget.config" which don't seem productive, since restoring projects used to work fine before the last AzDO pipelines/GHA updates. You're saying it's not possible that this is due to non-clean state on the agent (which I'd agree with), but nuget folks are implying the contrary.

Both this and the NuGet issue are closed and we need someone to step up in either of the teams and provide a solution. We have multiple repos - both internal and external - that got broken since the latest update.

miketimofeev commented 3 years ago

@asklar reopened. Could you please share some repo where the issue persists? We've tried to reproduce last time without any luck

rokups commented 3 years ago

You may use https://github.com/rokups/rbfx if it is not too fat. Delete nuget.config from root folder to make it fail.

asklar commented 3 years ago

@miketimofeev thanks. We're hitting this in https://github.com/microsoft/react-native-windows/ among others

fabriciomurta commented 3 years ago

This is random, so to troubleshoot you should add a step to pinpoint the host of the runner (the actual computer) and schedule it to run in any of the failing sample repos provided, say every 30 min until it fails. I suspect dotnet nuget list source should show a different output in the affected runners virtual instances (that's my bet with the action I made)
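The scheduled diagnostic run suggested in this comment could look roughly like the sketch below. It is only an illustration: the workflow and job names are made up, `COMPUTERNAME` is the standard Windows variable for the machine name, and `ImageVersion` is assumed to be set by the hosted runner image.

```yaml
# Hypothetical probe workflow; names are illustrative, not taken from the thread.
name: nuget-source-probe
on:
  schedule:
    - cron: '*/30 * * * *'   # roughly every 30 minutes
  workflow_dispatch:          # also allow manual runs
jobs:
  probe:
    runs-on: windows-latest
    steps:
      - name: Print runner identity and NuGet sources
        shell: pwsh
        run: |
          # Record which machine and image served this run.
          Write-Host "Machine name: $env:COMPUTERNAME"
          Write-Host "Image version: $env:ImageVersion"   # assumed to be set on hosted runners
          # List the configured sources; an affected runner would presumably
          # show only the Visual Studio offline source here.
          dotnet nuget list source
```

Comparing the output of a failing run against a succeeding one would show whether the registered sources really differ between runner instances.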


database64128 commented 3 years ago

@fabriciomurta According to https://github.com/NuGet/Home/issues/10586#issuecomment-809998961, running dotnet nuget list source actually fixes the issue on the runner.

marysaka commented 3 years ago

@miketimofeev Hitting it here too but it's not random https://github.com/Ryujinx/Ryujinx/ (Example of a failing PR run: https://github.com/Ryujinx/Ryujinx/runs/2287451767)

jantoineqci commented 3 years ago

I have consistently received failures this morning because of this. I am also using windows-latest. When listing the NuGet sources, only the offline one is registered. I had to manually register nuget.org to get it to work.

fabriciomurta commented 3 years ago

@jantoineqci this really supports my suspicion the default public nuget source is not being set up and is exactly what the action I made seeks to address. Glad to know it is (very) likely to fix it.

About the 30-min tests I suggested above, it seems somebody already set up a 30-min test in a repo (https://github.com/actions/virtual-environments/issues/1090#issuecomment-814556942), and it is not hitting the problematic instances. So being assigned to those nuget-source-less instances may be related to the subscription or actions usage demand of the repository.

Another hint is that this has been happening for some time in the Azure DevOps environment, according to this comment: https://github.com/actions/virtual-environments/issues/1090#issuecomment-751354009

@database64128 , About "any dotnet nuget command fixes the issue" (https://github.com/NuGet/Home/issues/10586#issuecomment-809998961), the following comment (https://github.com/NuGet/Home/issues/10586#issuecomment-810264508) refutes the theory.

miketimofeev commented 3 years ago

@fabriciomurta @jantoineqci we're working on a fix now to have the default source presented in nuget.config

jantoineqci commented 3 years ago

@miketimofeev Thank you! how long will the fix take to propagate? I will put a continue-on-error on my add source step so the fix doesn't break my build again.

miketimofeev commented 3 years ago

@jantoineqci I hope we will start the deployment on Monday and it will take 3-4 days if nothing goes wrong 🤞

ArchieCoder commented 3 years ago

Can the previous working Windows image be restored? It broke us too. I'm pretty sure others will notice this.

vsafonkin commented 3 years ago

@ArchieCoder, could you try dotnet nuget add source https://api.nuget.org/v3/index.json -n nuget.org as the first step of the workflow? It should work as a temporary workaround.

ArchieCoder commented 3 years ago

@vsafonkin I have this error:

error: The name specified has already been added to the list of available package sources. Provide a unique name.

I added this line in my workflow:

12_Hack.txt

miketimofeev commented 3 years ago

@ArchieCoder could you provide the output from dotnet nuget list source?

ArchieCoder commented 3 years ago

@vsafonkin It is in the link 12_Hack.txt, sorry it was not obvious in my previous post

asklar commented 3 years ago

@miketimofeev thanks good to hear - note that in our case our projects are not using .net core 5 (nor the dotnet CLI). They're either C++, or C# UWP, yet they hit the same issue. We restore the project during msbuild by passing /restore (and /p:RestorePackagesConfig=true for C++ apps)

miketimofeev commented 3 years ago

@asklar does it mean adding nuget.org as a source doesn't help in your case? dotnet nuget add source https://api.nuget.org/v3/index.json -n nuget.org

asklar commented 3 years ago

@miketimofeev we already have nuget.org in our nuget.config files. Even when the packages fail to restore, at the end it lists the sources and nuget.org is in there:

2021-04-07T00:46:42.6006157Z D:\a\1\s\packages\e2e-test-app\windows\ReactUWPTestApp\ReactUWPTestApp.csproj : error NU1101: Unable to find package Microsoft.UI.Xaml. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [D:\a\1\s\packages\e2e-test-app\windows\ReactUWPTestApp.sln]

...

2021-04-07T00:49:50.4499469Z   NuGet Config files used:
2021-04-07T00:49:50.4500599Z       C:\Users\VssAdministrator\AppData\Roaming\NuGet\NuGet.Config
2021-04-07T00:49:50.4501691Z       C:\Program Files (x86)\NuGet\Config\Microsoft.VisualStudio.FallbackLocation.config
2021-04-07T00:49:50.4502788Z       C:\Program Files (x86)\NuGet\Config\Microsoft.VisualStudio.Offline.config
2021-04-07T00:49:50.4503733Z       C:\Program Files (x86)\NuGet\Config\Xamarin.Offline.config
2021-04-07T00:49:50.4504944Z       D:\a\1\s\vnext\NuGet.Config
2021-04-07T00:49:50.4505686Z   
2021-04-07T00:49:50.4506486Z   Feeds used:
2021-04-07T00:49:50.4507369Z       C:\Program Files (x86)\Microsoft SDKs\NuGetPackages\
2021-04-07T00:49:50.4508797Z       https://pkgs.dev.azure.com/ms/react-native/_packaging/react-native-public/nuget/v3/index.json
2021-04-07T00:49:50.4511047Z       https://api.nuget.org/v3/index.json
2021-04-07T00:49:50.4511784Z   
2021-04-07T00:49:50.4512987Z   Installed:
2021-04-07T00:49:50.4553915Z       87 package(s) to D:\a\1\s\vnext\Microsoft.ReactNative.Managed.CodeGen\Microsoft.ReactNative.Managed.CodeGen.csproj
2021-04-07T00:49:50.4555679Z       24 package(s) to D:\a\1\s\vnext\Microsoft.ReactNative.Managed\Microsoft.ReactNative.Managed.csproj
2021-04-07T00:49:50.4599808Z Done Building Project "D:\a\1\s\packages\e2e-test-app\windows\ReactUWPTestApp.sln" (Restore target(s)) -- FAILED.

asklar commented 3 years ago

it's as if nuget is requiring that a package must exist on all sources? some packages won't exist on the local sources, some might even be in private feeds (like our Azure Artifacts feed), so this seems like a bad assumption on nuget's part? CC @rainersigwald @rrelyea in case this looks familiar

fabriciomurta commented 3 years ago

For the time being, using my little action has avoided the issue when it should happen:

      - name: Ensure NuGet Source
        uses: fabriciomurta/ensure-nuget-source@v1

Basically the action will ensure there is a nuget source (regardless of the name) pointing to https://api.nuget.org/v3/index.json; if not, it will add/update nuget.org source pointing to that.

So whenever you kick in a broken runner it will just fix the source for you.

I have published the action to GitHub marketplace: https://github.com/marketplace/actions/ensure-nuget-source

So the action step may sit in your workflow and shouldn't break the CI process even after the actual fix is implemented.
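For readers who prefer not to depend on a marketplace action, the "add the source only when it is missing" idea can be approximated inline. The step below is a rough sketch and not the action's actual implementation; the job name and step name are hypothetical, and only `dotnet nuget list source` and `dotnet nuget add source` from this thread are used.

```yaml
jobs:
  build:                       # hypothetical job name
    runs-on: windows-latest
    steps:
      - name: Add nuget.org only when no source points at it
        shell: pwsh
        run: |
          $url = 'https://api.nuget.org/v3/index.json'
          # Capture and echo the current sources for diagnostics.
          $sources = dotnet nuget list source
          $sources
          if ($sources -match [regex]::Escape($url)) {
            Write-Host 'A source for nuget.org is already registered.'
          } else {
            # No registered source uses the public URL, so add one.
            dotnet nuget add source $url -n nuget.org
          }
```

Unlike the action described above, this sketch does not handle a source that is already named nuget.org but points elsewhere, nor a source that exists but is disabled.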

fabriciomurta commented 3 years ago

@miketimofeev about https://github.com/actions/virtual-environments/issues/3038#issuecomment-814979307, if we just add a step to run that command, then CI will fail when it hits a correct runner, because the source would already exist. So we should at least ignore the step's success/failure if we don't want it to break the workflow.
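One way to ignore the outcome of that step, as suggested here and earlier in the thread, is the step-level `continue-on-error` key. A minimal sketch, with a hypothetical job name:

```yaml
jobs:
  build:                       # hypothetical job name
    runs-on: windows-latest
    steps:
      - name: Add nuget.org source (temporary workaround)
        run: dotnet nuget add source https://api.nuget.org/v3/index.json -n nuget.org
        # On a healthy runner the source already exists and the command fails
        # with "The name specified has already been added..."; tolerate that.
        continue-on-error: true
```

The trade-off is that a genuine failure of the command is also swallowed, so the later restore step remains the real signal of whether the source is usable.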

maxim-lobanov commented 3 years ago

@fabriciomurta , thank you for sharing, good point!

We are trying to understand the root cause of this issue. It has existed for some time but reproduces rarely. It looks like it started to happen more often with the latest updates, but nothing obvious on the images could cause it except the VS update.

fabriciomurta commented 3 years ago

To me it is like the actual host (the physical machine hosting the runners) is providing the broken default nuget configuration. So without enforcement from the virtual environment template, the built virtual machine is just inheriting whatever's in its host, instead of ensuring the NuGet sources include the public one.

A hint of what I'm stating is that the macos and linux hosts only have the nuget.org entry, and not that Microsoft Visual Studio Offline Packages one. This is easy to see in the unit tests run by the action I wrote: https://github.com/fabriciomurta/ensure-nuget-source/runs/2287645659?check_suite_focus=true. Of course, it may be the case that dotnet/nuget/we also installs that source by default only for windows systems, so it is just a chance.

From the link above, see Test 1 in each platform; for macos and ubuntu, I needed an extra step to add a mock source due to another issue that does not allow me to remove all NuGet sources (in a check I remove the nuget.org default source to ensure the action adds it back)

note: in the action run above I am just highlighting the different nuget sources between platforms; in that action run the windows host got the correct NuGet source, so it didn't hit one of the affected runner environments; in fact I couldn't hit it in my side repository; it seems to like big repos :)

asklar commented 3 years ago

My point is that nuget is erroring out because it didn't find one of the packages in the "VS offline packages" feed, which should not be an error, it should keep trying the other sources; if it gets to try the nuget.org source it will find them.

eekamouse commented 3 years ago

I tried the workaround on our build and we are getting the following with the workaround:

"The name specified has already been added to the list of available package sources. Provide a unique name."

fabriciomurta commented 3 years ago

@asklar I agree with your point; per your logs, you are facing a little different issue than I am.

The issue I have is consistent with @jantoineqci (https://github.com/actions/virtual-environments/issues/3038#issuecomment-814904484) where the NuGet source is not set up at given windows-latest runs.

In your case it seems the sources are correctly set up yet the NuGet packages are not found, for some reason.

ArchieCoder commented 3 years ago

FYI @eekamouse @vsafonkin

Thanks @fabriciomurta, your fix works. Is it safe to keep it even after msft fixes the issue?

fabriciomurta commented 3 years ago

@ArchieCoder yes, if the NuGet source is there, the step is not doing anything, so it should be safe to be kept with or without whatever fix is implemented in this issue.

It should only potentially break CI runs if the official public NuGet repository URL stops responding at https://api.nuget.org/v3/index.json. (because it ensures you have at least one source pointing to that URL; it will add a new one if no nuget.org source exists, or update it if it points elsewhere -- that's where it could break if public NuGet URL address changed).

eekamouse commented 3 years ago

FYI @eekamouse @vsafonkin

Thanks @fabriciomurta, your fix works. Is it safe to keep it even after msft fixes the issue?

      - name: Ensure NuGet Source
        uses: fabriciomurta/ensure-nuget-source@v1

Ah. Perfect. Ya I just switched the command to do a list source and that did the trick.

fabriciomurta commented 3 years ago

@eekamouse not really... See this comment: https://github.com/actions/virtual-environments/issues/3038#issuecomment-814904484

It wouldn't fix in this case if I was just doing the list command. The action does list for diagnostics but in case the source is not there (which happened for the comment linked above), it will be added. Per the various reports around the subject we have three issues:

The action also checks whether the source is set up but disabled. I never hit this scenario, but if the source exists and is disabled, the action re-enables it. All of this uses the dotnet nuget command (so it may not work in environments without dotnet installed -- is there any windows runner without dotnet installed? :smile:)

miketimofeev commented 3 years ago

We have another report, which looks like @asklar case. The only difference we've found in the logs so far is the MSBuild version.

Failed:
2021-04-06T19:39:24.2927008Z MSBuild auto-detection: using msbuild version '16.9.0.16703' from 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Current\bin'. Use option -MSBuildVersion to force nuget to use a specific version of MSBuild.

Working:
2021-04-03T11:56:32.1196795Z MSBuild auto-detection: using msbuild version '16.9.0.11203' from 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Current\bin'. Use option -MSBuildVersion to force nuget to use a specific version of MSBuild.

For that particular customer changing .net version from 3.1 to 5 solves the issue

asklar commented 3 years ago

In our case we are using neither .net core 3 nor .net 5; we are using .net UWP.

pellared commented 3 years ago

These problems are also affecting Azure Pipelines. AFAIK the same machines are used under the hood.

Example: https://dev.azure.com/opentelemetry/pipelines/_build/results?buildId=3370&view=logs&jobId=0f2e0f6c-b584-54be-74cb-0c70b940453f&j=0f2e0f6c-b584-54be-74cb-0c70b940453f&t=2fd4f782-c69f-5b16-baf4-49a1fc40b1f6

Error NU1101: Unable to find package Microsoft.NETFramework.ReferenceAssemblies. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages

vsafonkin commented 3 years ago

@asklar, @ArchieCoder, @fabriciomurta, @pellared, could you try this step as workaround?

dotnet nuget add source https://api.nuget.org/v3/index.json -n nuget.org --configfile $env:APPDATA\NuGet\NuGet.Config

ArchieCoder commented 3 years ago

@vsafonkin Your fix works AND "uses: fabriciomurta/ensure-nuget-source@v1" also works

pellared commented 3 years ago

Hello @AaronVanGeffen , do you have nuget.config file in your repo? I think it is known issue when you don't have config file. Could you please try to add simple nuget.config like described in the message: actions/setup-dotnet#155 (comment)

This works as well

fabriciomurta commented 3 years ago

@vsafonkin I've been looking up our windows-bound workflows in the last ~10 runs or so; we didn't hit the problematic environment, so I myself can't really tell the workarounds are fixing anything.

It may be the case, as pointed at https://github.com/actions/virtual-environments/issues/3038#issuecomment-814857550, that I am hitting that scenario, and simply by checking if nuget sources beforehand is "de-triggering" the issue. But I am a little skeptical about that, I just think I didn't hit the jackpot yet. Our project was not deterministically and insistently hitting the same instance.

Actually, I noticed if the issue triggered and I re-run the job, it was falling again on the broken instance. Initiating a new action run (via another workflow_dispatch, push, etc), was throwing me to an ok instance. But again, may have been coincidence.

btw, thanks to you all for the effort on fixing the issue and providing feedback! I will make sure to post an update here if I can find anything else that could help.

vsafonkin commented 3 years ago

@asklar, @ArchieCoder, @pellared, @eekamouse, @rokups, @fabriciomurta, could you please try another workaround for your builds?

Remove-Item $env:APPDATA\NuGet\NuGet.Config

The cause of the issue is an empty config file in the user's AppData folder. We want to make sure that deleting this file works too. Thank you!
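As a workflow step, the deletion workaround might be written as in the sketch below. The job name, step name, and the `Test-Path` guard are additions for illustration; the guard is there so that the step does not fail on a runner where the file is absent.

```yaml
jobs:
  build:                       # hypothetical job name
    runs-on: windows-latest
    steps:
      - name: Remove the user-level NuGet.Config (workaround under test)
        shell: pwsh
        run: |
          $config = Join-Path $env:APPDATA 'NuGet\NuGet.Config'
          if (Test-Path $config) {
            # Log the contents first so an empty or broken file is visible in the run log.
            Get-Content $config
            Remove-Item $config
          } else {
            Write-Host "No user-level NuGet.Config found at $config"
          }
```

Placing this before any restore step lets the rest of the build show whether the deletion alone is enough.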

pellared commented 3 years ago

@vsafonkin You can make a fork from https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/commit/716974d3050afbb0d9cfa2310acfb4229ac6cb1f (commit before the workaround was applied) and experiment yourself 😉