NuGet / Home

Repo for NuGet Client issues
Other
1.5k stars 253 forks source link

RestoreTask randomly fails after upgrading to latest version #13545

Closed akoeplinger closed 2 months ago

akoeplinger commented 3 months ago

NuGet Product Used

dotnet.exe

Product Version

6.11.0-rc.90

Worked before?

NuGet in dotnet sdk 9.0-preview4

Impact

It's more difficult to complete my work

Repro Steps & Context

Since we updated to a dotnet 9.0-preview6 SDK we started seeing RestoreTask failing randomly in the VMR build: https://github.com/dotnet/sdk/issues/41477#issuecomment-2154989462

After the fix in https://github.com/NuGet/NuGet.Client/pull/5845 we bumped again (i.e. we're now using a very recent nuget.client) we now see these messages

D:\a_work\1\s.dotnet\sdk\9.0.100-preview.6.24311.23\NuGet.targets(170,5): The feed 'dotnet-public [https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/index.json]' lists package 'System.Text.Json.7.0.3' but multiple attempts to download the nupkg have failed. The feed is either invalid or required packages were removed while the current operation was in progress. Verify the package exists on the feed and try again. [D:\a_work\1\s\src\nuget-client\build\restorehelper.targets]

or

D:\a_work\1\s.dotnet\sdk\9.0.100-preview.6.24311.23\NuGet.targets(170,5): error : The feed 'dotnet-public [https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/index.json]' lists package 'Castle.Core.5.1.1' but multiple attempts to download the nupkg have failed. The feed is either invalid or required packages were removed while the current operation was in progress. Verify the package exists on the feed and try again. [D:\a_work\1\s\src\efcore\EFCore.sln]

I've attached some binlogs: binlogs.zip

It's happening for random packages and given we only started seeing this when bumping the dotnet SDK (and thus nuget) I think it's more likely a nuget bug than an issue with the AzDO feed.

Is there some way we can enable more logging?

/cc @nkolev92 @kartheekp-ms

Verbose Logs

No response

lewing commented 3 months ago

We're seeing this and #13551 repeatedly in CI is there something we can do to help get this resolved

cc @jebriede @aortiz-msft

jeffkl commented 3 months ago

I think what's going on is Azure DevOps Artifacts Services public upstreams is causing this. When a package is restored from a feed with upstream(s) for the first time, the account doing the restore must have Contributor permissions to populate the package. If the user is not a Contributor, the package will not be made available, and the server returns an HTTP 404.

https://learn.microsoft.com/en-us/azure/devops/artifacts/how-to/set-up-upstream-sources?view=azure-devops&tabs=nuget

image

The NuGet repo recently hit this problem and we had to get permissions fixed on the feed to get unblocked. I guess CI agents need to be using an account with the correct permissions in order to populate feeds? Can you confirm if this workflow is using upstream feeds?

akoeplinger commented 3 months ago

The dotnet-public feed doesn't have upstreams.

jeffkl commented 3 months ago

@nkolev92 do you know what's going on? It looks like the same symptoms where the feed lists a particular version but then NuGet receives a 404 when trying to download the package.

akoeplinger commented 3 months ago

Looking at the code this is the only place where PackageNotFoundProtocolException is thrown: https://github.com/NuGet/NuGet.Client/blob/20f05435be385abfe74737b6433dc80fd3b3b504/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L69-L88

Following ProcessNupkgStreamAsync feels like there might be a race condition hidden somewhere :)

lewing commented 3 months ago

Also tracking in https://github.com/dotnet/dnceng/issues/3100 and https://github.com/dotnet/runtime/issues/103526

nkolev92 commented 3 months ago

@lewing, @akoeplinger

I'll reach out offline as well, but if possible, can anyone try using .NET 9, P4 and seeing if the problems persist.

am11 commented 3 months ago

https://github.com/NuGet/Home/issues/13551 is tracked as https://github.com/dotnet/runtime/issues/103526. Retry always fixes the error so it's intermittent but consistently failing couple of builds a day.

kartheekp-ms commented 3 months ago

I worked with @embetten from the Azure Artifacts team. I was told that the dotnet-public feed has responded with the following status codes for the last 7 days.

Response Code Count
303 63,421,011
200 52,874,790
404 8,246,998
409 233
401 17,498
202 472
500 33

Upon checking if there are any 404s for either 'System.Text.Json.7.0.3' or 'Castle.Core.5.1.1' packages, surprisingly no records were found in telemetry and IIS/AFD logs. Emily mentioned that one potential reason why the feed responded with a 404 might be that the AzDO service was still processing the package after its push while there was a request to download it, which would have failed with a 404. For example, the 'System.Text.Json.7.0.3' package was ingested into the feed on 06/13, and the issue was reported on the same day.

nkolev92 commented 3 months ago

@kartheekp-ms @embetten

https://github.com/dotnet/sdk/issues/41590 lists recent failures, in particular: https://dev.azure.com/dnceng-public/public/_build/results?buildId=711675&view=logs&j=5960db8e-22cf-5780-65a0-f69bf3efbd20&t=663b2309-e924-59c7-4a84-d59cf2e17f86&l=1171 for example:

The feed 'net-sdk-supporting-feed [https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet9/nuget/v3/index.json]' lists package 'Roslyn.Diagnostics.Analyzers.3.11.0-beta1.24170.2'

However it was publishing on 3/20, https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet9/NuGet/Roslyn.Diagnostics.Analyzers/overview/3.11.0-beta1.24170.2.

So I don't think timing is the likely issue. Can you maybe check out the 404s for that feed?

kartheekp-ms commented 3 months ago

However it was publishing on 3/20, Can you maybe check out the 404s for that feed?

Yeah, it is the same case with 'Castle.Core.5.1.1' package for dotnet-public feed where the package was ingested to the feed on Jan 5, 2023 but the NuGet error log says that package is not found.

nkolev92 commented 3 months ago

Hey everyone,

I've done some more investigations with some help from @zivkan and @jeffkl, and while we don't have anything conclusive we do have a way forward.

Investigation summary:

As mentioned a few times, currently what's getting thrown is a PackageNotFoundException. https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L82-L88 This can happen in a situation, where the server returns 404, or 204, or 500 etc. If the http call is not successful, NuGet will retry up to 6 times: https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L260. When the calls are made, the code in https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L73-L76 won't get executed unless something was downloaded from the server.

Good news is that there's a lot of logging. We expect that 404s might be happen, and NuGet does log all http results, but it does so at normal verbosity. There are also messages in case retries happen: https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L287-L313. Unfortunately, these messages are not available to us for investigation right now, given that nearly all restores at minimal verbosity right now.

I did consider adding additional messages, but I believe there's plenty of information in the current logs (once we get the higher verbosity logs), to make further progress.

Call to action

  1. Given that there are reports of .NET 9 P4, can someone please attempt using .NET 9 P4 to help us narrow down whether there's a client regression, or if there's something happening server side.
  2. Can the restore verbosity be increased to detailed?

Beyond this, it'd also be helpful to know things like: How often are these issues happening? How many failures vs how many builds?

akoeplinger commented 3 months ago

@nkolev92 I'm having trouble figuring out how to increase the log verbosity for the RestoreTask, would you mind showing which property you had in mind?

Note that I already posted the binlogs in the original comment in case you missed it.

akoeplinger commented 3 months ago

I looked at the binlogs again and I don't see a GET request for the packages which fail later on, only for the index.json.

nkolev92 commented 3 months ago

@akoeplinger

There's no special property, the cli argument should work.

akoeplinger commented 3 months ago

so you're saying this will emit more data into the binlog?

nkolev92 commented 3 months ago

No, my bad I should've been explicit. I was just answering that part of the question.

Checking now to see if I can understand how the Get/Cache message would never happen and yet we'd get to throw that exception.

nkolev92 commented 3 months ago

The only way I could see the log message not showing and not have another message such as cancellation or a retry one, is if it somehow exited before hitting this code path: https://github.com/NuGet/NuGet.Client/blob/d1d2e260de9b8f20175e7766aa88e1ce1ece6b4e/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L258-L263.

I think we'll probably need to add some custom logging here.

WeihanLi commented 3 months ago

C:\hostedtoolcache\windows\dotnet\sdk\9.0.100-preview.5.24307.3\NuGet.targets(170,5): error MSB4181: The "RestoreTask" task returned false but did not log an error. [D:\a\1\s\WeihanLi.Web.Extensions.sln]

I may meet the same issue, success with a retry

CI: https://weihanli.visualstudio.com/Pipelines/_build/results?buildId=6428&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=7dae026f-9e99-5d65-075e-3f7579577f94&s=96ac2280-8cb4-5df5-99de-dd2da759617d build logs: https://weihanli.visualstudio.com/4e7b2e5c-fece-4f31-9ba4-4bc9d0f3fa3d/_apis/build/builds/6428/logs/11 Commit: https://github.com/WeihanLi/WeihanLi.Web.Extensions/commit/6fa2899fabbe67066ee7393abe504569f5dd505f

akoeplinger commented 3 months ago

I think we'll probably need to add some custom logging here.

@nkolev92 is this being worked on? We can do some runs on AzDO with a patched nuget.client .dll if necessary

ViktorHofer commented 3 months ago

Kindly asking for an update on this issue. This is still affecting most of our builds.

So roughly about 100 hits in the last 5 days.

nkolev92 commented 3 months ago

I was out unexpectedly all of last week. Resuming investigations today.

nkolev92 commented 3 months ago

Catching up here, @ViktorHofer, I think https://github.com/dotnet/runtime/issues/103823 looks like it's a runtime issue and something likely unrelated to this particular problem, so I'm not tracking that one as a duplicate of this.

nkolev92 commented 3 months ago

I think we'll probably need to add some custom logging here.

@nkolev92 is this being worked on? We can do some runs on AzDO with a patched nuget.client .dll if necessary

@akoeplinger

Yep, we'll have to do that. I'm adding some logging on a branch. I'll ping once I have a build.

nkolev92 commented 2 months ago

Update: Added some logging and got some results. Unfortunately those are not enough to help us narrow down the issue. We're gonna try again.

akoeplinger commented 2 months ago

We figured out the root cause: https://github.com/NuGet/NuGet.Client/pull/5905

I ran three sdk-unified-build-full builds with this change and didn't hit the issue yet :)

nkolev92 commented 2 months ago

Some more context for the followers of this issue: