Closed akoeplinger closed 2 months ago
We're seeing this and #13551 repeatedly in CI is there something we can do to help get this resolved
cc @jebriede @aortiz-msft
I think what's going on is Azure DevOps Artifacts Services public upstreams is causing this. When a package is restored from a feed with upstream(s) for the first time, the account doing the restore must have Contributor permissions to populate the package. If the user is not a Contributor, the package will not be made available, and the server returns an HTTP 404.
The NuGet repo recently hit this problem and we had to get permissions fixed on the feed to get unblocked. I guess CI agents need to be using an account with the correct permissions in order to populate feeds? Can you confirm if this workflow is using upstream feeds?
The dotnet-public feed doesn't have upstreams.
@nkolev92 do you know what's going on? It looks like the same symptoms where the feed lists a particular version but then NuGet receives a 404 when trying to download the package.
Looking at the code this is the only place where PackageNotFoundProtocolException is thrown: https://github.com/NuGet/NuGet.Client/blob/20f05435be385abfe74737b6433dc80fd3b3b504/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L69-L88
Following ProcessNupkgStreamAsync
feels like there might be a race condition hidden somewhere :)
@lewing, @akoeplinger
I'll reach out offline as well, but if possible, can anyone try using .NET 9, P4 and seeing if the problems persist.
https://github.com/NuGet/Home/issues/13551 is tracked as https://github.com/dotnet/runtime/issues/103526. Retry always fixes the error so it's intermittent but consistently failing couple of builds a day.
I worked with @embetten from the Azure Artifacts team. I was told that the dotnet-public
feed has responded with the following status codes for the last 7 days.
Response Code | Count |
---|---|
303 | 63,421,011 |
200 | 52,874,790 |
404 | 8,246,998 |
409 | 233 |
401 | 17,498 |
202 | 472 |
500 | 33 |
Upon checking if there are any 404s for either 'System.Text.Json.7.0.3' or 'Castle.Core.5.1.1' packages, surprisingly no records were found in telemetry and IIS/AFD logs. Emily mentioned that one potential reason why the feed responded with a 404 might be that the AzDO service was still processing the package after its push while there was a request to download it, which would have failed with a 404. For example, the 'System.Text.Json.7.0.3' package was ingested into the feed on 06/13, and the issue was reported on the same day.
@kartheekp-ms @embetten
https://github.com/dotnet/sdk/issues/41590 lists recent failures, in particular: https://dev.azure.com/dnceng-public/public/_build/results?buildId=711675&view=logs&j=5960db8e-22cf-5780-65a0-f69bf3efbd20&t=663b2309-e924-59c7-4a84-d59cf2e17f86&l=1171 for example:
The feed 'net-sdk-supporting-feed [https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet9/nuget/v3/index.json]' lists package 'Roslyn.Diagnostics.Analyzers.3.11.0-beta1.24170.2'
However it was publishing on 3/20, https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet9/NuGet/Roslyn.Diagnostics.Analyzers/overview/3.11.0-beta1.24170.2.
So I don't think timing is the likely issue. Can you maybe check out the 404s for that feed?
However it was publishing on 3/20, Can you maybe check out the 404s for that feed?
Yeah, it is the same case with 'Castle.Core.5.1.1' package for dotnet-public
feed where the package was ingested to the feed on Jan 5, 2023 but the NuGet error log says that package is not found.
Hey everyone,
I've done some more investigations with some help from @zivkan and @jeffkl, and while we don't have anything conclusive we do have a way forward.
Investigation summary:
As mentioned a few times, currently what's getting thrown is a PackageNotFoundException. https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L82-L88 This can happen in a situation, where the server returns 404, or 204, or 500 etc. If the http call is not successful, NuGet will retry up to 6 times: https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L260. When the calls are made, the code in https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L73-L76 won't get executed unless something was downloaded from the server.
Good news is that there's a lot of logging. We expect that 404s might be happen, and NuGet does log all http results, but it does so at normal verbosity. There are also messages in case retries happen: https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L287-L313. Unfortunately, these messages are not available to us for investigation right now, given that nearly all restores at minimal verbosity right now.
I did consider adding additional messages, but I believe there's plenty of information in the current logs (once we get the higher verbosity logs), to make further progress.
Beyond this, it'd also be helpful to know things like: How often are these issues happening? How many failures vs how many builds?
@nkolev92 I'm having trouble figuring out how to increase the log verbosity for the RestoreTask, would you mind showing which property you had in mind?
Note that I already posted the binlogs in the original comment in case you missed it.
I looked at the binlogs again and I don't see a GET request for the packages which fail later on, only for the index.json.
@akoeplinger
There's no special property, the cli argument should work.
so you're saying this will emit more data into the binlog?
No, my bad I should've been explicit. I was just answering that part of the question.
Checking now to see if I can understand how the Get/Cache message would never happen and yet we'd get to throw that exception.
The only way I could see the log message not showing and not have another message such as cancellation or a retry one, is if it somehow exited before hitting this code path: https://github.com/NuGet/NuGet.Client/blob/d1d2e260de9b8f20175e7766aa88e1ce1ece6b4e/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L258-L263.
I think we'll probably need to add some custom logging here.
C:\hostedtoolcache\windows\dotnet\sdk\9.0.100-preview.5.24307.3\NuGet.targets(170,5): error MSB4181: The "RestoreTask" task returned false but did not log an error. [D:\a\1\s\WeihanLi.Web.Extensions.sln]
I may meet the same issue, success with a retry
CI: https://weihanli.visualstudio.com/Pipelines/_build/results?buildId=6428&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=7dae026f-9e99-5d65-075e-3f7579577f94&s=96ac2280-8cb4-5df5-99de-dd2da759617d build logs: https://weihanli.visualstudio.com/4e7b2e5c-fece-4f31-9ba4-4bc9d0f3fa3d/_apis/build/builds/6428/logs/11 Commit: https://github.com/WeihanLi/WeihanLi.Web.Extensions/commit/6fa2899fabbe67066ee7393abe504569f5dd505f
I think we'll probably need to add some custom logging here.
@nkolev92 is this being worked on? We can do some runs on AzDO with a patched nuget.client .dll if necessary
Kindly asking for an update on this issue. This is still affecting most of our builds.
So roughly about 100 hits in the last 5 days.
I was out unexpectedly all of last week. Resuming investigations today.
Catching up here, @ViktorHofer, I think https://github.com/dotnet/runtime/issues/103823 looks like it's a runtime issue and something likely unrelated to this particular problem, so I'm not tracking that one as a duplicate of this.
I think we'll probably need to add some custom logging here.
@nkolev92 is this being worked on? We can do some runs on AzDO with a patched nuget.client .dll if necessary
@akoeplinger
Yep, we'll have to do that. I'm adding some logging on a branch. I'll ping once I have a build.
Update: Added some logging and got some results. Unfortunately those are not enough to help us narrow down the issue. We're gonna try again.
We figured out the root cause: https://github.com/NuGet/NuGet.Client/pull/5905
I ran three sdk-unified-build-full
builds with this change and didn't hit the issue yet :)
Some more context for the followers of this issue:
NuGet Product Used
dotnet.exe
Product Version
6.11.0-rc.90
Worked before?
NuGet in dotnet sdk 9.0-preview4
Impact
It's more difficult to complete my work
Repro Steps & Context
Since we updated to a dotnet 9.0-preview6 SDK we started seeing RestoreTask failing randomly in the VMR build: https://github.com/dotnet/sdk/issues/41477#issuecomment-2154989462
After the fix in https://github.com/NuGet/NuGet.Client/pull/5845 we bumped again (i.e. we're now using a very recent nuget.client) we now see these messages
or
I've attached some binlogs: binlogs.zip
It's happening for random packages and given we only started seeing this when bumping the dotnet SDK (and thus nuget) I think it's more likely a nuget bug than an issue with the AzDO feed.
Is there some way we can enable more logging?
/cc @nkolev92 @kartheekp-ms
Verbose Logs
No response