Open filipnavara opened 2 years ago
Hi @filipnavara, thanks for raising this issue.
We don't see this as a workloads-specific issue, instead we see this as an issue with the way dotnet as an organization publishes packages on NuGet. We push a huge number of packages with each servicing release between workloads, various .NET Runtimes, and library releases, and several of these have similar timing issues as you describe here. Even if we had some ordering where we uploaded all workload packs prior to uploading the manifests, there's nothing enforcing on NuGet's end that indexing/CDN distribution/etc. keep that same ordering. The best advice I can give you is to delay a bit of time and try again in the general case.
The problem for individual NuGets is different since we are in control of the versions (which we update through explicit Dependabot PR, dotnet/arcade or other mechanism). However, with workloads there is no control unless we enforce the use of a full custom manifest. So the moment the new manifest is pushed to NuGet it breaks our CI pipelines. During the publishing of .NET 6.0.2 the breakage lasted approximately 2 hours. I am raising the issue in an attempt to find a way to minimize the disruption in the future. While I can easily identify what is happening and what caused the errors, others may not.
I agree that workloads are a special case here because of the way advertising manifests are automatically updated during restore.
Maybe we can skip a manifest during that automatic update until it is at least X hours old?
FWIW we hit it again when 6.0.200 SDK was being published.
We install the SDK as part of the CI pipeline. So in addition to the order of NuGets being pushed the SDK acquisition process also gets the newer SDK first [before the NuGets are uploaded]. That also doesn't seem quite right.
Hitting this presently with the current rollout for the ios workload.
sudo dotnet workload install ios
...
Workload(s) 'ios' are already installed.
Skipping NuGet package signature verification.
Installing workload manifest microsoft.net.sdk.ios version 16.2.1024…
Installing workload manifest microsoft.net.sdk.maccatalyst version 16.2.1024…
Installing workload manifest microsoft.net.sdk.macos version 13.1.1024…
Installing workload manifest microsoft.net.sdk.tvos version 16.1.1521…
Installing pack Microsoft.iOS.Sdk version 16.2.1024...
Writing workload pack installation record for Microsoft.iOS.Sdk.net7 version 16.2.1024...
Installing pack Microsoft.iOS.Sdk version 16.2.19...
Workload installation failed. Rolling back installed packs...
Rolling back pack Microsoft.iOS.Sdk installation...
Rolling back pack Microsoft.iOS.Sdk installation...
Uninstalling workload pack Microsoft.iOS.Sdk.net7 version 16.2.1024…
Workload installation failed: microsoft.ios.sdk::16.2.19 is not found in NuGet feeds https://api.nuget.org/v3/index.json".
On closer inspection, it appears that Microsoft.NET.Sdk.iOS.Manifest-7.0.100 version 16.2.1024 contains:
"packs": {
  "Microsoft.iOS.Sdk.net7": {
    "kind": "sdk",
    "version": "16.2.1024",
    "alias-to": {
      "any": "Microsoft.iOS.Sdk"
    }
  },
  "Microsoft.iOS.Sdk.net6": {
    "kind": "sdk",
    "version": "16.2.19",
    "alias-to": {
      "any": "Microsoft.iOS.Sdk"
    }
  },
And it appears that Microsoft.iOS.Sdk version 16.2.1024 has been published, but version 16.2.19 has not yet.
I suggest that there be some validation added such that a manifest can't be published until all of its dependent packages have been published first.
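A minimal sketch of the first half of such a validation: listing the packages a manifest expects to find on the feed. It assumes the manifest is strict JSON and that only the "any" alias seen in the excerpt above needs resolving; `required_packages` is a hypothetical helper name, not part of the SDK.

```python
import json


def required_packages(manifest_json: str) -> list[tuple[str, str]]:
    """Return the (package id, version) pairs a manifest expects on the feed."""
    packs = json.loads(manifest_json).get("packs", {})
    required = []
    for pack_id, pack in packs.items():
        # Assumption: "alias-to" maps a platform key to the real package id,
        # and only the "any" key from the excerpt is handled here.
        package_id = pack.get("alias-to", {}).get("any", pack_id)
        required.append((package_id, pack["version"]))
    return required


# The two packs quoted above, wrapped into a complete JSON document.
SAMPLE_MANIFEST = """
{
  "packs": {
    "Microsoft.iOS.Sdk.net7": {
      "kind": "sdk",
      "version": "16.2.1024",
      "alias-to": { "any": "Microsoft.iOS.Sdk" }
    },
    "Microsoft.iOS.Sdk.net6": {
      "kind": "sdk",
      "version": "16.2.19",
      "alias-to": { "any": "Microsoft.iOS.Sdk" }
    }
  }
}
"""

for package_id, version in required_packages(SAMPLE_MANIFEST):
    print(f"{package_id} {version}")
```

Both packs resolve to the same package id with different versions, which matches the failure: the manifest needs Microsoft.iOS.Sdk 16.2.1024 and 16.2.19, and only the former was available.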
Also note that this breaks our CI/CD builds, since the current workloads are installed with each build. We can't build until the workload install succeeds.
Note also, even after it appeared on Nuget a while later, it's still not available until fully indexed.
Thus, the validation would need to really make sure it was available on nuget, not just sent there.
FYI, it's working again now that the indexing has completed. Hope the validation can be added so we don't encounter it again next time.
@baronfel @marcpopMSFT what are our options here? Is this something that can be controlled via nuget publishing today and we need to opt in? Or is this a wider discussion we need to have?
@rbhanda in case he has any ideas.
We could potentially control the order of publish to nuget for the runtime workloads so that the packs go first and then the manifests. We could also coordinate a delayed manifest publish. We'd still have a problem if someone caught one of the manifests but not the others since there are half a dozen of them and they got into a partial state. Does nuget.org have the ability to publish first but not make them visible so we could do that all at once?
Note that the above issue was the maui workload publishing and I don't know if that's managed by Rahul or by the maui team.
Yeah, the underlying issue here is that workloads conceptually provide a mechanism to tie multiple packages together as a unit and our nuget publishing pipeline has no concept of that dependency at any level. To really handle this we'd need something like a staged transaction to the nuget database.
NuGet can "unlist" packages which hides them from search results etc (and theoretically also from workloads assuming we correctly implement this).
As far as I know nuget.org doesn't support pushing a package in the unlisted state but there's an API to unlist that we could call as soon as we push the manifest package and then relist it later once all the other packages are pushed.
Or ask the NuGet team to add a ?publishUnlisted=true parameter to the push API.
CC @JonDouglas @aortiz-msft for the interesting feature request (either batching or publish unlisted).
I'll ping @rbhanda offline and see if publishing, unlisting, and then later listing is a potential option here.
We could potentially control the order of publish to nuget for the runtime workloads
This is definitely something that we can look into, though, as noted, the package index status seems to be the crux of the issue.
This would be a NuGet.org ask so tagging @joelverhagen and @clairernovotny
As @baronfel mentioned availability order of packages on NuGet.org is not guaranteed, even if packages are pushed in some clever (e.g. reverse dependency) order. This is because our asynchronous validation pipeline (malware scanning, signing, and more) can take a different amount of time per package. Imagine there is a package in the middle of the dependency graph that takes a bit longer to malware scan or is owned by another team that runs their push at a slightly different time (this happens more than you'd think). It's sort of a messy problem.
It's not currently feasible to fix the availability order unless the actor pushing the package polls for package availability before continuing with other package pushes. If done naively this would have horrible throughput (avg validation time * total number of packages to push, so many hours for big .NET releases). It would be possible to essentially do a topological sort to identify sets of packages that can be pushed together but this is all a hack/workaround for the feature gap on NuGet.org.
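The topological-sort idea can be sketched as follows: group packages into batches so that everything in a batch depends only on earlier batches, then wait for availability between batches rather than between individual packages. This is an illustration only; the function name and the package names in the example are hypothetical.

```python
def push_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group packages into batches that can be pushed together.

    `dependencies` maps a package to the packages it depends on. Every package
    in a batch depends only on packages from earlier batches.
    """
    remaining = {pkg: set(deps) for pkg, deps in dependencies.items()}
    # Packages that appear only as dependencies have no dependencies themselves.
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    batches = []
    while remaining:
        ready = sorted(pkg for pkg, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        batches.append(ready)
        for pkg in ready:
            del remaining[pkg]
        for deps in remaining.values():
            deps.difference_update(ready)
    return batches


# Hypothetical graph: a manifest needs two packs, and one pack needs the other.
graph = {"manifest": {"pack-a", "pack-b"}, "pack-b": {"pack-a"}}
print(push_batches(graph))  # [['pack-a'], ['pack-b'], ['manifest']]
```

With this grouping the waiting cost scales with the depth of the dependency graph instead of the total number of packages, which is the throughput concern raised above.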
This general feature request is already tracked here https://github.com/NuGet/NuGetGallery/issues/3931. Feel free to add additional comments or upvote. I think the described solution ("staging") is the Proper fix for the problem, but it's a lot of work and needs some exploration with stakeholders to make sure the staging works as needed. I can say it's not on our radar for the next 6 months given our other priorities and team capacity.
We could potentially control the order of publish to nuget for the runtime workloads so that the packs go first and then the manifests. We could also coordinate a delayed manifest publish. We'd still have a problem if someone caught one of the manifests but not the others since there are half a dozen of them and they got into a partial state. Does nuget.org have the ability to publish first but not make them visible so we could do that all at once?
Responding to @marcpopMSFT's https://github.com/dotnet/sdk/issues/23820#issuecomment-1421757609, it is possible to publish packages as unlisted. This can be done by using the "unlist" gesture (nuget.exe, .NET CLI, API endpoint) immediately after pushing, while the package is still in the validating state. This will always be the case since validation takes 2+ minutes and you can unlist immediately after the push request completes. Then, after you have detected that packages should be listed (i.e. the full dependency graph is available), you can use the "relist" gesture (no CLI support, but has API support).
As far as I know nuget.org doesn't support pushing a package in the unlisted state but there's an API to unlist that we could call as soon as we push the manifest package and then relist it later once all the other packages are pushed.
Responding to @akoeplinger's https://github.com/dotnet/sdk/issues/23820#issuecomment-1424349420, yes that's right. As mentioned to Marc above, you can unlist immediately after push so it can work today. Feel free to open an issue about enhancing the push protocol to include a "listed = false" parameter. This would remove the additional round trip, eliminate any crazy race condition caused by a slow/errored unlist or hyper fast validation, and provide parity with the UI upload flow which currently allows uploading a package as unlisted.
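A rough sketch of the unlist-then-relist gestures at the HTTP level, based on my reading of the NuGet "Push and Delete" API documentation: on nuget.org a DELETE on the package URL is treated as an unlist, and a POST to the same URL relists. The base URL below is assumed to be nuget.org's publish endpoint (normally discovered through the service index), and `listing_request` is a hypothetical helper. The requests are only constructed here, not sent.

```python
import urllib.request

# Assumption: nuget.org's publish endpoint, normally read from the service index.
PUBLISH_BASE = "https://www.nuget.org/api/v2/package"


def listing_request(package_id: str, version: str, api_key: str,
                    listed: bool) -> urllib.request.Request:
    """Build the request that relists (POST) or unlists (DELETE) a version."""
    method = "POST" if listed else "DELETE"
    return urllib.request.Request(
        f"{PUBLISH_BASE}/{package_id}/{version}",
        headers={"X-NuGet-ApiKey": api_key},
        method=method,
    )


# Hypothetical package id and placeholder key, for illustration only.
unlist = listing_request("Example.Workload.Manifest", "1.0.0", "<api-key>", listed=False)
relist = listing_request("Example.Workload.Manifest", "1.0.0", "<api-key>", listed=True)
print(unlist.get_method(), unlist.full_url)
print(relist.get_method(), relist.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the call. A release pipeline would unlist right after each manifest push and relist all manifests once the packs are confirmed available.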
FYI, the best way to detect if a version is available is to use this API endpoint (a HEAD request should be fine): https://learn.microsoft.com/en-us/nuget/api/package-base-address-resource#download-package-content-nupkg. This endpoint has good performance and availability globally and properly caches 200s but not 404s.
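A minimal sketch of that HEAD check. It assumes nuget.org's package base address is https://api.nuget.org/v3-flatcontainer (a robust client would read it from the service index) and that the version passed in is already in normalized form; the function names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: nuget.org's package base address, normally read from index.json.
FLAT_CONTAINER = "https://api.nuget.org/v3-flatcontainer"


def nupkg_url(package_id: str, version: str, base: str = FLAT_CONTAINER) -> str:
    """Build the .nupkg download URL; the id and version must be lowercased."""
    pid, ver = package_id.lower(), version.lower()
    return f"{base}/{pid}/{ver}/{pid}.{ver}.nupkg"


def is_available(package_id: str, version: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request for the .nupkg succeeds, False on 404."""
    request = urllib.request.Request(nupkg_url(package_id, version), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # other failures (throttling, server errors) are left to the caller


print(nupkg_url("Microsoft.iOS.Sdk", "16.2.19"))
```

A HEAD request avoids downloading the package body, so the check stays cheap even when run against every pack a manifest references.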
Thanks for the detailed response @joelverhagen. Glad to know there's a 2 minute validation window during which we can unlist.
I met with our release team and we have a low cost proposal for the next release. The plan is to publish all .NET packages excluding the manifest packages, poll for availability (which I understand they already do), and then publish the manifest packages as a separate step afterwards. There is potential delay from the polling and there is still the potential for customer impact since the dozen or so manifests would still potentially light up at different times for different customers but it would reduce that window from 30+ minutes down to a much smaller window.
If that goes well, we'll continue with that option. If there are still issues, we can explore the option of pushing unlisted and then listing all of the manifests together. I'll be meeting with some Maui folks later today and suggesting they follow the same pattern as their publish is a separate process from the core .NET package publish today.
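The "poll for availability, then publish the manifests" gate could look roughly like the sketch below. It is not the release team's actual implementation: the availability check is injected as a callable (for example, a HEAD request against the feed), and the interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable


def wait_until_available(
    packages: Iterable[tuple[str, str]],
    is_available: Callable[[str, str], bool],
    interval_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every (id, version) pair is available or the timeout expires."""
    pending = set(packages)
    deadline = clock() + timeout_seconds
    while pending:
        # Only re-check packages that were still missing on the previous pass.
        pending = {pkg for pkg in pending if not is_available(*pkg)}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
    return True


# Usage with a stand-in check that succeeds on the third poll.
attempts = {"count": 0}


def fake_check(package_id: str, version: str) -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ok = wait_until_available([("Microsoft.iOS.Sdk", "16.2.19")], fake_check,
                          sleep=lambda seconds: None)
print(ok, attempts["count"])  # True 3
```

A pipeline would push the packs, call this with the full pack list, and push the manifests only when it returns True.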
Hitting this again presently.
Workload update failed: microsoft.ios.sdk::16.2.29 is not found in NuGet feeds https://api.nuget.org/v3/index.json".
Related to rollout of https://www.nuget.org/packages/Microsoft.NET.Sdk.iOS.Manifest-7.0.100/16.2.1040
Should be resolved when indexing is complete.
Working again now.
@marcpopMSFT this seems to be happening again with today's release, is the proposed mitigation you outlined supposed to happen this release already or in some future release?
Thanks for the detailed response @joelverhagen. Glad to know there's a 2 minute validation window during which we can unlist.
I met with our release team and we have a low cost proposal for the next release. The plan is to publish all .NET packages excluding the manifest packages, poll for availability (which I understand they already do), and then publish the manifest packages as a separate step afterwards. There is potential delay from the polling and there is still the potential for customer impact since the dozen or so manifests would still potentially light up at different times for different customers but it would reduce that window from 30+ minutes down to a much smaller window.
If that goes well, we'll continue with that option. If there are still issues, we can explore the option of pushing unlisted and then listing all of the manifests together. I'll be meeting with some Maui folks later today and suggesting they follow the same pattern as their publish is a separate process from the core .NET package publish today.
Working again now. Is this method used this time when publishing 16.2.46 ?
@akoeplinger @liejuntao001 can you confirm which component was out of order? We're working with release teams but it's mostly a manual process to ensure that the manifests get published last and that may have been missed this release.
Some reports I saw:
Workload installation failed: microsoft.netcore.app.runtime.mono.android-x64::6.0.16 is not found in NuGet feeds https://api.nuget.org/v3/index.json".
Workload installation failed: Workload manifest dependency 'Microsoft.NET.Workload.Emscripten.net6' version '7.0.4' is lower than version '7.0.5' required by manifest 'microsoft.net.workload.mono.toolchain.net6' [/Users/runner/hostedtoolcache/dotnet/sdk-manifests/7.0.100/microsoft.net.workload.mono.toolchain.net6/WorkloadManifest.json]
poll for availability (which I understand they already do), and then publish the manifest packages as a separate step afterwards.
@marcpopMSFT can you share some details on the "poll for availability" step? Is there an example of how the .NET release team did it?
This is not implemented in any of the MAUI release pipelines yet.
I saw these from our Azure pipeline builds:
2023-04-11T14:22:03.8990170Z Workload installation failed: microsoft.netcore.app.runtime.mono.android-x64::6.0.16 is not found in NuGet feeds https://api.nuget.org/v3/index.json".
2023-04-11T15:10:42.2530970Z Workload installation failed: microsoft.ios.sdk::16.2.46 is not found in NuGet feeds https://api.nuget.org/v3/index.json".
2023-04-11T15:43:07.5734510Z Workload installation failed: microsoft.ios.sdk::16.2.46 is not found in NuGet feeds https://api.nuget.org/v3/index.json".
later worked at 2023-04-11T17:21:17.8915980Z Successfully installed workload(s) android ios maui wasm-tools.
The fact we have some reports of:
2023-04-11T14:22:03.8990170Z Workload installation failed: microsoft.netcore.app.runtime.mono.android-x64::6.0.16 is not found in NuGet feeds https://api.nuget.org/v3/index.json".
Seems like maybe the .NET release side wasn't working either?
I just checked with our release team. It looks like they had not recorded the outcome of our planned release ordering in their release steps and so had missed delaying the manifest publish until after all of the other packages are published. They plan on updating that process and hopefully this will be improved in May.
To summarize, the plan is to take the manifests out of the normal release process, publish all other packages, and then publish the manifests. This doesn't guarantee that everything will work, as it depends on when the packages become available on nuget.org, but it will greatly reduce the window where things might be broken (the current publish takes >30 minutes and the manifests were publishing near the beginning of that process, creating a large window where all workload commands would fail). We're still asking for a feature from nuget to link packages together and make them all available at the same time.
@marcpopMSFT this plan didn't quite work for us in practice -- we published the Android manifest last, and yet it was the first available on NuGet.org. It is just such a smaller package, I think it made it through the queue quickly.
So, I'm interested in how we "poll for availability" on NuGet. We could also put a 30 min. delay in the release pipeline, but I'm not sure that is a 100% solution. I believe NuGet was quite busy this last release day, and it took around an hour for things to resolve.
Is there any way an internal validation check could be added to dotnet workload install/update commands such that the previous version of a workload would continue to be used until the new version is fully published? It seems like it could send HEAD requests to nuget (as mentioned before) to validate the manifest before attempting to install any packages. If any package in the latest manifest isn't available, then publishing is in-progress so use the previous version and continue without failure. This would keep end-user CI builds from breaking, regardless of what changes are made to the publishing workflow.
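The fallback being proposed can be sketched as a selection over candidate manifests, newest first. This only illustrates the idea and is not how dotnet workload install behaves today; the function name, the "previous" manifest label, and its pack version are hypothetical.

```python
from typing import Callable, Optional, Sequence

Pack = tuple[str, str]  # (package id, version)


def pick_installable_manifest(
    candidates: Sequence[tuple[str, Sequence[Pack]]],
    is_available: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the newest manifest whose packs are all available, else None.

    `candidates` is ordered newest first; each entry pairs a manifest version
    with the packs that manifest references.
    """
    for manifest_version, packs in candidates:
        if all(is_available(package_id, version) for package_id, version in packs):
            return manifest_version
    return None


# Stand-in for the feed state during the rollout described in this thread:
# 16.2.1024 of the pack is published, 16.2.19 is not yet.
published = {("Microsoft.iOS.Sdk", "16.2.1024"), ("Microsoft.iOS.Sdk", "1.0.0-previous")}
candidates = [
    ("16.2.1024", [("Microsoft.iOS.Sdk", "16.2.1024"), ("Microsoft.iOS.Sdk", "16.2.19")]),
    ("previous", [("Microsoft.iOS.Sdk", "1.0.0-previous")]),  # hypothetical older manifest
]
print(pick_installable_manifest(candidates, lambda i, v: (i, v) in published))  # previous
```

The newest manifest is skipped because one of its packs is missing, so the install would continue on the older manifest instead of failing.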
We found an HttpClient example that could be used in a release pipeline: https://github.com/xamarin/AndroidX/blob/main/build/scripts/update-config.csx#L216-L224
It could also run in a script via curl, etc.
I think that HttpClient example will download the whole package if it is found. I tried sending a HEAD request to that nuget v2 api url but it didn't work (returned 404 for an existing package)...
We'll probably need to use the v3 api instead which is a bit more involved since you need to request the index.json and then follow the RegistrationBaseURL: https://learn.microsoft.com/en-us/nuget/api/registration-base-url-resource
It might be worth raising this with the nuget.org team and seeing what they recommend. http polling could potentially work but feels a bit clunky.
For workload install, we install directly over the existing manifest and only then know what packs to look for. We'd have to do something complicated like install the manifests to a staging location, check if the packs are available, then try a different manifest, and repeat until we found one that had packs. That's complexity I'd like to avoid.
We could potentially generate some sort of meta-manifest that includes the manifest versions and the pack versions in it and check for all of them being available first. That's only marginal complexity over some other plans we've been discussing but it does make that file more complicated (ie it'll have dozens of versions instead of just the ~8 manifest versions).
If a specific publishing order is required for your scenario, then polling is the best option today. As mentioned above https://github.com/dotnet/sdk/issues/23820#issuecomment-1428335804, this may not be needed if publishing as unlisted is good enough, and then relisting later (at some explicit, user-defined "release time").
We'll probably need to use the v3 api instead which is a bit more involved since you need to request the index.json and then follow the RegistrationBaseURL: https://learn.microsoft.com/en-us/nuget/api/registration-base-url-resource
Using V3 is recommended, yes. I recommend doing a HEAD on the .nupkg URL: https://learn.microsoft.com/en-us/nuget/api/package-base-address-resource#download-package-content-nupkg Or checking if the version is in the version list: https://learn.microsoft.com/en-us/nuget/api/package-base-address-resource#enumerate-package-versions
If you want to avoid writing the protocol client yourself, you can use the client SDK, which does the service index caching and URL building for you: https://learn.microsoft.com/en-us/nuget/reference/nuget-client-sdk#list-package-versions
Is today Jun 13, 2023 another release day?
Error:
2023-06-13T13:41:47.7597280Z Installing workload manifest microsoft.net.sdk.android version 32.0.485…
2023-06-13T13:41:48.0692910Z Installing workload manifest microsoft.net.sdk.ios version 16.4.47…
2023-06-13T13:41:48.2032290Z Installing workload manifest microsoft.net.sdk.maccatalyst version 16.4.47…
2023-06-13T13:41:48.3330710Z Installing workload manifest microsoft.net.sdk.macos version 13.3.47…
2023-06-13T13:41:48.4611660Z Installing workload manifest microsoft.net.sdk.maui version 6.0.553…
2023-06-13T13:41:48.5974130Z Installing workload manifest microsoft.net.sdk.tvos version 16.4.47…
2023-06-13T13:41:48.7284560Z Installing workload manifest microsoft.net.workload.mono.toolchain version 6.0.18…
2023-06-13T13:41:48.8666250Z Installing workload manifest microsoft.net.workload.emscripten version 6.0.16…
2023-06-13T13:41:48.9988440Z Workload installation failed. Rolling back installed packs...
2023-06-13T13:41:49.0023930Z Workload installation failed: Workload manifest dependency 'Microsoft.NET.Workload.Emscripten' version '6.0.16' is lower than version '6.0.18' required by manifest 'microsoft.net.workload.mono.toolchain' [/Users/runner/hostedtoolcache/dotnet/sdk-manifests/6.0.400/microsoft.net.workload.mono.toolchain/WorkloadManifest.json]
2023-06-13T13:41:49.1647860Z Unhandled exception: Microsoft.NET.Sdk.WorkloadManifestReader.WorkloadManifestCompositionException: Workload manifest dependency 'Microsoft.NET.Workload.Emscripten' version '6.0.16' is lower than version '6.0.18' required by manifest 'microsoft.net.workload.mono.toolchain' [/Users/runner/hostedtoolcache/dotnet/sdk-manifests/6.0.400/microsoft.net.workload.mono.toolchain/WorkloadManifest.json]
2023-06-13T13:41:49.2585920Z at Microsoft.NET.Sdk.WorkloadManifestReader.WorkloadResolver.ComposeWorkloadManifests()
2023-06-13T13:41:49.2871030Z at Microsoft.NET.Sdk.WorkloadManifestReader.WorkloadResolver.Create(IWorkloadManifestProvider manifestProvider, String dotnetRootPath, String sdkVersion, String userProfileDir)
2023-06-13T13:41:49.2878890Z at Microsoft.DotNet.Workloads.Workload.InstallingWorkloadCommand..ctor(ParseResult parseResult, IReporter reporter, IWorkloadResolver workloadResolver, IInstaller workloadInstaller, INuGetPackageDownloader nugetPackageDownloader, IWorkloadManifestUpdater workloadManifestUpdater, String dotnetDir, String userProfileDir, String tempDirPath, String version, String installedFeatureBand)
2023-06-13T13:41:49.2886100Z at Microsoft.DotNet.Workloads.Workload.Install.WorkloadInstallCommand..ctor(ParseResult parseResult, IReporter reporter, IWorkloadResolver workloadResolver, IInstaller workloadInstaller, INuGetPackageDownloader nugetPackageDownloader, IWorkloadManifestUpdater workloadManifestUpdater, String dotnetDir, String userProfileDir, String tempDirPath, String version, IReadOnlyCollection`1 workloadIds, String installedFeatureBand)
2023-06-13T13:41:49.2897470Z at Microsoft.DotNet.Cli.WorkloadInstallCommandParser.<>c.<ConstructCommand>b__6_0(ParseResult parseResult)
2023-06-13T13:41:49.2903380Z at Microsoft.DotNet.Cli.ParseResultCommandHandler.Invoke(InvocationContext context)
2023-06-13T13:41:49.2916790Z at System.CommandLine.Invocation.InvocationPipeline.<>c__DisplayClass4_0.<<BuildInvocationChain>b__0>d.MoveNext()
2023-06-13T13:41:49.2930070Z --- End of stack trace from previous location ---
2023-06-13T13:41:49.2933350Z at System.CommandLine.CommandLineBuilderExtensions.<>c__DisplayClass12_0.<<UseHelp>b__0>d.MoveNext()
2023-06-13T13:41:49.2936460Z --- End of stack trace from previous location ---
2023-06-13T13:41:49.2939990Z at System.CommandLine.CommandLineBuilderExtensions.<>c.<<UseSuggestDirective>b__18_0>d.MoveNext()
2023-06-13T13:41:49.2943050Z --- End of stack trace from previous location ---
2023-06-13T13:41:49.2945930Z at System.CommandLine.CommandLineBuilderExtensions.<>c__DisplayClass16_0.<<UseParseDirective>b__0>d.MoveNext()
2023-06-13T13:41:49.2950020Z --- End of stack trace from previous location ---
2023-06-13T13:41:49.2952810Z at System.CommandLine.CommandLineBuilderExtensions.<>c__DisplayClass8_0.<<UseExceptionHandler>b__0>d.MoveNext()
2023-06-13T13:41:49.3056080Z ##[error]Bash exited with code '1'.
@liejuntao001 yes, today is another monthly servicing release
This problem also shows up with version wildcards and transitive dependencies in our regular dev pipelines, nuget really needs a transactional publish https://github.com/dotnet/performance/issues/3164
Any updates on this issue? This is still causing lost work hours for me and my team every time a workload update is released.
With the nature of how NuGet works, I understand that despite the order that packages are pushed they may well become available in a different order.
@marcpopMSFT I understand resisting complexity, but surely the need for an always-working solution outweighs any issues caused by the added complexity. This issue has become increasingly frustrating.
@joelverhagen Perhaps NuGet might add a flag for packages to make them unavailable unless their complete dependency tree is resolvable.
@rbhanda who manages the release. We should be pushing all of the packs before pushing any of the manifests to avoid this but I believe there can still be CDN availability issues. I'm not sure how to solve for sure without a bunch of nuget.org side features.
@JonDouglas to comment on the possibility of either pushing unlisted packages and then listing or pushing multiple related packages as a unit.
today's another release day?
It is just such a smaller package, I think it made it through the queue quickly.
can push manifest package in completely separate "next" step in ci, contingent on "push everything else in random order" step?
today's another release day?
No. It's always second Tuesday of the month.
must be some other explanation for this error
Workload installation failed: Version 16.4.7142 of package microsoft.ios.sdk is not found in NuGet feeds https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet8/nuget/v3/index.json;https://api.nuget.org/v3/index.json".
it was updated two weeks ago https://github.com/xamarin/xamarin-macios/pull/20348. our workaround was to download https://github.com/xamarin/xamarin-macios/blob/main/NuGet.config in current directory, then run sudo dotnet workload restore
Describe the bug
When a new .NET servicing release is pushed, a lot of packages are pushed to the NuGet feed. Among those packages are the workload manifests and the workload packages referenced from those manifests. There is currently no enforced order in which these assets are uploaded, which can cause the manifest feed to refer to non-existent packages. This will cause temporary failures of the dotnet workload install command. Example: