dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.
MIT License

Failures to restore from Azure artifacts feeds due to throttling #4190

Closed riarenas closed 4 years ago

riarenas commented 5 years ago

We have started seeing some throttling errors when attempting to restore NuGet packages from Azure Artifacts that manifest like this:

/root/coresetup/.dotnet/sdk/3.0.100/NuGet.targets(123,5): error : Failed to download package 'transport.runtime.linux-musl-x64.Microsoft.NETCore.Jit.3.1.0-preview1.19504.3' from 'https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/9c2ea29a-00e0-4bae-b470-161fdab1f360/nuget/v3/flat2/transport.runtime.linux-musl-x64.microsoft.netcore.jit/3.1.0-preview1.19504.3/transport.runtime.linux-musl-x64.microsoft.netcore.jit.3.1.0-preview1.19504.3.nupkg'.
/root/coresetup/.dotnet/sdk/3.0.100/NuGet.targets(123,5): error : Response status code does not indicate success: 429 (Request was blocked due to exceeding usage of resource 'Concurrency' in namespace 'IPAddress'. For more information on why your request was blocked, see the topic "Rate limits" on the Microsoft Web site (https://go.microsoft.com/fwlink/?LinkId=823950). (DevOps Activity ID: 5B41D91F-6ED5-41D1-814B-0328F8821422)).
##[error]/root/coresetup/.dotnet/sdk/3.0.100/NuGet.targets(123,5): error : Failed to download package 'transport.runtime.linux-musl-x64.Microsoft.NETCore.Jit.3.1.0-preview1.19504.3' from 'https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/9c2ea29a-00e0-4bae-b470-161fdab1f360/nuget/v3/flat2/transport.runtime.linux-musl-x64.microsoft.netcore.jit/3.1.0-preview1.19504.3/transport.runtime.linux-musl-x64.microsoft.netcore.jit.3.1.0-preview1.19504.3.nupkg'.

This does not appear to be causing widespread failures so far, but as we increase our reliance on these feeds, we're starting to get more reports.

Example builds where this has been seen: https://dev.azure.com/dnceng/public/_build/results?buildId=398862 https://dnceng.visualstudio.com/internal/_build/results?buildId=384184

All unauthenticated NuGet requests are getting thrown into the same throttling bucket by AzDO, and due to the multi-feed lookup for each package that NuGet does, depending on how many feeds you have in your NuGet config, and how many packages you need to restore, the problem gets worse.
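To make the scaling concrete, here is a rough back-of-the-envelope sketch. It assumes the simplest possible model, where every package is looked up on every configured feed; this is not NuGet's actual algorithm, and the function name and the numbers are hypothetical, chosen only for illustration.

```python
# Rough model of restore fan-out. This is NOT NuGet's actual algorithm.
# Assumption: every package is looked up once on every configured feed.


def estimated_lookups(package_count: int, feed_count: int) -> int:
    """Upper-bound estimate of feed lookups for a single restore."""
    if package_count < 0 or feed_count < 0:
        raise ValueError("counts must be non-negative")
    return package_count * feed_count


if __name__ == "__main__":
    # Hypothetical numbers: 200 packages restored against 1, 5, and 10 feeds.
    for feeds in (1, 5, 10):
        print(feeds, "feeds ->", estimated_lookups(200, feeds), "lookups")
```

Under this model the request count grows linearly with the number of feeds, which is why trimming the feed list in NuGet.config reduces the load on the shared throttling bucket.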

The short term suggestions from the AzDO team are:

We are still in talks with the AzDO team, as these workarounds would require a lot of changes to our infrastructure.

JohnTortugo commented 5 years ago

Happened again here: https://dev.azure.com/dnceng/public/_build/results?buildId=399891&view=logs&j=09811346-8274-5b72-2d96-1dd38f87c84b

riarenas commented 5 years ago

Took me a while to track the exact configuration for NuGet.config (it's not documented yet). Updated the issue description with it:

    <config>
        <add key='maxHttpRequestsPerSource' value='2' />
    </config>
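
To illustrate what a per-source request cap does conceptually, here is a minimal Python sketch. It is not NuGet's implementation: the `PerSourceLimiter` class, the `restore` helper, and the feed and package names are all hypothetical, and the sleep stands in for a real download.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class PerSourceLimiter:
    """Caps in-flight requests per source (illustrative, not NuGet code)."""

    def __init__(self, max_per_source):
        self._max = max_per_source
        self._lock = threading.Lock()
        self._semaphores = {}
        self.in_flight = {}
        self.peak = {}

    def _semaphore_for(self, source):
        # Lazily create one semaphore per source, guarded by the lock.
        with self._lock:
            if source not in self._semaphores:
                self._semaphores[source] = threading.BoundedSemaphore(self._max)
                self.in_flight[source] = 0
                self.peak[source] = 0
            return self._semaphores[source]

    def fetch(self, source, package):
        # Block until one of this source's slots is free.
        with self._semaphore_for(source):
            with self._lock:
                self.in_flight[source] += 1
                self.peak[source] = max(self.peak[source], self.in_flight[source])
            try:
                time.sleep(0.01)  # stand-in for the real HTTP download
                return f"{source}/{package}"
            finally:
                with self._lock:
                    self.in_flight[source] -= 1


def restore(packages, source, max_per_source=2, workers=8):
    """Fetch all packages from one source; return results and peak concurrency."""
    limiter = PerSourceLimiter(max_per_source)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: limiter.fetch(source, p), packages))
    return results, limiter.peak.get(source, 0)


if __name__ == "__main__":
    results, peak = restore([f"pkg{i}" for i in range(10)], "feed-a")
    print(len(results), "packages fetched, peak concurrency", peak)
```

Even with eight worker threads, no more than two requests are in flight against the source at once, which trades restore speed for a smaller footprint in the throttling bucket.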

riarenas commented 5 years ago

According to AzDO telemetry, a large number of our AzDO build pool machines are being detected as the same machine, so they all land in the same throttling bucket. This makes the problem a lot worse for our BYOC pools than for the hosted pools.
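
A toy model of why this hurts: the sketch below assumes a concurrency limit keyed by the detected client identity, which matches the 'Concurrency' / 'IPAddress' wording in the 429 message but is not Azure DevOps' actual implementation. The class, the limit of 4, and the addresses are hypothetical.

```python
from collections import defaultdict


class ConcurrencyThrottle:
    """Toy per-key concurrency limit; not Azure DevOps' implementation."""

    def __init__(self, limit_per_key):
        self.limit = limit_per_key
        self.active = defaultdict(int)

    def try_acquire(self, key):
        """Return 200 if the request is admitted, 429 if the bucket is full."""
        if self.active[key] >= self.limit:
            return 429
        self.active[key] += 1
        return 200

    def release(self, key):
        """Mark one request for this key as finished."""
        if self.active[key] > 0:
            self.active[key] -= 1


def count_rejections(throttle, request_keys):
    """Issue one concurrent request per key and count the 429 responses."""
    return sum(1 for key in request_keys if throttle.try_acquire(key) == 429)


if __name__ == "__main__":
    # Six machines seen as six clients vs. six machines seen as one client.
    distinct = [f"10.0.0.{i}" for i in range(6)]
    shared = ["203.0.113.7"] * 6
    print("distinct identities:", count_rejections(ConcurrencyThrottle(4), distinct))
    print("shared identity:", count_rejections(ConcurrencyThrottle(4), shared))
```

With distinct identities every machine gets its own bucket and nothing is rejected; when all six collapse into one identity, they compete for the same four slots and the overflow gets 429s.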

riarenas commented 5 years ago

As a short-term solution, the concurrency limits for the dnceng instance have been increased, and it seems to have helped somewhat.

The Azure DevOps team has found that we are now hitting some other throttling limits, and are in the process of investigating.

The AzDO team is also considering some long-term solutions like filling the gaps that are blocking us from adopting upstream feeds, which would reduce the number of feeds we need to specify in our repos' NuGet.config, and looking at improvements with the NuGet team.

jashook commented 4 years ago

Runtime-coreclr outerloop hit this multiple times yesterday.

https://dev.azure.com/dnceng/public/_build/results?buildId=447821
https://dev.azure.com/dnceng/public/_build/results?buildId=447950
https://dev.azure.com/dnceng/public/_build/results?buildId=448056
https://dev.azure.com/dnceng/public/_build/results?buildId=448358
https://dev.azure.com/dnceng/public/_build/results?buildId=448386
https://dev.azure.com/dnceng/public/_build/results?buildId=448795

JohnTortugo commented 4 years ago

@riarenas - Is this still a thing?

riarenas commented 4 years ago

Yes. We have had some quota increases to help with this in the short term, but we haven't heard back with a long term solution.

riarenas commented 4 years ago

We've received additional reports of 429s during restore operations in attempt 1 of these builds:

https://dev.azure.com/dnceng/internal/_build/results?buildId=587047&_a=summary
https://dev.azure.com/dnceng/internal/_build/results?buildId=587046&_a=summary
https://dev.azure.com/dnceng/internal/_build%2Fresults?buildId=587045&_a=summary

I reached out in the thread we had with the azure artifacts group about throttling.

CC @wtgodbe

riarenas commented 4 years ago

The Azure Artifacts team said the problems on 4/3 were due to dnceng using 60% of the traffic for their scale units when they were completely scaled down. Additionally, yesterday we saw IP throttling come back in a lot more cases:

Using runfo I was able to find these from runtime, but we have additional reports from Roslyn, where the error was reported as a timeout instead of a build failure error.

| Build | Kind | Timeline Record |
| --- | --- | --- |
| 592529 | PR https://github.com/dotnet/runtime/pull/34666 | Build System.Private.CoreLib |
| 592529 | PR https://github.com/dotnet/runtime/pull/34666 | Build System.Private.CoreLib |
| 592488 | PR https://github.com/dotnet/runtime/pull/34519 | Build product |
| 592488 | PR https://github.com/dotnet/runtime/pull/34519 | Build managed product components and packages |
| 592488 | PR https://github.com/dotnet/runtime/pull/34519 | Build managed product components and packages |
| 592488 | PR https://github.com/dotnet/runtime/pull/34519 | Build managed product components and packages |
| 592482 | PR https://github.com/dotnet/runtime/pull/34054 | Restore and Build Product |
| 592482 | PR https://github.com/dotnet/runtime/pull/34054 | Restore and Build Product |
| 592437 | PR https://github.com/dotnet/runtime/pull/34522 | Restore and Build Product |
| 592437 | PR https://github.com/dotnet/runtime/pull/34522 | Build CoreCLR Runtime |
| 592437 | PR https://github.com/dotnet/runtime/pull/34522 | Restore and Build Product |
| 592437 | PR https://github.com/dotnet/runtime/pull/34522 | Restore and Build Product |
| 592437 | PR https://github.com/dotnet/runtime/pull/34522 | Build CoreCLR Runtime |
| 592437 | PR https://github.com/dotnet/runtime/pull/34522 | Restore and Build Product |
| 592404 | Rolling | Prepare TestHost with runtime CoreCLR |
| 592404 | Rolling | Build System.Private.CoreLib |
| 592404 | Rolling | Build product |
| 592404 | Rolling | Build System.Private.CoreLib |
| 592404 | Rolling | Build managed product components and packages |
| 592417 | PR https://github.com/dotnet/runtime/pull/34663 | Restore and Build Product |
| 592417 | PR https://github.com/dotnet/runtime/pull/34663 | Build managed product components and packages |
| 592417 | PR https://github.com/dotnet/runtime/pull/34663 | Restore and Build Product |
| 592415 | PR https://github.com/dotnet/runtime/pull/34665 | Build managed product components and packages |
| 592415 | PR https://github.com/dotnet/runtime/pull/34665 | Build System.Private.CoreLib |
| 592415 | PR https://github.com/dotnet/runtime/pull/34665 | Build managed product components and packages |
| 592105 | PR https://github.com/dotnet/runtime/pull/34654 | Build product |
| 592105 | PR https://github.com/dotnet/runtime/pull/34654 | Build product |
| 592105 | PR https://github.com/dotnet/runtime/pull/34654 | Restore and Build |
| 592105 | PR https://github.com/dotnet/runtime/pull/34654 | Build System.Private.CoreLib |
| 592105 | PR https://github.com/dotnet/runtime/pull/34654 | Build System.Private.CoreLib |
| 592105 | PR https://github.com/dotnet/runtime/pull/34654 | Build System.Private.CoreLib |
| 592194 | PR https://github.com/dotnet/runtime/pull/34658 | Restore and Build Product |
| 592194 | PR https://github.com/dotnet/runtime/pull/34658 | Restore blob feed tasks |
| 592194 | PR https://github.com/dotnet/runtime/pull/34658 | Build managed product components and packages |
| 592187 | PR https://github.com/dotnet/runtime/pull/34518 | Build System.Private.CoreLib |
| 592187 | PR https://github.com/dotnet/runtime/pull/34518 | Restore and Build Product |
| 592313 | PR https://github.com/dotnet/runtime/pull/34662 | Restore and Build Product |
| 592313 | PR https://github.com/dotnet/runtime/pull/34662 | Restore and Build Product |
| 592295 | PR https://github.com/dotnet/runtime/pull/34661 | Restore and Build Product |
| 592295 | PR https://github.com/dotnet/runtime/pull/34661 | Restore and Build Product |
| 592380 | PR https://github.com/dotnet/runtime/pull/34664 | Build product |
| 592380 | PR https://github.com/dotnet/runtime/pull/34664 | Build System.Private.CoreLib |
| 592380 | PR https://github.com/dotnet/runtime/pull/34664 | Build System.Private.CoreLib |
| 592252 | PR https://github.com/dotnet/runtime/pull/34659 | Build managed product components and packages |
| 592272 | PR https://github.com/dotnet/runtime/pull/34521 | Build product |
| 592272 | PR https://github.com/dotnet/runtime/pull/34521 | Build product |
| 592272 | PR https://github.com/dotnet/runtime/pull/34521 | Restore and Build Product |
| 592272 | PR https://github.com/dotnet/runtime/pull/34521 | Build product |
| 592272 | PR https://github.com/dotnet/runtime/pull/34521 | Restore and Build Product |
| 592192 | PR https://github.com/dotnet/runtime/pull/34274 | Build |
| 592192 | PR https://github.com/dotnet/runtime/pull/34274 | Build System.Private.CoreLib |
| 592192 | PR https://github.com/dotnet/runtime/pull/34274 | Build |
| 592052 | PR https://github.com/dotnet/runtime/pull/34432 | Restore and Build Product |
| 592052 | PR https://github.com/dotnet/runtime/pull/34432 | Build managed product components and packages |
| 592040 | PR https://github.com/dotnet/runtime/pull/33902 | Restore and Build Product |
| 592040 | PR https://github.com/dotnet/runtime/pull/33902 | Restore and Build Product |
| 592080 | Rolling | Build product |
| 592080 | Rolling | Build product |
| 592080 | Rolling | Build product |
| 592080 | Rolling | Restore and Build Product |
| 592037 | PR https://github.com/dotnet/runtime/pull/34652 | Build |
| 592037 | PR https://github.com/dotnet/runtime/pull/34652 | Restore and Build Product |
| 592037 | PR https://github.com/dotnet/runtime/pull/34652 | Restore and Build Product |
| 592032 | PR https://github.com/dotnet/runtime/pull/34651 | Restore and Build Product |
| 591984 | PR https://github.com/dotnet/runtime/pull/33733 | Build managed test components |
| 591984 | PR https://github.com/dotnet/runtime/pull/33733 | Build |
| 591984 | PR https://github.com/dotnet/runtime/pull/33733 | Build |
| 591984 | PR https://github.com/dotnet/runtime/pull/33733 | Build |
| 592074 | PR https://github.com/dotnet/runtime/pull/34211 | Build managed product components and packages |
| 592074 | PR https://github.com/dotnet/runtime/pull/34211 | Restore and Build Product |
| 592074 | PR https://github.com/dotnet/runtime/pull/34211 | Restore and Build Product |
| 592074 | PR https://github.com/dotnet/runtime/pull/34211 | Restore and Build Product |
| 592074 | PR https://github.com/dotnet/runtime/pull/34211 | Build System.Private.CoreLib |
| 592061 | PR https://github.com/dotnet/runtime/pull/34650 | Restore and Build Product |
| 592061 | PR https://github.com/dotnet/runtime/pull/34650 | Restore and Build Product |
| 592061 | PR https://github.com/dotnet/runtime/pull/34650 | Restore and Build Product |
| 592061 | PR https://github.com/dotnet/runtime/pull/34650 | Build managed product components and packages |
| 592061 | PR https://github.com/dotnet/runtime/pull/34650 | Build managed product components and packages |
| 592061 | PR https://github.com/dotnet/runtime/pull/34650 | Restore and Build Product |
| 592402 | PR https://github.com/dotnet/roslyn/pull/43152 | Build |

riarenas commented 4 years ago

I asked the artifacts team for increased quota as we're ramping up usage of the feeds.

safern commented 4 years ago

@riarenas thanks for looking into this, do we have an ETA?

Would it make sense to bring the dotnet blob feed back as a restore source in the meantime?

riarenas commented 4 years ago

No ETA.

I'll create PRs to re-add dotnet-core as a backup if we don't hear from them soon.

safern commented 4 years ago

Thanks @riarenas

riarenas commented 4 years ago

AzDO folks have increased our limits. I'll keep this in FR for a bit to see if this gives us some relief, and move it back to general tracking afterwards, as our feed usage is only going to increase in the near term. (we haven't moved ASPNet or Installer to relying entirely on azdo feeds yet)

riarenas commented 4 years ago

The new limits seem to have stuck. Haven't seen any more throttling during restore since the limits were raised. I'll remove this from FR.

The AzDO team said they are evaluating more sustainable options to handle our load. I'll keep this open because I think if we onboarded another big repo to only using azdo feeds, we'd start seeing this again.

nattress commented 4 years ago

I'm seeing this today in the runtime CI:

https://dev.azure.com/dnceng/public/_build/results?buildId=623702&view=logs&jobId=a21a5f2c-b5cf-50e3-b899-59e91a62c520&j=240def99-ab66-58b8-7d7e-e9b53327ef3c&t=0d65e4e8-9786-5617-5520-c80e454608a0

riarenas commented 4 years ago

We have reached out to the Azure Artifacts team again for options: https://github.com/dotnet/core-eng/issues/9681

markwilkie commented 4 years ago

Ok - AzDO is saying they've fixed it (for real) now.

riarenas commented 4 years ago

After the recent AzDO changes, it doesn't look like we'll be easily throttled again, so I don't think there's much value in keeping this long-standing issue open anymore. We can open new issues for any sporadic throttling we see.