Closed ulisesh closed 1 year ago
Issue on the NuGet side: https://github.com/NuGet/Home/issues/10558
Created https://portal.microsofticm.com/imp/v3/incidents/details/344927997/home to ask for concurrency limit increases
The IcM got us a 2x concurrency bump, and there have been no new instances since 12:25 PDT so far. Will keep an eye on it.
@dougbu noted this hit again yesterday here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=72672&view=logs&jobId=b72e85ab-3386-5aa9-6405-3837662d9688&j=8e9e2812-09db-5a5d-c249-d96adf9f47fa&t=51e8c4b1-e760-5f88-6965-6fecd74b90ed&s=6884a131-87da-5381-61f3-d7acc3b91d76
What we're doing:
/fyi @dotnet/aspnet-build
I just created this PR to actually retry on HTTP 429 responses. But it doesn't (yet) honor the Retry-After header, so I'm not sure if it'll actually make things worse 🤷
Anyway, if anyone would like to provide feedback: https://github.com/NuGet/NuGet.Client/pull/4897
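For illustration only, here is a minimal Python sketch of what retrying on 429 while honoring Retry-After could look like. It is not the code in the PR above. `send`, `get_with_retry`, `parse_retry_after`, and the default attempt count and base delay are all hypothetical names and values chosen for the example. When the header is absent or unparseable, the sketch falls back to exponential backoff.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the delay in seconds requested by a Retry-After value, or None.

    Retry-After is either a number of seconds or an HTTP date.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: caller falls back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def get_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical callable returning (status_code, headers_dict).
    """
    status = None
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429 or attempt == max_attempts - 1:
            return status
        delay = parse_retry_after(headers.get("Retry-After"))
        if delay is None:
            delay = base_delay * (2 ** attempt)  # exponential backoff fallback
        sleep(delay)
    return status
```

Waiting for at least the server-requested delay is what keeps a retry loop from adding to the load that caused the 429 in the first place.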
I had another thought. From reading the docs on nuget.config's maxHttpRequestsPerSource setting, it would appear that dotnet restore will not throttle requests to a source at all, given that HttpClient in netcoreapp runtimes doesn't throttle connections, unlike the .NET Framework runtime. However many packages NuGet needs to check in the restore graph in parallel, it will theoretically send out the same number of requests per source.
Therefore, if dotnet/* repos set maxHttpRequestsPerSource in their nuget.config, this may mitigate the traffic individual CI builds send. If tests are run from a subdirectory under the repo root, then the nuget.config will apply to the tests as well, but if tests run elsewhere (like %temp%), then additional effort needs to be made to have the tests also throttle requests (keeping in mind it will be per process, just in case tests shell out to dotnet restore or dotnet build in parallel).
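As a rough model of what a per-source cap changes, the sketch below simulates one process sending requests to a single source with and without a concurrency limit. It is a toy illustration using an asyncio semaphore, not NuGet's implementation. `fetch_all`, the package names, and the cap of 8 are assumptions made up for the example.

```python
import asyncio


async def fetch_all(package_ids, max_requests_per_source):
    """Simulate requests to one source under a concurrency cap.

    Returns the peak number of requests that were in flight at once.
    """
    gate = asyncio.Semaphore(max_requests_per_source)
    in_flight = 0
    peak = 0

    async def fetch(package_id):
        nonlocal in_flight, peak
        async with gate:  # at most max_requests_per_source pass at a time
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the HTTP round trip
            in_flight -= 1

    await asyncio.gather(*(fetch(p) for p in package_ids))
    return peak


packages = [f"Package.{i}" for i in range(50)]
# Cap equal to the number of packages: effectively no throttling.
unthrottled_peak = asyncio.run(fetch_all(packages, len(packages)))
# Hypothetical cap of 8 concurrent requests to the source.
throttled_peak = asyncio.run(fetch_all(packages, 8))
print(unthrottled_peak, throttled_peak)
```

The same number of requests is sent in both runs; only the peak concurrency differs. The cap is also per process, so several restores running in parallel would each get their own limit.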
@zivkan This would be an interesting experiment to run. Is there telemetry that could be enabled that would help illuminate the quantity of traffic from NuGet operations?
System.Net.Http (including HttpClient) uses EventSource, and in the future (but not yet, unfortunately) NuGet will as well. So on Windows anything that can record ETW events, like PerfView, can collect the information. Docs say that on .NET Core EventSource uses EventPipe, which supports LTTng, as well as tools like dotnet-counters and dotnet-trace, but I have zero experience with these.
But NuGet doesn't have any relevant telemetry (we only have telemetry in VS, nothing on command line. Even in VS, we need to limit telemetry quantity, so logging every HTTP request is out of the question, only aggregated telemetry is feasible).
I think you'd just be interested in raw totals:
And perhaps total time spent in NuGet related tasks during restore? That would give you a good set of base statistics to determine whether improvements were being made without regressing perf.
With a binlog, or a text restore log at normal verbosity, we could calculate the number of GETs and 404s with a grep for GET http:// and NotFound http://.
However, I wouldn't expect that maxHttpRequestsPerSource (throttling) would have much of an impact on the total number of requests. It's more about how quickly a single process will hammer a server.
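The grep for GET and NotFound lines described above could be sketched in Python as follows. This assumes the restore log contains lines with "GET <url>" and "NotFound <url>" as described; the exact line shape, and the choice to match both http:// and https://, are assumptions. `count_requests` is a hypothetical helper name.

```python
import re

# Match the request and 404 lines anywhere in a line, so that a CI
# timestamp prefix in front of the message does not break the match.
GET_RE = re.compile(r"\bGET https?://")
NOT_FOUND_RE = re.compile(r"\bNotFound https?://")


def count_requests(log_lines):
    """Return (number of GET lines, number of NotFound lines)."""
    gets = 0
    not_found = 0
    for line in log_lines:
        if GET_RE.search(line):
            gets += 1
        elif NOT_FOUND_RE.search(line):
            not_found += 1
    return gets, not_found
```

Called as `count_requests(open("restore.log", encoding="utf-8"))`, with a placeholder file name, it gives the raw totals for a single build log.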
If restore is run as a separate step in CI, then CI logs should tell us how long the restore was. Otherwise, if CI logs the timestamp (to sufficient accuracy) for each line, we could look at the time difference between the "Determining projects to restore" message and the last "Restored c:\path\to\my.csproj" message.
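A minimal sketch of that time-difference calculation, assuming each log line starts with a UTC timestamp such as 2022-09-20T19:25:31.1234567Z followed by whitespace. That prefix format is an assumption about the CI log, not something confirmed in this thread, and `parse_timestamp` and `restore_duration` are hypothetical helper names.

```python
import re
from datetime import datetime, timezone

# Assumed line prefix: ISO-8601 UTC timestamp, optional fraction, then "Z".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z\s")


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if absent."""
    m = TS_RE.match(line)
    if not m:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    # Truncate the fraction to microseconds (datetime's resolution).
    frac = (m.group(2) or "")[:6].ljust(6, "0")
    return base.replace(microsecond=int(frac), tzinfo=timezone.utc)


def restore_duration(log_lines):
    """Seconds between the first "Determining projects to restore"
    line and the last "Restored ..." line, or None if either is missing."""
    start = end = None
    for line in log_lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue
        if start is None and "Determining projects to restore" in line:
            start = ts
        elif start is not None and "Restored " in line:
            end = ts
    if start is None or end is None:
        return None
    return (end - start).total_seconds()
```

If a build runs several restores, this measures from the first start message to the last "Restored" line overall, so it would need to be split per restore step in that case.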
We can write (and I have in the past written) a program that uses AzDO's REST APIs to download build info, including build logs and build artifacts. But as previously mentioned, we don't have any command line telemetry. So it's not super easy to query or do the analysis, and certainly nothing "at scale" (across multiple repos/build pipelines).
As it has been a week since we last saw this issue, we believe it has been addressed.
Build
https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=12817
Build leg reported
Microsoft.NET.Build.Tests.dll.1.WorkItemExecution
Pull Request
https://github.com/dotnet/sdk/pull/27821
Action required for the engineering services team
To triage this issue (First Responder / @dotnet/dnceng):
If this is an issue that is causing build breaks across multiple builds and would get benefit from being listed on the build analysis check, follow the next steps:
Edit this issue and add an error string in the Json below that can help us match this issue with future build breaks. You should use the known issues documentation
Tracking Nuget 429s
Additional information about the issue reported
No response