NuGet / Home

Repo for NuGet Client issues
Other
1.49k stars 250 forks source link

dotnet restore takes too much time (docker dotnet/sdk:8.0) #13062

Closed mcanzerini closed 8 months ago

mcanzerini commented 9 months ago

NuGet Product Used

dotnet.exe

Product Version

docker dotnet/sdk 8.0

Worked before?

docker dotnet/sdk 7.0

Impact

I'm unable to use this version

Repro Steps & Context

The Bug

With dotnet/sdk:8.0, dotnet restore in a container takes way too much time (45 minutes to restore a project).

This only happens in my gitlab ci, not in local. It is probably due to limited resources available for the gitlab runner. The ops in charge of the runner tells my that the CPU usage was above 100% during the long restore.

$ dotnet restore --no-cache
Restored /builds/**/A.csproj (in 10.11 min).
Restored /builds/**/B.csproj (in 14.13 min).
Restored /builds/**/C.csproj (in 2.03 min).
Restored /builds/**/D.csproj (in 36.08 min).

With dotnet/sdk:7.0, the project is restored in 45 seconds.

See below the logs if I add the option -v diagnostic. The process is extermely long during this kind of operations.

Verbose Logs

...
Acquiring lock for the installation of System.Diagnostics.Tracing 4.3.0 (TaskId:91)
Acquired lock for the installation of System.Diagnostics.Tracing 4.3.0 (TaskId:91)
Installed Xamarin.UITest 4.2.0 from https://my.artifactory/artifactory/api/nuget/nuget/ with content hash <xxx>. (TaskId:91)
Acquiring lock for the installation of System.Globalization 4.3.0 (TaskId:91)
Acquired lock for the installation of System.Globalization 4.3.0 (TaskId:91)
Assembly loaded during TaskRun (NuGet.Build.Tasks.RestoreTask): System.Diagnostics.StackTrace, Version=8.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a (location: /usr/share/dotnet/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.StackTrace.dll, MVID: 9f2e188b-5afc-4c69-84d1-4627254ba14f, AppDomain: [Default]) (TaskId:91)
...
zivkan commented 9 months ago

@mcanzerini I'm sorry the experience has gotten so much worse for you.

You only provided a small snippet of the verbose logs, so I can't tell if there are any HTTP communication issues. We had another customer report that their nexus proxy doesn't handle nuget.org's vulnerability info resource, causing HTTP timeouts because nuget.org is blocked, so all NuGet's requests and retries time out, causing long delays. You appear to be using Artifactory, but I don't know how it works, or if it has a similar Vulnerability Info resource issue. You could try disabling NuGet Audit to see if that resolves the slowdown, but without more detailed logs, I can't say whether this appear to be the problem or not. Although the 100% CPU usage makes it sound unlikely.

It is probably due to limited resources available for the gitlab runner.

How resource limited are the CI agents? How many CPU cores and how much RAM?

One new feature in .NET 8 is NuGet Audit, so you could try disabling to see if that helps. Audit needs to load a vulnerabilities database. I'm not sure how big the in-memory size is, but the JSON file is 275KB. So, unless your CI agent is extremely RAM limited, it doesn't seem likely to me that Audit is exhausting all RAM, causing the OS to do a lot of disk swapping.

The ops in charge of the runner tells my that the CPU usage was above 100% during the long restore.

I consider high CPU usage to be normal. Downloading packages is only a small part of what restore does, and almost everything else is CPU limited.

If you're able to use dotnet-trace to collect performance information about both the .NET 7 and .NET 8 restores, we can compare to get the best idea about why there's a difference.

Otherwise, I assume your solution uses private packages on your artifactory feed. Are you able to create a sample solution that only uses nuget.org, that also has the same performance problem? If you can zip up that sample and send it to us, that will make it much easier to investigate.

mcanzerini commented 9 months ago

Thanks for your response.

Even if I did not have logs about timeout during vulnerability audit (unlike the issue author you mentionned), I have disabled the NuGet Audit in all my .csproj with <NuGetAudit>false</NuGetAudit> and this has no effect on the bad performance of the dotnet restore.

I will consider to try dotnet-trace to give you some others insights.

To my mind, like you guess, this is not an issue about memory.

In the issue you mentionned, the author still has the problem without the vulnerability audit. So I presumed our problems are similar.

Thanks for your time

n00j commented 8 months ago

We are seeing the same thing as well. Inside wsl -> a .net restore + build takes 20 seconds Inside docker in wsl -> 20+ minutes

In our nuget.config we just point to our internal nuget proxy.

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="BBNugetV3" value="<url_to_internal_nuget_proxy" protocolVersion="3" />
  </packageSources>
</configuration>
zivkan commented 8 months ago

@n00j Can you create a sample solution that only uses nuget.org as a package source, and reproduce the problem? Can use capture a trace with dotnet-trace (ideally both with the .NET 7 and .NET 8 SDKs, so we can compare), and share the traces with us?

n00j commented 8 months ago

@n00j Can you create a sample solution that only uses nuget.org as a package source, and reproduce the problem? Can use capture a trace with dotnet-trace (ideally both with the .NET 7 and .NET 8 SDKs, so we can compare), and share the traces with us?

I'll do this with .NET 6 (we only use LTS releases internally) and .NET 8.

n00j commented 8 months ago

.NET 6 Example Project: example_net6.zip Trace: dotnet6_trace.zip

.NET 8 Example Project: example_net8.zip Trace (have to upload in multiple parts due to 25mb limit): dotnet8_trace_1.zip -> rename to dotnet8_trace.zip.001 dotnet8_trace_2.zip -> rename to dotnet8_trace.zip.002 then extract

The restore was run inside our isolated network which only has access to our nuget proxy server. I could not reproduce the slowness with a restore that had full access to the nuget.org upstream. Maybe there are some additional network requests that are being performed in .NET 8 nuget restore that needs access to nuget.org?

edit: I ran the same restore inside the .net 7 container (and changed the csproj to use net7.0 target) and that restore finished in 45 seconds.

edit2: I do see this in the .net 8 logs, which I don't think i saw in the other versions

         Resolving conflicts for net8.0...
         Acquiring lock for the installation of System.Text.Encoding 4.3.0
         Acquired lock for the installation of System.Text.Encoding 4.3.0
         Acquiring lock for the installation of System.Threading.Tasks 4.3.0
         Acquired lock for the installation of System.Threading.Tasks 4.3.0
         Acquiring lock for the installation of System.Runtime.Handles 4.3.0
         Acquired lock for the installation of System.Runtime.Handles 4.3.0
         Acquiring lock for the installation of System.Reflection.Emit.Lightweight 4.3.0
         Acquired lock for the installation of System.Reflection.Emit.Lightweight 4.3.0
         Acquiring lock for the installation of System.IO 4.3.0
         Acquired lock for the installation of System.IO 4.3.0
         Acquiring lock for the installation of System.Reflection.Emit.ILGeneration 4.3.0
         Acquired lock for the installation of System.Reflection.Emit.ILGeneration 4.3.0
         Acquiring lock for the installation of System.Reflection.Emit 4.3.0
         Acquired lock for the installation of System.Reflection.Emit 4.3.0
         Acquiring lock for the installation of Microsoft.NETCore.Targets 1.1.0
         Acquired lock for the installation of Microsoft.NETCore.Targets 1.1.0
         Acquiring lock for the installation of Microsoft.NETCore.Platforms 1.1.0
         Acquired lock for the installation of Microsoft.NETCore.Platforms 1.1.0
         Acquiring lock for the installation of System.Threading 4.3.0
         Acquired lock for the installation of System.Threading 4.3.0
         Acquiring lock for the installation of System.Runtime.InteropServices 4.3.0
         Acquired lock for the installation of System.Runtime.InteropServices 4.3.0
         Acquiring lock for the installation of System.Runtime.Extensions 4.3.0
         Acquired lock for the installation of System.Runtime.Extensions 4.3.0
         Acquiring lock for the installation of System.Runtime 4.3.0
         Acquired lock for the installation of System.Runtime 4.3.0
         Acquiring lock for the installation of System.Resources.ResourceManager 4.3.0
         Acquired lock for the installation of System.Resources.ResourceManager 4.3.0
         Acquiring lock for the installation of System.Reflection.TypeExtensions 4.3.0
         Acquired lock for the installation of System.Reflection.TypeExtensions 4.3.0
         Acquiring lock for the installation of System.Reflection.Primitives 4.3.0
         Acquired lock for the installation of System.Reflection.Primitives 4.3.0

edit3: If i set http_proxy, https_proxy, no_proxy so that the restore could reach the internet if it was trying to reach out, the restore finishes very quickly (under a minute) in .NET 8 while still only using our internal nuget proxy. So based on this, I think something is trying to reach out during the restore and is potentially timing out and eventually restoring successfully (not a hard failure).

edit4: Ran the restore inside the docker container with only access to our internal nuget proxy, and with that as the only source in the nuget.config. I also installed netstat and ran netstat -c while running the restore. Is edgecastcdn used by nuget.org? I might be just going down an unrelated rabbit hole at this point 🤷.

Proto RefCnt Flags       Type       State         I-Node   Path
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp6       0      1 198.18.0.2:56156        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:59784        192.229.211.108:80      SYN_SENT
tcp6       0      1 198.18.0.2:56132        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:56098        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:56170        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:56186        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:56084        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:56114        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:56126        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:56192        152.195.50.149:80       SYN_SENT
tcp6       0      1 198.18.0.2:56148        152.195.50.149:80       SYN_SENT

152.195.50.149 -> ns1.edgecastcdn.net

dig -x 152.195.50.149                                                                                                                     in bash at 19:47:16

; <<>> DiG 9.16.1-Ubuntu <<>> -x 152.195.50.149
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 30101
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 806f620267359ac101000000656feff03b5315f3cc7d5111 (good)
;; QUESTION SECTION:
;149.50.195.152.in-addr.arpa.   IN      PTR

;; AUTHORITY SECTION:
50.195.152.in-addr.arpa. 159    IN      SOA     ns1.edgecastcdn.net. noc.edgecast.com. 1589352535 3600 600 604800 600

;; Query time: 169 msec
;; SERVER: 172.26.240.1#53(172.26.240.1)
;; WHEN: Tue Dec 05 19:52:16 PST 2023
;; MSG SIZE  rcvd: 155

192.229.211.108 -> ns1.edgecastcdn.net

dig -x 192.229.211.108                                                                                                                    in bash at 19:52:16

; <<>> DiG 9.16.1-Ubuntu <<>> -x 192.229.211.108
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 45460
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 201d06c7642234f501000000656ff0196f990f200bf22e34 (good)
;; QUESTION SECTION:
;108.211.229.192.in-addr.arpa.  IN      PTR

;; AUTHORITY SECTION:
211.229.192.in-addr.arpa. 93    IN      SOA     ns1.edgecastcdn.net. noc.edgecast.com. 1589336042 3600 600 604800 600

;; Query time: 169 msec
;; SERVER: 172.26.240.1#53(172.26.240.1)
;; WHEN: Tue Dec 05 19:52:57 PST 2023
;; MSG SIZE  rcvd: 156

I also see requests to 20.42.73.138 and 40.78.229.32

zivkan commented 8 months ago

@n00j thank you for the traces! It was really helpful, and so often customers don't provide us with this kind of information to investigate. I really appreciate that you captured and shared it.

It appears the issue is that NuGet enabled signed package verification on Linux by default in .NET 8, whereas it was off by default (so, opt-in) prior. I wasn't aware of this change. I actually thought it was on by default from 6.0.4xx, but the docs say otherwise: https://learn.microsoft.com/en-us/dotnet/core/tools/nuget-signed-package-verification#linux

You can turn signed package verification off completely by setting the environment variable DOTNET_NUGET_SIGNATURE_VERIFICATION to false, as the docs say. Alternatively you can try setting the environment variable NUGET_CERT_REVOCATION_MODE to offline, which should still do signature checks for the signatures saved inside the package, but it won't reach out to Certificate Revocation Lists to see if the certificate was valid at the time of the signature.

Both of these will reduce your supply chain security, so it's a risk decision you, or your security team, need to make. In order to keep all the signature verification, your network needs to allow outgoing requests to effectively any IP address. In practise it's a small number of IP addresses. Unfortunately I don't know a way to find the URLs. dotnet nuget verify -v d path/to/package.nupkg will list all the author and repository signatures, and their certificate chains. However, it doesn't appear to list the certificate revocation list URL (not all certificates have a CRL).

Also, I'm not a security expert, but I don't understand why your company's firewall would be configured to drop outgoing traffic, rather than "reject". At least if the firewall rejects, then apps like NuGet will get quick feedback that the TCP connection isn't going to work, rather than needing to time out. If you can influence this change, then non-fatal errors like this one wouldn't have affected you (or your CI duration).

zivkan commented 8 months ago

Maybe there are some additional network requests that are being performed in .NET 8 nuget restore that needs access to nuget.org?

As mentioned in my previous message, it turns out that NuGet is making additional requests that aren't default in the .NET 7 or .NET 6 SDKs, but it's not to nuget.org, it's to whatever URL that code signing certificates specify for their Certificate Revocation List (CRL).

I do see this in the .net 8 logs, which I don't think i saw in the other versions

This lock messages have been in NuGet for 6 years, without changing verbosity, so they should have been there earlier: https://github.com/NuGet/NuGet.Client/blame/a59e64507383b64bcfbe9bf63b34aca946ab0da9/src/NuGet.Core/NuGet.Packaging/PackageExtractor.cs#L399

Is edgecastcdn used by nuget.org?

Yes. nuget.org has used Verizon's CDN for many years (and more recently Windows Azure CDN in addition), and Verizon's CDN appears to have been renamed edgecast. If your company firewall, or proxy, is trying to create an allow-list of hostnames, it will be difficult, indeed.

On my (WSL2) machine:

$ host api.nuget.org
api.nuget.org is an alias for nugetapiprod.trafficmanager.net.
nugetapiprod.trafficmanager.net is an alias for apiprod-mscdn.azureedge.net.
apiprod-mscdn.azureedge.net is an alias for apiprod-mscdn.afd.azureedge.net.
apiprod-mscdn.afd.azureedge.net is an alias for star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net is an alias for shed.dual-low.part-0014.t-0009.t-msedge.net.
shed.dual-low.part-0014.t-0009.t-msedge.net is an alias for part-0014.t-0009.t-msedge.net.
part-0014.t-0009.t-msedge.net has address 13.107.246.42
part-0014.t-0009.t-msedge.net has address 13.107.213.42
part-0014.t-0009.t-msedge.net has IPv6 address 2620:1ec:bdf::42
part-0014.t-0009.t-msedge.net has IPv6 address 2620:1ec:46::42

So, my physical location is using Azure CDN for nuget.org, rather than Edgecast/Verizon, but you can see that NuGet is ultimately going to make HTTP requests to part-0014.t-0009.t-msedge.net for nuget.org.

n00j commented 8 months ago

@n00j thank you for the traces! It was really helpful, and so often customers don't provide us with this kind of information to investigate. I really appreciate that you captured and shared it.

It appears the issue is that NuGet enabled signed package verification on Linux by default in .NET 8, whereas it was off by default (so, opt-in) prior. I wasn't aware of this change. I actually thought it was on by default from 6.0.4xx, but the docs say otherwise: https://learn.microsoft.com/en-us/dotnet/core/tools/nuget-signed-package-verification#linux

You can turn signed package verification off completely by setting the environment variable DOTNET_NUGET_SIGNATURE_VERIFICATION to false, as the docs say. Alternatively you can try setting the environment variable NUGET_CERT_REVOCATION_MODE to offline, which should still do signature checks for the signatures saved inside the package, but it won't reach out to Certificate Revocation Lists to see if the certificate was valid at the time of the signature.

Thank you! I believe this was it. Setting DOTNET_NUGET_SIGNATURE_VERIFICATION to false reduced the restore time down to under a minute.

Hopefully this also will solve @mcanzerini issue

Both of these will reduce your supply chain security, so it's a risk decision you, or your security team, need to make. In order to keep all the signature verification, your network needs to allow outgoing requests to effectively any IP address. In practise it's a small number of IP addresses. Unfortunately I don't know a way to find the URLs. dotnet nuget verify -v d path/to/package.nupkg will list all the author and repository signatures, and their certificate chains. However, it doesn't appear to list the certificate revocation list URL (not all certificates have a CRL).

Yup, I understand. Unfortunately, in our official build environment it doesn't not have external internet access, so we will need to accept this additional risk.

Also, I'm not a security expert, but I don't understand why your company's firewall would be configured to drop outgoing traffic, rather than "reject". At least if the firewall rejects, then apps like NuGet will get quick feedback that the TCP connection isn't going to work, rather than needing to time out. If you can influence this change, then non-fatal errors like this one wouldn't have affected you (or your CI duration).

I'm not exactly sure why its done this way, it just seems like the firewall is a black hole. Thanks for the information, appreciate the very quick response!

mcanzerini commented 8 months ago

Thanks a lot for your work @n00j and @zivkan! I will try today the suggested variable

mcanzerini commented 8 months ago

I can confirm setting DOTNET_NUGET_SIGNATURE_VERIFICATION to false fixes the long restore. I close the issue. Thanks for the support guys.

chau2603 commented 7 months ago

For anyone that came later, add this line in dockerfile: ENV DOTNET_NUGET_SIGNATURE_VERIFICATION=false

sreddy1607 commented 2 months ago

Hi , when i am trying to deploy msbuild artifacts into nexus 3 , i am getting some ssl certificate issues for nuget commmands . Any leads ?

sreddy1607 commented 2 months ago

An error was encountered when fetching 'PUT https://nexusrepo-tools.apps.bld.cammis.medi-cal.ca.gov/repository/nuget-hosted/'. The request will now be retried. The SSL connection could not be established, see inner exception. Authentication failed, see inner exception. Ssl error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED at /builddir/build/BUILD/mono-6.12.0.107/external/boringssl/ssl/handshake_client.c:1132 PUT https://nexusrepo-tools.apps.bld.cammis.medi-cal.ca.gov/repository/nuget-hosted/ An error was encountered when fetching 'PUT https://nexusrepo-tools.apps.bld.cammis.medi-cal.ca.gov/repository/nuget-hosted/'. The request will now be retried. The SSL connection could not be established, see inner exception. Authentication failed, see inner exception. Ssl error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED at /builddir/build/BUILD/mono-6.12.0.107/external/boringssl/ssl/handshake_client.c:1132 PUT https://nexusrepo-tools.apps.bld.cammis.medi-cal.ca.gov/repository/nuget-hosted/ The SSL connection could not be established, see inner exception. Authentication failed, see inner exception. Ssl error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED at /builddir/build/BUILD/mono-6.12.0.107/external/boringssl/ssl/handshake_client.c:1132

halilbulentorhon commented 1 month ago

I had the same problem this solved my problem Dockerfile:

FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:8.0 AS build
ARG TARGETARCH
WORKDIR /app

COPY *.csproj ./
ENV DOTNET_NUGET_SIGNATURE_VERIFICATION=false
RUN dotnet restore -a $TARGETARCH

COPY . .
RUN dotnet publish -c Release -o out -a $TARGETARCH

FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS runtime
WORKDIR /app
COPY --from=build /app/out ./

EXPOSE 5241
ENTRYPOINT ["dotnet", "AuthAPI.dll"]

To get a build you can just use the --platform argument in docker build. To build an image for linux/amd64 just use the following build command:

docker build --platform linux/amd64 .
aurora commented 2 weeks ago

In my case it is not a big win - from 3,5 minutes to 2,5 - I also wonder if it's really a good idea to turn off such a security function 🤔