lewing opened this issue 2 months ago
https://github.com/dotnet/aspnetcore/issues/57416 but for everybody
cc @jeffschwMSFT
We suspect https://github.com/dotnet/sdk/issues/42870 is another manifestation of the same problem
This seems to be an occurrence of the System.Security.Cryptography.X509Certificates test failures (dotnet/runtime#70320; also dotnet/runtime#20708, dotnet/runtime#74838, and dotnet/runtime#83226) that we occasionally encounter as a known build error. It is an external issue with Windows that occurs when the "CNG Key Isolation" Windows service has crashed or been stopped.
The only known way to resolve it is to reboot the affected machine, as the underlying service may die and fail to restart when the host is severely resource starved.
@jeffhandley this is failing more than 30 times a day in CI right now; who can help with that, and how? The existing systems aren't working. There are multiple open issues across multiple repos and an active First Responders thread.
Out of curiosity I opened four random logs, and they ran on three different machines, so it seems unlikely this is machine specific. The only thing they had in common was that they all used "Windows-10-10.0.20348-SP0"; possibly there is something special about that build. Could someone from the infra team check whether there were any recent updates (e.g., the Aug 13/14 update) before it started happening more frequently?
It may be true that once we hit this, the machine keeps failing until a reboot, but something is definitely triggering the failure in the first place.
What I see in common about the stack trace is:
at System.Security.Cryptography.CngHelpers.OpenStorageProvider(CngProvider provider)
at System.Security.Cryptography.CngKey.Import(ReadOnlySpan`1 keyBlob, String curveName, CngKeyBlobFormat format, CngProvider provider)
at System.Security.Cryptography.RSACng.ImportKeyBlob(ReadOnlySpan`1 rsaBlob, Boolean includePrivate)
at System.Security.Cryptography.RSACng.ImportParameters(RSAParameters parameters)
at System.Security.Cryptography.X509Certificates.CertificatePal.CopyWithPrivateKey(RSA rsa)
at System.Security.Cryptography.X509Certificates.RSACertificateExtensions.CopyWithPrivateKey(X509Certificate2 certificate, RSA privateKey)
at System.Security.Cryptography.X509Certificates.CertificateRequest.CreateSelfSigned(DateTimeOffset notBefore, DateTimeOffset notAfter)
I'm also seeing some of those tests call CreateSelfSigned from a static constructor, so it may be related (it might be worth moving that logic out of the static constructor in ASP.NET and seeing if that changes anything; my theory is that something could get partially initialized).
To me, one of the following sounds most likely:
Is it possible that we aren't disposing something (that CreateSelfSigned in the static constructor), and that thing accumulates resources in this external service? If it's happening more often, we might be able to get a repro more easily.
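To make the disposal hypothesis concrete, here is a minimal hypothetical sketch (not the actual ASP.NET test code; the subject name and method name are illustrative) of creating a self-signed certificate with explicit disposal, rather than caching one produced by a static constructor:

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

static X509Certificate2 CreateTestCertificate()
{
    // Dispose the RSA key once the certificate holds its own copy.
    using RSA rsa = RSA.Create(2048);
    var request = new CertificateRequest(
        "CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);

    // CreateSelfSigned is the call at the top of the failing stack; on
    // Windows it imports the key through CNG, which is where the
    // "CNG Key Isolation" service comes into play.
    return request.CreateSelfSigned(
        DateTimeOffset.UtcNow.AddDays(-1), DateTimeOffset.UtcNow.AddDays(1));
}

// Callers dispose the certificate when done, so any ephemeral CNG key
// handles are released promptly instead of accumulating for the lifetime
// of the test process (as a static-constructor-cached cert would).
using X509Certificate2 cert = CreateTestCertificate();
```

If the external service really does accumulate per-key state, scoping the certificate like this (versus a process-lifetime static) would be one way to test the theory.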
When we've seen this in the past, it has meant that the "CNG Key Isolation" service has exited (crashed, probably). Running "sc stop keyiso" will cause the failure to happen, so maybe "sc start keyiso" will fix it... or, the easier answer is to reboot.
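For reference, the service state can be inspected with the built-in Windows service control tool (Windows-only, shown for illustration):

```shell
# Windows only. In PowerShell, use sc.exe explicitly ("sc" alone is an
# alias for Set-Content). Query the CNG Key Isolation service state:
sc.exe query keyiso

# If it reports STOPPED, try restarting it before resorting to a reboot:
sc.exe start keyiso
```

Starting the service requires administrative privileges, which Helix work items may not have; in that case a reboot remains the fallback.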
In the past I have tried quite hard to cause one of these failures using .NET, so I could figure out how we can stop encountering them. I've never been able to do so. The OS team responsible for the service says they only ever see these errors on test machines, and their recommended answer is to just pave the machine and watch the problem go away.
The best short-term fix I can think of is to aggressively reboot the test machines... assuming they're not aggressively rebooted/reimaged/spun up from nothing already. If I could remember the right way to route that inquiry to the engineering first responders, I'd say so here...
Uptime for machines managed by Helix depends on the queue demand. The machines are kept alive until the load diminishes, then the scaleset scales down by removing the oldest machines first.
Out of curiosity, I pulled some machine-uptime data for the last 7 days for windows.amd64.vs2022.pre.open (chosen randomly). I thought it would be interesting to drop here.
@garath Am I reading the data correctly that the longest that particular machine went between reboots was 177 minutes, with a p95 uptime less than 60 minutes on any given day this past week? If so, that does invalidate the assertion that rebooting will address it.
@bartonjs Are there folks you've talked to before from Windows that we could engage on this reoccurrence?
Am I reading the data correctly that the longest that particular machine
Not any single machine, this is the 50th, 75th and 95th percentile for the "uptime" of all machines ever allocated to that queue on a given day.
OK; thanks, @garath. I also checked the max in case there were machines outside the 95th percentile that could have had long uptimes and be the culprits. But the max I saw was 177 minutes; does that sound right to you that the longest any machine went without a reboot was 177 minutes?
Huh, I had it in my head that the average Helix machine had an uptime more like several days (and the lowest-indexed ones in the scale... cluster... thing... having an uptime of "since patches were last installed").
I've reached out to Windows to see if it's a known thing with known workarounds (or potentially a thing they'd want to look at).
had it in my head that the average Helix machine had an uptime more like several days
This could be true for the very, very busy queues. I also believe it was more true a few years ago before we created our own custom autoscaler service.
I'm happy to pull data if there are any questions. Just let me know what you'd like to see.
But the max I saw was 177 minutes; does that sound right to you that the longest any machine went without a reboot was 177 minutes?
Looks like that's exactly correct. Below are the top 10 longest uptimes for the last seven days in the Windows.Amd64.VS2022.Pre.Open queue.
Machine Name | Start | Finish | Uptime (min) |
---|---|---|---|
a000WZD | 2024-08-21T15:23:00.743Z | 2024-08-21T18:20:24.598Z | 177 |
a000Y60 | 2024-08-21T20:21:50.615Z | 2024-08-21T23:07:53.706Z | 166 |
a000VWL | 2024-08-21T06:19:29.636Z | 2024-08-21T08:54:44.321Z | 155 |
a0010PB | 2024-08-22T17:59:14.076Z | 2024-08-22T20:26:39.141Z | 147 |
a000XG7 | 2024-08-21T16:12:38.972Z | 2024-08-21T18:21:45.958Z | 129 |
a000YR8 | 2024-08-21T22:28:16.549Z | 2024-08-22T00:34:19.332Z | 126 |
a0016GG | 2024-08-26T12:45:37.372Z | 2024-08-26T14:46:43.211Z | 121 |
a0016R6 | 2024-08-26T14:16:10.458Z | 2024-08-26T16:17:39.64Z | 121 |
a0014XH | 2024-08-25T11:29:31.126Z | 2024-08-25T13:27:00.391Z | 117 |
a000WWF | 2024-08-21T12:58:14.531Z | 2024-08-21T14:55:30.927Z | 117 |
The exception on aspnetcore builds is slightly different; I doubt it will help, but in case someone is looking for it:
System.Security.Cryptography.CryptographicException: There are no more endpoints available from the endpoint mapper.
at System.Security.Cryptography.X509Certificates.X509CertificateLoader.ImportPfx(ReadOnlySpan`1 data, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags)
at System.Security.Cryptography.X509Certificates.X509CertificateLoader.LoadPkcs12NoLimits(ReadOnlyMemory`1 data, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags, Pkcs12Return& earlyReturn)
at System.Security.Cryptography.X509Certificates.X509CertificateLoader.LoadPkcs12(ReadOnlyMemory`1 data, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags, Pkcs12LoaderLimits loaderLimits)
at System.Security.Cryptography.X509Certificates.X509CertificateLoader.LoadFromFile[T](String path, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags, Pkcs12LoaderLimits loaderLimits, LoadFromFileFunc`1 loader)
at System.Security.Cryptography.X509Certificates.X509CertificateLoader.LoadPkcs12PalFromFile(String path, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags, Pkcs12LoaderLimits loaderLimits)
at System.Security.Cryptography.X509Certificates.CertificatePal.FromBlobOrFile(ReadOnlySpan`1 rawData, String fileName, SafePasswordHandle password, X509KeyStorageFlags keyStorageFlags)
at System.Security.Cryptography.X509Certificates.X509Certificate..ctor(String fileName, String password, X509KeyStorageFlags keyStorageFlags)
at Microsoft.AspNetCore.InternalTesting.TestResources.GetTestCertificate(String certName) in /_/src/Shared/TestResources.cs:line 31
at Interop.FunctionalTests.H2SpecTests.<>c__DisplayClass0_0.<RunIndividualTestCase>b__2(ListenOptions listenOptions) in /_/src/Servers/Kestrel/test/Interop.FunctionalTests/H2SpecTests.cs:line 41
at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerOptions.Listen(IPAddress address, Int32 port, Action`1 configure) in /_/src/Servers/Kestrel/Core/src/KestrelServerOptions.cs:line 485
at Interop.FunctionalTests.H2SpecTests.<>c__DisplayClass0_0.<RunIndividualTestCase>b__1(KestrelServerOptions options) in /_/src/Servers/Kestrel/test/Interop.FunctionalTests/H2SpecTests.cs:line 36
at Microsoft.Extensions.Options.OptionsFactory`1.Create(String name)
at Microsoft.Extensions.Options.UnnamedOptionsManager`1.get_Value()
at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl.CreateServiceContext(IOptions`1 options, ILoggerFactory loggerFactory, DiagnosticSource diagnosticSource, KestrelMetrics metrics) in /_/src/Servers/Kestrel/Core/src/Internal/KestrelServerImpl.cs:line 80
at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl..ctor(IOptions`1 options, IEnumerable`1 transportFactories, IEnumerable`1 multiplexedFactories, IHttpsConfigurationService httpsConfigurationService, ILoggerFactory loggerFactory, KestrelMetrics metrics) in
The active conversation is currently happening here: https://teams.microsoft.com/l/message/19:afba3d1545dd45d7b79f34c1821f6055@thread.skype/1726593921656?tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47&groupId=4d73664c-9f2f-450d-82a5-c2f02756606d&parentMessageId=1723835371653&teamName=.NET%20Core%20Eng%20Services%20Partners&channelName=First%20Responders&createdTime=1726593921656
Build
https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=781694
Build leg reported
Microsoft.NET.Build.Containers.UnitTests.RegistryTests.InsecureRegistry
Pull Request
https://github.com/dotnet/sdk.git/pull/42858
Known issue core information
Fill out the known issue JSON section by following the step-by-step documentation on how to create a known issue.
@dotnet/dnceng
Release Note Category
Release Note Description
Additional information about the issue reported
No response
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=781694
Error message validated: "There are no more endpoints available from the endpoint mapper."
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 8/21/2024 1:53:12 AM UTC