marcwittke opened this issue 3 years ago (status: open).
I'm experiencing this on a self-hosted Azure DevOps build agent, which fails randomly on dotnet commands on .NET 5.0:
[4562346.461844] .NET ThreadPool[870598]: segfault at 18 ip 00007f813e20c892 sp 00007f81281e5000 error 4 in libpthread-2.27.so[7f813e200000+1a000]
[4586429.064024] .NET ThreadPool[1032434]: segfault at 18 ip 00007f6a7b94f892 sp 00007f69ca7f8ba0 error 4 in libpthread-2.27.so[7f6a7b943000+1a000]
[4588177.547456] .NET ThreadPool[1063988]: segfault at 18 ip 00007f06d8288892 sp 00007f062cfaf9e0 error 4 in libpthread-2.27.so[7f06d827c000+1a000]
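The three kernel log lines above fault at different absolute addresses, but relative to where libpthread-2.27.so was loaded they fault at the same instruction. A small C sketch to check that, assuming the x86-64 kernel log format shown (struct segv_info, parse_segv_line, segv_lib_offset and describe_error are hypothetical helpers, not part of any tool used in this thread): it subtracts the library's load base from ip and decodes the x86 page-fault error code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of an x86-64 kernel "segfault at ..." line (hypothetical helper type). */
struct segv_info {
    unsigned long long fault_addr; /* address the thread tried to access */
    unsigned long long ip;         /* faulting instruction pointer */
    unsigned int error;            /* x86 page-fault error code */
    unsigned long long lib_base;   /* load address of the mapping containing ip */
    unsigned long long lib_size;   /* size of that mapping */
};

/* Parses one kernel log line; returns 1 on success, 0 if it doesn't match. */
static int parse_segv_line(const char *line, struct segv_info *out)
{
    unsigned long long sp; /* stack pointer: parsed but not needed here */
    const char *p = strstr(line, "segfault at ");
    if (p == NULL)
        return 0;
    /* %*[^[] skips the library name up to the '[' that opens "base+size]". */
    return sscanf(p, "segfault at %llx ip %llx sp %llx error %u in %*[^[][%llx+%llx]",
                  &out->fault_addr, &out->ip, &sp, &out->error,
                  &out->lib_base, &out->lib_size) == 6;
}

/* Offset of the faulting instruction inside the library; stable across ASLR. */
static unsigned long long segv_lib_offset(const struct segv_info *s)
{
    assert(s->ip >= s->lib_base && s->ip < s->lib_base + s->lib_size);
    return s->ip - s->lib_base;
}

/* Decodes the low three bits of the x86 page-fault error code. */
static void describe_error(unsigned int err, char *buf, size_t len)
{
    snprintf(buf, len, "%s, %s, %s",
             (err & 1) ? "protection violation" : "page not present",
             (err & 2) ? "write" : "read",
             (err & 4) ? "user mode" : "kernel mode");
}
```

For all three lines, segv_lib_offset() gives 0xc892, and describe_error(4, ...) yields "page not present, read, user mode". A read fault at address 0x18 is what dereferencing a field 0x18 bytes into a NULL pointer looks like, which is consistent with the pthread_rwlock_wrlock+0x12 frame analysed later in the thread.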
dotnet is installed on the agent using the installer task:
2021-01-26T15:08:21.4116924Z Found version 5.0.100 in channel "5.0" for user-specified version spec: 5.0.100
2021-01-26T15:08:21.5900281Z Getting URL to download .NET Core sdk version: 5.0.100.
2021-01-26T15:08:21.5937280Z Detecting OS platform to find the correct download package for the OS.
2021-01-26T15:08:21.5958925Z [command]/azp/agent/_work/_tasks/UseDotNet_b0ce7256-7898-45d3-9cb5-176b752bfea6/2.169.2/externals/get-os-distro.sh
2021-01-26T15:08:21.5960531Z Primary:linux-x64
2021-01-26T15:08:21.5961709Z Legacy:ubuntu.18.04-x64
2021-01-26T15:08:21.5963010Z Detected platform (Primary): linux-x64
2021-01-26T15:08:21.5964368Z Detected platform (Legacy): ubuntu.18.04-x64
2021-01-26T15:08:21.5967575Z Version 5.0.100 was found in cache.
2021-01-26T15:08:21.5981248Z Creating global tool path and prepending to PATH.
dotnet --info is not available on the agent itself, as neither a runtime nor an SDK is installed there; we install the SDK with the .NET tool installer task during the build.
That's a bit odd. @marcwittke, could you run dotnet --info as part of the build, after the SDK is installed on the build agent?
@vitek-karas does it ring a bell?
sure:
.NET Core SDK (reflecting any global.json):
Version: 3.1.405
Commit: 65f9d75b1c
Runtime Environment:
OS Name: ubuntu
OS Version: 18.04
OS Platform: Linux
RID: ubuntu.18.04-x64
Base Path: /home/agent/agent/_work/_tool/dotnet/sdk/3.1.405/
Host (useful for support):
Version: 3.1.11
Commit: f5eceb8105
.NET Core SDKs installed:
2.1.805 [/home/agent/agent/_work/_tool/dotnet/sdk]
3.1.100 [/home/agent/agent/_work/_tool/dotnet/sdk]
3.1.404 [/home/agent/agent/_work/_tool/dotnet/sdk]
3.1.405 [/home/agent/agent/_work/_tool/dotnet/sdk]
.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.1.17 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.17 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.0 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.10 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.11 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.1.17 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.0 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.10 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.11 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]
Well, a cleanup wouldn't hurt... Is it safe to delete the _tool folder?
@wli3 Nope, I don't remember anything like this. Maybe @janvorli would know, or at least know who to send this to. A crash dump would be ideal, but I don't know how to get one on Linux in an automated job.
I'm experiencing this on a self-hosted Azure DevOps build agent, which fails randomly on dotnet commands in .NET Core 3.1 projects:
/usr/bin/dotnet build /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister.UnitTests/SFA.DAS.EpaoRegister.UnitTests.csproj -dl:CentralLogger,"/azp/agent/_work/_tasks/DotNetCoreCLI_5541a522-603c-47ad-91fc-a4b1d163081b/2.181.0/dotnet-build-helpers/Microsoft.TeamFoundation.DistributedTask.MSBuild.Logger.dll"*ForwardingLogger,"/azp/agent/_work/_tasks/DotNetCoreCLI_5541a522-603c-47ad-91fc-a4b1d163081b/2.181.0/dotnet-build-helpers/Microsoft.TeamFoundation.DistributedTask.MSBuild.Logger.dll" --configuration release --no-restore
Microsoft (R) Build Engine version 16.7.2+b60ddb6f4 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.
SFA.DAS.SharedOuterApi -> /azp/agent/_work/1/s/src/SFA.DAS.SharedOuterApi/bin/release/netcoreapp3.1/SFA.DAS.SharedOuterApi.dll
SFA.DAS.EpaoRegister -> /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister/bin/release/netcoreapp3.1/SFA.DAS.EpaoRegister.dll
SFA.DAS.EpaoRegister.UnitTests -> /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister.UnitTests/bin/release/netcoreapp3.1/SFA.DAS.EpaoRegister.UnitTests.dll
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:01.15
/usr/bin/dotnet build /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister/SFA.DAS.EpaoRegister.csproj -dl:CentralLogger,"/azp/agent/_work/_tasks/DotNetCoreCLI_5541a522-603c-47ad-91fc-a4b1d163081b/2.181.0/dotnet-build-helpers/Microsoft.TeamFoundation.DistributedTask.MSBuild.Logger.dll"*ForwardingLogger,"/azp/agent/_work/_tasks/DotNetCoreCLI_5541a522-603c-47ad-91fc-a4b1d163081b/2.181.0/dotnet-build-helpers/Microsoft.TeamFoundation.DistributedTask.MSBuild.Logger.dll" --configuration release --no-restore
Microsoft (R) Build Engine version 16.7.2+b60ddb6f4 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.
SFA.DAS.SharedOuterApi -> /azp/agent/_work/1/s/src/SFA.DAS.SharedOuterApi/bin/release/netcoreapp3.1/SFA.DAS.SharedOuterApi.dll
SFA.DAS.EpaoRegister -> /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister/bin/release/netcoreapp3.1/SFA.DAS.EpaoRegister.dll
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:00.78
##[error]Error: The process '/usr/bin/dotnet' failed with exit code null
dotnet --info
root@azure-pipelines-build-agent-75ddfbcc4d-4ntn5:/azp# dotnet --info
.NET Core SDK (reflecting any global.json):
Version: 3.1.405
Commit: 3fae16e62e
Runtime Environment:
OS Name: ubuntu
OS Version: 18.04
OS Platform: Linux
RID: ubuntu.18.04-x64
Base Path: /usr/share/dotnet/sdk/3.1.405/
Host (useful for support):
Version: 3.1.11
Commit: f5eceb8105
.NET Core SDKs installed:
2.2.207 [/usr/share/dotnet/sdk]
3.1.405 [/usr/share/dotnet/sdk]
.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.2.8 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.2.8 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.11 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.2.8 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.11 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
The build succeeds, but because the process returns exit code null, the build task fails.
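Our reading of "exit code null" (an interpretation, not something confirmed in the thread): the agent's task library runs on Node.js, whose child-process exit event reports a null exit code when the child was terminated by a signal instead of exiting. That fits a dotnet process killed by SIGSEGV after the build output was already written. The sketch below shows the same distinction at the POSIX level; run_child and struct child_outcome are hypothetical names.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* How a child ended, mirroring Node's (code, signal) pair. */
struct child_outcome {
    int exited;      /* 1 if the child exited normally */
    int exit_code;   /* meaningful only when exited == 1 ("null" otherwise) */
    int term_signal; /* signal that killed the child, 0 if none */
};

/* Hypothetical helper: fork a child that exits with `code`, or, when
 * `crash` is nonzero, dies from SIGSEGV like the dotnet process did. */
static struct child_outcome run_child(int crash, int code)
{
    struct child_outcome out = {0, -1, 0};
    int status;
    pid_t pid;

    assert(code >= 0 && code <= 255); /* only the low 8 bits survive */
    pid = fork();
    if (pid < 0)
        return out;
    if (pid == 0) {
        if (crash) {
            struct rlimit no_core = {0, 0};
            setrlimit(RLIMIT_CORE, &no_core); /* avoid leaving a core file */
            signal(SIGSEGV, SIG_DFL);
            raise(SIGSEGV);
        }
        _exit(code);
    }
    if (waitpid(pid, &status, 0) < 0)
        return out;
    if (WIFEXITED(status)) {
        out.exited = 1;
        out.exit_code = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        out.term_signal = WTERMSIG(status);
    }
    return out;
}
```

Here run_child(0, 3) reports exited == 1 with exit code 3, while run_child(1, 0) reports exited == 0 and term_signal == SIGSEGV, so there is no exit code to report. A shell would show the crashed child as $? = 139, that is 128 + 11, SIGSEGV's number on Linux.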
Tagging subscribers to this area: @vitek-karas, @agocke. See info in area-owners.md if you want to be subscribed.
Author: marcwittke
Assignees: -
Labels: `area-Host`, `untriaged`
Milestone: -
AspNetCore is hitting an issue that looks very similar to this. We run some tests, then call Environment.Exit(0), and hit a segfault.
We have crash dumps at https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-aspnetcore-refs-heads-main-bd6750238a114336b0/Microsoft.AspNetCore.Localization.Tests--net6.0/core.1000.9653?sv=2019-07-07&se=2021-04-13T17%3A14%3A17Z&sr=c&sp=rl&sig=2cdAaIh4bXj5NtvyeG%2FSxJtayROazUADEGUgmDPsOJM%3D and https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-aspnetcore-refs-heads-main-5f556472c38b49a59c/Microsoft.AspNetCore.Mvc.Abstractions.Test--net6.0/core.1000.22884?sv=2019-07-07&se=2021-04-06T23%3A42%3A17Z&sr=c&sp=rl&sig=PiaeRVjWTySpvofLo3Yofn6EAf3RZMRV89VI1uoFXLA%3D
Both show the thread that segfaulted at an address that looks like it is in the address space of the libpthread-2.27.so module.
The dumps will be around for a week or 2.
I'll take a look at the dumps.
@BrennanConroy what is the distro that the dumps came from?
Helix queue ubuntu.1804.amd64.open
For the first link: Runtime 6.0.0-preview.3.21167.1 Sdk 6.0.100-preview.3.21168.19
What I can see in the dump is that the main thread has already exited, and the crashing secondary thread is running OpenSSL code: the lock address inside libcrypto that is passed to CRYPTO_THREAD_write_lock is NULL. This looks like the same issue as https://github.com/dotnet/runtime/issues/34231, except that this time it doesn't stem from ERR_reason_error_string as in that issue, but from the following:
(lldb) clrstack -f
OS Thread Id: 0x25be (1)
Child SP IP Call Site
00007FA9E95108C0 00007FA9F1249892 libpthread.so.0!__pthread_rwlock_wrlock + 18
00007FA9E9510900 00007FA975A91989 libcrypto.so.1.1!CRYPTO_THREAD_write_lock + 9
00007FA9E9510910 00007FA975A53013 libcrypto.so.1.1!RAND_get_rand_method + 51
00007FA9E9510930 00007FA975A5333E libcrypto.so.1.1!RAND_priv_bytes + 14
00007FA9E9510950 00007FA9759759BD libcrypto.so.1.1!___lldb_unnamed_symbol375$$libcrypto.so.1.1 + 413
00007FA9E95109C0 00007FA975975B96 libcrypto.so.1.1!___lldb_unnamed_symbol376$$libcrypto.so.1.1 + 166
00007FA9E9510A10 00007FA975A0095B libcrypto.so.1.1!___lldb_unnamed_symbol984$$libcrypto.so.1.1 + 91
00007FA9E9510A50 00007FA9759BF41A libcrypto.so.1.1!___lldb_unnamed_symbol795$$libcrypto.so.1.1 + 906
00007FA9E9510AC0 00007FA9759BFD5D libcrypto.so.1.1!___lldb_unnamed_symbol796$$libcrypto.so.1.1 + 1229
00007FA9E9510BA0 00007FA9759BEDA4 libcrypto.so.1.1!EC_POINTs_mul + 324
00007FA9E9510C00 00007FA9759BEE10 libcrypto.so.1.1!EC_POINT_mul + 64
00007FA9E9510C40 00007FA9759C24DF libcrypto.so.1.1!___lldb_unnamed_symbol811$$libcrypto.so.1.1 + 175
00007FA9E9510CA0 00007FA9759BCD49 libcrypto.so.1.1!ECDH_compute_key + 89
00007FA9E9510D00 00007FA9759C18BC libcrypto.so.1.1!___lldb_unnamed_symbol802$$libcrypto.so.1.1 + 76
00007FA9E9510D20 00007FA9759C1A35 libcrypto.so.1.1!___lldb_unnamed_symbol803$$libcrypto.so.1.1 + 245
00007FA9E9510D80 00007FA975DA9317 libssl.so.1.1!___lldb_unnamed_symbol195$$libssl.so.1.1 + 343
00007FA9E9510DC0 00007FA975DCB304 libssl.so.1.1!___lldb_unnamed_symbol509$$libssl.so.1.1 + 1028
00007FA9E9510E10 00007FA975DC9157 libssl.so.1.1!___lldb_unnamed_symbol488$$libssl.so.1.1 + 1383
00007FA9E9510EE0 00007FA975DB54C4 libssl.so.1.1!SSL_do_handshake + 84
00007FA9E9510EE0 00007FA975DB54C4 libssl.so.1.1!SSL_do_handshake + 84
00007FA9E9510F20 00007FA97A6BB20E
00007FA9E9510F30 [InlinedCallFrame: 00007fa9e9510f30] System.Net.Security.dll!Interop+Ssl.SslDoHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle)
00007FA9E9510F30 [InlinedCallFrame: 00007fa9e9510f30] System.Net.Security.dll!Interop+Ssl.SslDoHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle)
00007FA9E9510F20 00007FA97A6BB20E System.Diagnostics.Process.dll!ILStubClass.IL_STUB_PInvoke(Microsoft.Win32.SafeHandles.SafeSslHandle) + 142
00007FA9E9510FC0 00007FA978D39EF2 System.Net.Security.dll!Interop+OpenSsl.DoSslHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle, System.ReadOnlySpan`1<Byte>, Byte[] ByRef, Int32 ByRef) + 130
00007FA9E9511020 00007FA978D39168 System.Net.Security.dll!System.Net.Security.SslStreamPal.HandshakeInternal(System.Net.Security.SafeFreeCredentials, System.Net.Security.SafeDeleteSslContext ByRef, System.ReadOnlySpan`1<Byte>, Byte[] ByRef, System.Net.Security.SslAuthenticationOptions) + 168
00007FA9E95110D0 00007FA978D3791A System.Net.Security.dll!System.Net.Security.SecureChannel.GenerateToken(System.ReadOnlySpan`1<Byte>, Byte[] ByRef) + 138
00007FA9E9511140 00007FA978D3770E System.Net.Security.dll!System.Net.Security.SecureChannel.NextMessage(System.ReadOnlySpan`1<Byte>) + 62
00007FA9E9511190 00007FA978D3ABA7 System.Net.Security.dll!System.Net.Security.SslStream.ProcessBlob(Int32) + 327
00007FA9E9511200 00007FA978D63E66 System.Net.Security.dll!System.Net.Security.SslStream+<ReceiveBlobAsync>d__172`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]].MoveNext() + 2230
00007FA9E95113D0 00007FA97A6BF2C0 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.__Canon, System.Private.CoreLib],[System.Net.Security.SslStream+<ReceiveBlobAsync>d__172`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].ExecutionContextCallback(System.Object) + 128 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 287]
00007FA9E9511410 00007FA97A6D2DF5 System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 149 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 208]
00007FA9E9511460 00007FA97A6BF0E0 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.__Canon, System.Private.CoreLib],[System.Net.Security.SslStream+<ReceiveBlobAsync>d__172`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].MoveNext(System.Threading.Thread) + 288 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 336]
00007FA9E95114E0 00007FA97A6BEF99 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.__Canon, System.Private.CoreLib],[System.Net.Security.SslStream+<ReceiveBlobAsync>d__172`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].MoveNext() + 25 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 302]
00007FA9E9511500 00007FA97A6D2FC6 System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean) + 214 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/TaskContinuation.cs @ 805]
00007FA9E9511540 00007FA97A6D2554 System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(System.Object) + 212 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 3472]
00007FA9E95115F0 00007FA9763E4970 System.Private.CoreLib.dll!System.Threading.Tasks.Task`1[[System.Int32, System.Private.CoreLib]].TrySetResult(Int32) + 144 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Future.cs @ 404]
00007FA9E9511620 00007FA9763E8BB6 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Int32, System.Private.CoreLib]].SetExistingTaskResult(System.Threading.Tasks.Task`1<Int32>, Int32) + 86 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 443]
00007FA9E9511650 00007FA9763E8D24 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[[System.Int32, System.Private.CoreLib]].SetResult(Int32) + 116 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncValueTaskMethodBuilderT.cs @ 67]
00007FA9E9511680 00007FA978D68248 System.Net.Security.dll!System.Net.Security.SslStream+<<FillHandshakeBufferAsync>g__InternalFillHandshakeBufferAsync|181_0>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]].MoveNext() + 488
00007FA9E9511730 00007FA97A6BEF5E System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[System.Net.Security.SslStream+<<FillHandshakeBufferAsync>g__InternalFillHandshakeBufferAsync|181_0>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].ExecutionContextCallback(System.Object) + 62 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 287]
00007FA9E9511750 00007FA97A6D2DF5 System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 149 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 208]
00007FA9E95117A0 00007FA97A6BEE29 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[System.Net.Security.SslStream+<<FillHandshakeBufferAsync>g__InternalFillHandshakeBufferAsync|181_0>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].MoveNext(System.Threading.Thread) + 217 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 336]
00007FA9E95117F0 00007FA97A6BED29 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[System.Net.Security.SslStream+<<FillHandshakeBufferAsync>g__InternalFillHandshakeBufferAsync|181_0>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].MoveNext() + 25 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 302]
00007FA9E9511810 00007FA9762BA852 System.Private.CoreLib.dll!System.Threading.ThreadPool+<>c.<.cctor>b__82_0(System.Object) + 34 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 1055]
00007FA9E9511820 00007FA97A943A29 System.Net.Sockets.dll!System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation(System.Action`1<System.Object>, System.Object, Boolean, Boolean) + 361
00007FA9E9511870 00007FA97A9437E3 System.Net.Sockets.dll!System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.OnCompleted(System.Net.Sockets.SocketAsyncEventArgs) + 179
00007FA9E95118D0 00007FA97A95A6C3 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.OnCompletedInternal() + 83
00007FA9E95118F0 00007FA97A9446BE System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.FinishOperationAsyncSuccess(Int32, System.Net.Sockets.SocketFlags) + 46
00007FA9E9511910 00007FA97A945BB6 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.TransferCompletionCallbackCore(Int32, Byte[], Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError) + 54
00007FA9E9511940 00007FA97A945AF4 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+BufferMemoryReceiveOperation.InvokeCallback(Boolean) + 132
00007FA9E9511990 00007FA97A964B2B System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].ProcessAsyncOperation(System.__Canon) + 91
00007FA9E95119C0 00007FA97A945917 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+ReadOperation.System.Threading.IThreadPoolWorkItem.Execute() + 39
00007FA9E95119D0 00007FA97A944588 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext.HandleEvents(SocketEvents) + 120
00007FA9E9511A00 00007FA97A9444B1 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute() + 129
00007FA9E9511A40 00007FA97A6D6EAC System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch() + 364 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 769]
00007FA9E9511AC0 00007FA9762CF8C8 System.Private.CoreLib.dll!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart() + 264 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs @ 58]
00007FA9E9511B80 00007FA9762B6028 System.Private.CoreLib.dll!System.Threading.Thread.StartCallback() + 104 [/_/src/coreclr/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 105]
00007FA9E9511BA0 00007FA9EFF60487 libcoreclr.so!___lldb_unnamed_symbol9589$$libcoreclr.so + 124
00007FA9E9511BC0 00007FA9EFDBF1CE libcoreclr.so!___lldb_unnamed_symbol4452$$libcoreclr.so + 254
00007FA9E9511C50 00007FA9EFDD0372 libcoreclr.so!___lldb_unnamed_symbol4638$$libcoreclr.so + 146
00007FA9E9511CA0 00007FA9EFD8680A libcoreclr.so!___lldb_unnamed_symbol3792$$libcoreclr.so + 330
00007FA9E9511CF0 [DebuggerU2MCatchHandlerFrame: 00007fa9e9511cf0]
00007FA9E9511DC0 00007FA9EFD86E0D libcoreclr.so!___lldb_unnamed_symbol3793$$libcoreclr.so + 45
00007FA9E9511DF0 00007FA9EFDD044C libcoreclr.so!___lldb_unnamed_symbol4639$$libcoreclr.so + 188
00007FA9E9511E50 00007FA9F00F3B0E libcoreclr.so!___lldb_unnamed_symbol15450$$libcoreclr.so + 590
00007FA9E9511F00 00007FA9F12446DB libpthread.so.0!start_thread + 219
00007FA9E9511FC0 00007FA9F042A71F libc.so.6!__clone + 63
cc: @bartonjs
Given that Ubuntu 18.04 has explicitly removed support for NO_ATEXIT, I worry we'll end up just finding one intermittent problem after another. The previous fix assumed that everything other than the string table was graceful about post-exit calls, but apparently calls into the RNG hit a failure while trying to reinitialize it.
Feels like our choices are:
Is it feasible/useful to offer a change to OpenSSL?
Although perhaps this is a problem others might have to solve when interopping with a different native library that has similar expectations.
Is it feasible/useful to offer a change to OpenSSL?
OpenSSL supports the scenario, and we opt into it (OPENSSL_INIT_NO_ATEXIT):
The Ubuntu 18.04 build... somewhere that I found before, didn't write down, and am having trouble finding again... explicitly removes support for that option.
Ah, got it. And later versions, 20.04 etc.?
Change the shim to guard every function with an if-shutting-down-exit while using something like interlocked increment/decrement to notify the atexit handler that we can release the library for further shutdown
This seems like the only reasonable possibility. The next critical thing to know is whether this also affects 20.04+. If it does, a fix becomes more important; if it doesn't, upgrading to 20.04 or later is presumably an option for many 18.04 customers.
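A rough sketch of the guard proposed above, assuming C11 atomics: each exported shim function brackets its OpenSSL work with shim_enter/shim_leave, and an exit handler sets a shutting-down flag and waits for the in-flight counter to reach zero before OpenSSL's own cleanup runs. All names here are hypothetical (this is not the actual System.Security.Cryptography.Native shim), and the multiplication in shim_example_call stands in for a real OpenSSL call.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical process-wide shim state. With the default seq_cst ordering,
 * either a caller observes the flag, or the exit handler observes that
 * caller's increment and waits for it. */
static atomic_bool shim_shutting_down = false;
static atomic_int shim_in_flight = 0;

/* Call at the top of every exported shim function. */
static bool shim_enter(void)
{
    atomic_fetch_add(&shim_in_flight, 1);
    if (atomic_load(&shim_shutting_down)) {
        atomic_fetch_sub(&shim_in_flight, 1);
        return false; /* exit() is in progress: don't touch OpenSSL */
    }
    return true;
}

/* Call before every return from a function that entered successfully. */
static void shim_leave(void)
{
    int previous = atomic_fetch_sub(&shim_in_flight, 1);
    assert(previous > 0);
    (void)previous;
}

/* Exit handler: refuse new calls, then wait for in-flight calls to drain.
 * A real version would need a timeout in case a call blocks indefinitely. */
static void shim_on_exit(void)
{
    atomic_store(&shim_shutting_down, true);
    while (atomic_load(&shim_in_flight) != 0)
        sched_yield();
}

/* Register after OpenSSL has been initialized: atexit handlers run in
 * reverse order of registration, so this one runs before OpenSSL's. */
static int shim_install(void)
{
    return atexit(shim_on_exit);
}

/* Example export; the multiplication stands in for a real OpenSSL call. */
static int shim_example_call(int x)
{
    int result;
    if (!shim_enter())
        return -1;
    result = x * 2;
    shim_leave();
    return result;
}
```

After shim_on_exit() has run, shim_example_call(21) returns -1 instead of 42 without touching OpenSSL. The cost is two atomic operations per call; the mutex/rwlock alternative discussed further down in this thread would serialize, or partially serialize, calls instead.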
@bartonjs, do we know how to find that out? Here's what I have on my 20.04 machine after running apt-get upgrade:
dan@LAPTOP-P6UJDVTA:/usr$ file ./lib/x86_64-linux-gnu/libcrypto.so.1.1
./lib/x86_64-linux-gnu/libcrypto.so.1.1: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=d30abd770d1215fff0f9a0fa9f12b1de5b50da29, stripped
dan@LAPTOP-P6UJDVTA:/usr$ file ./lib/x86_64-linux-gnu/libssl.so.1.1
./lib/x86_64-linux-gnu/libssl.so.1.1: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=4ef02cf97dd73cb0a88495e6dbf584dd6aa5aa22, stripped
From the above info I'm unsure how to determine.
Those SHAs aren't in the OpenSSL repo, and it's not clear where in https://launchpad.net/ubuntu to find the sources Ubuntu used.
Anyway I don't know what to look for.
https://packages.ubuntu.com/source/focal/openssl says that Focal is based on OpenSSL 1.1.1f (plus servicing patches), and in 1.1.1f the source looked like
So if you do something like
$ lldb /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
(lldb) target create "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1"
Current executable set to '/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1' (x86_64).
(lldb) dis -n OPENSSL_init_crypto
libcrypto.so.1.1`OPENSSL_init_crypto:
libcrypto.so.1.1[0x176bc0] <+0>: pushq %rbp
libcrypto.so.1.1[0x176bc1] <+1>: pushq %rbx
libcrypto.so.1.1[0x176bc2] <+2>: movq %rdi, %rbx
libcrypto.so.1.1[0x176bc5] <+5>: subq $0x8, %rsp
libcrypto.so.1.1[0x176bc9] <+9>: movl 0x352c59(%rip), %eax
libcrypto.so.1.1[0x176bcf] <+15>: testl %eax, %eax
libcrypto.so.1.1[0x176bd1] <+17>: je 0x176c18 ; <+88>
libcrypto.so.1.1[0x176bd3] <+19>: testl $0x40000, %edi ; imm = 0x40000
libcrypto.so.1.1[0x176bd9] <+25>: je 0x176bf0 ; <+48>
libcrypto.so.1.1[0x176bdb] <+27>: xorl %ebp, %ebp
libcrypto.so.1.1[0x176bdd] <+29>: addq $0x8, %rsp
libcrypto.so.1.1[0x176be1] <+33>: movl %ebp, %eax
libcrypto.so.1.1[0x176be3] <+35>: popq %rbx
libcrypto.so.1.1[0x176be4] <+36>: popq %rbp
libcrypto.so.1.1[0x176be5] <+37>: retq
libcrypto.so.1.1[0x176be6] <+38>: nopw %cs:(%rax,%rax)
libcrypto.so.1.1[0x176bf0] <+48>: leaq 0xc23cc(%rip), %rcx
libcrypto.so.1.1[0x176bf7] <+55>: movl $0x252, %r8d ; imm = 0x252
libcrypto.so.1.1[0x176bfd] <+61>: movl $0x46, %edx
libcrypto.so.1.1[0x176c02] <+66>: movl $0x74, %esi
libcrypto.so.1.1[0x176c07] <+71>: movl $0xf, %edi
libcrypto.so.1.1[0x176c0c] <+76>: callq 0x1580e0 ; ERR_put_error
libcrypto.so.1.1[0x176c11] <+81>: jmp 0x176bdb ; <+27>
libcrypto.so.1.1[0x176c13] <+83>: nopl (%rax,%rax)
libcrypto.so.1.1[0x176c18] <+88>: movq %rsi, %rbp
libcrypto.so.1.1[0x176c1b] <+91>: leaq 0x352bee(%rip), %rdi
libcrypto.so.1.1[0x176c22] <+98>: leaq -0x2e9(%rip), %rsi ; ___lldb_unnamed_symbol1395$$libcrypto.so.1.1
libcrypto.so.1.1[0x176c29] <+105>: callq 0x1df9f0 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x176c2e] <+110>: testl %eax, %eax
libcrypto.so.1.1[0x176c30] <+112>: je 0x176bdb ; <+27>
libcrypto.so.1.1[0x176c32] <+114>: movl 0x352bd0(%rip), %eax
libcrypto.so.1.1[0x176c38] <+120>: testl %eax, %eax
libcrypto.so.1.1[0x176c3a] <+122>: je 0x176bdb ; <+27>
libcrypto.so.1.1[0x176c3c] <+124>: testl $0x40000, %ebx ; imm = 0x40000
libcrypto.so.1.1[0x176c42] <+130>: je 0x176d70 ; <+432>
libcrypto.so.1.1[0x176c48] <+136>: testb $0x1, %bl
libcrypto.so.1.1[0x176c4b] <+139>: jne 0x176d10 ; <+336>
libcrypto.so.1.1[0x176c51] <+145>: testb $0x2, %bl
libcrypto.so.1.1[0x176c54] <+148>: jne 0x176d40 ; <+384>
libcrypto.so.1.1[0x176c5a] <+154>: testb $0x10, %bl
libcrypto.so.1.1[0x176c5d] <+157>: jne 0x176da0 ; <+480>
libcrypto.so.1.1[0x176c63] <+163>: testb $0x4, %bl
libcrypto.so.1.1[0x176c66] <+166>: jne 0x176dd0 ; <+528>
libcrypto.so.1.1[0x176c6c] <+172>: testb $0x20, %bl
libcrypto.so.1.1[0x176c6f] <+175>: jne 0x176e00 ; <+576>
libcrypto.so.1.1[0x176c75] <+181>: testb $0x8, %bl
libcrypto.so.1.1[0x176c78] <+184>: jne 0x176e2e ; <+622>
libcrypto.so.1.1[0x176c7e] <+190>: testl $0x20000, %ebx ; imm = 0x20000
(from Ubuntu 18.04)
Run the same thing on 20.04, and hopefully there'll be something that looks like a test for 0x80000 (OPENSSL_INIT_NO_ATEXIT). If so, the problem is simply gone on 20.04.
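To make the search for 0x80000 concrete: the immediates in the testl/testb instructions of OPENSSL_init_crypto are OPENSSL_INIT_* option bits. The table below lists the ones that appear in the 18.04 listing plus NO_ATEXIT, with values as we understand them from OpenSSL 1.1.1's crypto.h (0x40000, OPENSSL_INIT_BASE_ONLY, is internal to libcrypto); init_flag_name and checks_no_atexit are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Subset of OpenSSL 1.1.1's OPENSSL_INIT_* option bits (values as we
 * understand them from crypto.h; BASE_ONLY is internal to libcrypto). */
struct init_flag {
    unsigned long bit;
    const char *name;
};

static const struct init_flag k_init_flags[] = {
    {0x00000001UL, "OPENSSL_INIT_NO_LOAD_CRYPTO_STRINGS"},
    {0x00000002UL, "OPENSSL_INIT_LOAD_CRYPTO_STRINGS"},
    {0x00000004UL, "OPENSSL_INIT_ADD_ALL_CIPHERS"},
    {0x00000008UL, "OPENSSL_INIT_ADD_ALL_DIGESTS"},
    {0x00000010UL, "OPENSSL_INIT_NO_ADD_ALL_CIPHERS"},
    {0x00000020UL, "OPENSSL_INIT_NO_ADD_ALL_DIGESTS"},
    {0x00020000UL, "OPENSSL_INIT_ATFORK"},
    {0x00040000UL, "OPENSSL_INIT_BASE_ONLY"},
    {0x00080000UL, "OPENSSL_INIT_NO_ATEXIT"},
};

#define NO_ATEXIT_BIT 0x00080000UL

/* Name for a single option bit, or NULL if it isn't in the table above. */
static const char *init_flag_name(unsigned long bit)
{
    size_t i;
    assert(bit != 0 && (bit & (bit - 1)) == 0); /* exactly one bit set */
    for (i = 0; i < sizeof k_init_flags / sizeof k_init_flags[0]; i++)
        if (k_init_flags[i].bit == bit)
            return k_init_flags[i].name;
    return NULL;
}

/* Given the immediates of the test instructions seen in a disassembly of
 * OPENSSL_init_crypto, report whether any of them checks NO_ATEXIT. */
static int checks_no_atexit(const unsigned long *imms, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (imms[i] & NO_ATEXIT_BIT)
            return 1;
    return 0;
}
```

Feeding checks_no_atexit the immediates from the 18.04 listing (0x40000, 0x1, 0x2, 0x10, 0x4, 0x20, 0x8, 0x20000) returns 0; an OPENSSL_init_crypto that honors the flag should contain a test whose immediate includes 0x80000.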
I've previously said that Ubuntu "removed" the support. Looking again, I don't see a patch that removes the support... but I also don't see one that adds it. The OPENSSL_INIT_NO_ATEXIT support was backported for OpenSSL 1.1.1b. It looks like Ubuntu 18.04 is 1.1.1 (RTM) plus servicing, and their servicing did something other than "catch up to 1.1.1-stable".
@bartonjs I see a test against 0x80000 ...
::kermit arms:: Yaaaaaaay!
Looks like we won't have a problem on 20.04. Hopefully that's enough to avoid needing to add our own locking/refcounting/whatever to literally every shim method.
@marcwittke, is it possible for you to try on Ubuntu 20.04 or later? We think that will fix it. It's an issue in libcrypto on 18.04.
I think so. We have two agents running on 18.04 right now. I'll update one of them to 20.04 and we'll see. Since the failure is intermittent, I think in a week I can tell you whether it helped or not.
We've upgraded our build agents to Ubuntu 20.04 a week ago (after @danmoseley's reply) and so far we haven't experienced this error anymore.
Great, when @marcwittke can confirm also, we can close this.
Seems to be fixed in Ubuntu 20.04: we've had no more segfaults on our upgraded build agent.
So to be clear, 18.04 is still buggy and 20.04 is fixed?
Yeah. The problem is that the build of OpenSSL on Ubuntu 18.04 doesn't respect the OPENSSL_INIT_NO_ATEXIT flag, so it starts tearing down OpenSSL locks and statics when the main thread exits, but .NET background threads can still be calling into OpenSSL.
The 20.04 build has NO_ATEXIT support.
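A simplified, single-threaded model of that failure mode, assuming a lazily created lock like rand_meth_lock in RAND_get_rand_method: the exit-time cleanup frees the lock and sets the pointer to NULL, but the run-once guard has already fired, so later callers get NULL rather than a new lock. fake_lock, get_method_lock, lib_cleanup and background_call are hypothetical names, not OpenSSL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lock that the library allocates lazily. */
struct fake_lock {
    int state;
};

static bool lock_init_done = false; /* stands in for a run-once guard */
static struct fake_lock *method_lock = NULL;

/* Creates the lock exactly once, in the spirit of CRYPTO_THREAD_run_once. */
static struct fake_lock *get_method_lock(void)
{
    if (!lock_init_done) {
        method_lock = calloc(1, sizeof *method_lock);
        assert(method_lock != NULL);
        lock_init_done = true;
    }
    /* After lib_cleanup() this returns NULL: the once-guard never resets. */
    return method_lock;
}

/* What exit-time cleanup does when the NO_ATEXIT request is ignored. */
static void lib_cleanup(void)
{
    free(method_lock);
    method_lock = NULL;
}

/* A background caller. Unlike the real code, it checks for NULL; the real
 * code hands the NULL lock to pthread_rwlock_wrlock and faults. */
static int background_call(void)
{
    struct fake_lock *lock = get_method_lock();
    if (lock == NULL)
        return -1;
    lock->state++;
    return 0;
}
```

In the real library nothing checks for NULL: CRYPTO_THREAD_write_lock passes it straight to pthread_rwlock_wrlock, which matches the CRYPTO_THREAD_write_lock and pthread_rwlock_wrlock frames at the top of the crash stacks in this thread.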
I'm not sure we'd arrived at consensus that we wouldn't take a fix here ...18.04 is supported until 2028 and we'll presumably support it in .NET 7. This is also causing our automated tests to crash periodically.
I'll leave this open for other customers to comment on the impact. But the recommendation above remains: move to 20.04 if affected.
As seen in https://github.com/dotnet/sdk/pull/22872#issuecomment-988636583, this seems to be https://github.com/dotnet/runtime/issues/48411, which happens on 18.04.
The stack:
00 00007f0e`d15f7c10 00007f0f`305b5959 libpthread_2_27!pthread_rwlock_wrlock+0x12
01 00007f0e`d15f7c50 00007f0f`30577013 libcrypto_so_1!CRYPTO_THREAD_write_lock+0x9
02 00007f0e`d15f7c60 00007f0f`305772f0 libcrypto_so_1!RAND_get_rand_method+0x33
03 00007f0e`d15f7c80 00007f0f`3053449f libcrypto_so_1!RAND_bytes+0x10
04 00007f0e`d15f7ca0 00007f0f`30542a97 libcrypto_so_1!EVP_MD_CTX_ctrl+0x132f
05 00007f0e`d15f7cd0 00007f0f`a31a5804 libcrypto_so_1!EVP_CIPHER_CTX_ctrl+0x17
06 00007f0e`d15f7ce0 00007f0f`a319761a libssl_so_1!SSL_in_before+0x13bd4
07 00007f0e`d15f7e30 00007f0f`a3192006 libssl_so_1!SSL_in_before+0x59ea
08 00007f0e`d15f7e40 00007f0f`a317e4e4 libssl_so_1!SSL_in_before+0x3d6
09 00007f0e`d15f7f10 00007f0f`334590a0 libssl_so_1!SSL_do_handshake+0x54
0a 00007f0e`d15f7f50 00007f0f`334571b6 Interop+Ssl.<SslDoHandshake>g____PInvoke__|26_0(IntPtr)+0x40
0b 00007f0e`d15f7ff0 00007f0f`3345bdec System_Net_Security!Interop+Ssl.SslDoHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle)+0x56 [/_/src/libraries/System.Net.Security/src/Microsoft.Interop.DllImportGenerator/Microsoft.Interop.DllImportGenerator/GeneratedDllImports.g.cs @ 3487]
0c 00007f0e`d15f8030 00007f0f`3347a3e5 System_Net_Security!Interop+OpenSsl.DoSslHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle, System.ReadOnlySpan`1<Byte>, Byte[] ByRef, Int32 ByRef)+0x8c [/_/src/libraries/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.OpenSsl.cs @ 338]
0d 00007f0e`d15f80a0 00007f0f`33467c75 System_Net_Security!System.Net.Security.SslStreamPal.HandshakeInternal(System.Net.Security.SafeFreeCredentials, System.Net.Security.SafeDeleteSslContext ByRef, System.ReadOnlySpan`1<Byte>, Byte[] ByRef, System.Net.Security.SslAuthenticationOptions)+0xb5 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SslStreamPal.Unix.cs @ 161]
0e 00007f0e`d15f8170 00007f0f`33467989 System_Net_Security!System.Net.Security.SecureChannel.GenerateToken(System.ReadOnlySpan`1<Byte>, Byte[] ByRef)+0x155 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SecureChannel.cs @ 803]
0f 00007f0e`d15f8210 00007f0f`3346ced7 System_Net_Security!System.Net.Security.SecureChannel.NextMessage(System.ReadOnlySpan`1<Byte>)+0x39 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SecureChannel.cs @ 725]
10 00007f0e`d15f8280 00007f0f`3347e742 System_Net_Security!System.Net.Security.SslStream.ProcessBlob(Int32)+0x157 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs @ 593]
11 00007f0e`d15f8310 00000000`00000000 System_Net_Security!System.Net.Security.SslStream+<ReceiveBlobAsync>d__174`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]].MoveNext()+0x9a2 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs @ 555]
pthread_rwlock_wrlock has the following disassembly from entry to faulting point:
libpthread_2_27!pthread_rwlock_wrlock:
00007f0f`ac2fc880 4157 push r15
00007f0f`ac2fc882 4156 push r14
00007f0f`ac2fc884 4155 push r13
00007f0f`ac2fc886 4154 push r12
00007f0f`ac2fc888 55 push rbp
00007f0f`ac2fc889 53 push rbx
00007f0f`ac2fc88a 4889fb mov rbx,rdi
00007f0f`ac2fc88d 4883ec08 sub rsp,8
00007f0f`ac2fc891 90 nop
00007f0f`ac2fc892 8b5718 mov edx,dword ptr [rdi+18h]
The segv is from reading RDI + 0x18 = 0x18, i.e. RDI is 0 (RBX and RDX are indeed 0). Under the SysV ABI, RDI holds the first argument, which here is a pthread_rwlock_t*. That pointer is passed in from https://github.com/openssl/openssl/blob/b1553c89285cb05a28d185423bc3df9b505db92a/crypto/threads_pthread.c#L75-L86, called from RAND_get_rand_method with a C-static lock, rand_meth_lock, which doesn't support reinitialization in 18.04.
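The arithmetic behind "RDI + 0x18 = 0x18" can be spelled out. The struct below is a hypothetical mirror of what we believe is the start of glibc 2.27's x86-64 pthread_rwlock_t internals (an assumption, not checked against the Bionic package): six 4-byte fields put cur_writer at offset 0x18. That field appears to be what the self-deadlock check at the top of pthread_rwlock_wrlock reads, so a fault at address 0x18 implies a NULL lock pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the leading fields of glibc 2.27's x86-64
 * struct __pthread_rwlock_arch_t (assumed layout; offsets only). */
struct rwlock_layout {
    unsigned int readers;
    unsigned int writers;
    unsigned int wrphase_futex;
    unsigned int writers_futex;
    unsigned int pad3;
    unsigned int pad4;
    int cur_writer; /* compared with the caller's TID to detect self-deadlock */
    int shared;
};

/* The faulting instruction is `mov edx, dword ptr [rdi+18h]`. */
_Static_assert(offsetof(struct rwlock_layout, cur_writer) == 0x18,
               "cur_writer is expected at offset 0x18");

/* Recover the struct pointer (RDI) from a fault address and a field offset. */
static uintptr_t implied_struct_base(uintptr_t fault_addr, size_t field_offset)
{
    assert(fault_addr >= field_offset);
    return fault_addr - field_offset;
}
```

With the fault address 0x18 and the cur_writer offset, implied_struct_base returns 0, which is the NULL rand_meth_lock described above.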
I'm not sure we'd arrived at consensus that we wouldn't take a fix here
The only complete fix we could take would be to run literally every shim function that calls into OpenSSL under the same mutex we use for loading exception strings, to work around applications doing work on background threads after the main thread has exited (because these crashes only happen after exit() has been called / main() has exited).
The biggest offender seems to be SSL_do_handshake, so we /might/ be able to start the game of whack-a-mole by putting TLS handshakes under the mutex, but I don't think the networking team would like that. (We could probably change our mutex to a rwlock so we don't completely kill parallelism for TLS handshakes, but it's still not free.)
I've also not tried working with Canonical to get them to just patch in the support for OPENSSL_NO_ATEXIT. @richlander do you have any contacts there?
I do. Hey @wiswaud -- can you get us a contact at Canonical who can help us with some OpenSSL issues on Ubuntu 18.04?
I was given an official Canonical account to report issues via their tracker. That was quick.
@bartonjs Can you write a succinct description of the issue that I can copy/paste into the Canonical tracker?
@richlander How's this?
Bionic's OpenSSL 1.1.1 package (https://launchpad.net/ubuntu/bionic/+source/openssl) is the only OpenSSL 1.1.1 build on any distro we've encountered that does not support the OPENSSL_NO_ATEXIT functionality from 1.1.1b (https://github.com/openssl/openssl/commit/c2b3db245452f185948b4f767f7e1051b6bd59a7).
In .NET's threading model, background threads may still be running when exit() is called, which can cause a SIGSEGV if a background thread interacts with OpenSSL while or after it is unloaded. For that reason, we always initialize OpenSSL 1.1.1 with the OPENSSL_NO_ATEXIT flag (which, of all the distros we run on, has no effect only on Bionic).
We feel that the stability of applications on Ubuntu 18.04 would be improved if the OPENSSL_NO_ATEXIT functionality were merged into the Bionic openssl 1.1.1 package, even if the constant isn't published in the dev package's header.
Perfect! Thanks much.
I have been hitting this crash recently on my main devbox, which runs Ubuntu 18.04. It only started recently, though: as recently as a month ago the build was stable. So maybe something changed in MSBuild that makes this occur much more frequently.
If you have a good repro, that would be useful. I am talking with Canonical now.
Unfortunately, I don't. It crashes on average once every day or two when running the ./build.sh script.
That's OK. Let's see if we can get a fix and then maybe deploy some early fixes.
Simple repro code for testing:
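using System;
using System.Linq;
using System.Runtime.InteropServices;
using System.Security.Cryptography;

// Register a native exit-time handler before the first crypto call, so that
// (handlers run in reverse registration order) it runs after any cleanup
// OpenSSL registers during its own initialization. Keep the delegate rooted
// so the GC cannot collect it before libc invokes it during exit().
Action handler = AtExitHandler;
GCHandle.Alloc(handler);
atexit(handler, IntPtr.Zero, IntPtr.Zero);

byte[] data = new byte[] { 0, 1, 2, 3, 4, 5 };
byte[] hashValue;
using (SHA256 sha256 = SHA256.Create())
{
    hashValue = sha256.ComputeHash(data);
}
Console.WriteLine($"hash: {ToHex(hashValue)}");

// libc.so.6 exports __cxa_atexit(func, arg, dso_handle) rather than atexit itself.
[DllImport("libc", EntryPoint = "__cxa_atexit", CallingConvention = CallingConvention.Cdecl)]
static extern int atexit(Action func, IntPtr arg, IntPtr dsoHandle);

// Runs after main() has returned: uses OpenSSL's RNG after its cleanup may have run.
static void AtExitHandler()
{
    byte[] randomBytes = new byte[16];
    RandomNumberGenerator.Fill(randomBytes);
    Console.WriteLine($"random: {ToHex(randomBytes)}");
}

static string ToHex(byte[] bytes)
{
    return string.Join("", bytes.Select((b) => b.ToString("X2")));
}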
In case the exported name of atexit differs on your system, you can find the correct one with:
nm -D `ldd \`which echo\` | grep libc | cut '--delimiter= ' -f 3` | grep 'atexit\>' | cut '--delimiter= ' -f 3
# for me prints: __cxa_atexit
Properly, we should build a small .so that wraps atexit, since atexit is only source-compatible (it is not exported from libc.so.6 itself, and __cxa_atexit is documented as taking two extra arguments). But this is meant as a simple demonstration of the issue; adjust as needed.
We are in the late stages of getting Canonical to publish a fix in Ubuntu 18.04 via their ESM program. I believe the easiest way to access that is via Ubuntu Pro.
The fix has been released in libssl package version 1.1.1-1ubuntu2.1~18.04.23+esm1.
Here are my repro steps to acquire that package: https://gist.github.com/richlander/47333cbf90ee0ee3f51bcb0dbbb3a76f?permalink_comment_id=4676592#gistcomment-4676592
Now and then our build agent produces broken builds. The error message reads:
##[error]Error: The process '/home/agent/agent/_work/_tool/dotnet/dotnet' failed with exit code null
The project is a .NET Core 3.1 Web API solution with about 30 projects and no unmanaged code at all.
The root cause is a segfault, as seen in dmesg.
Environment info:
Build agents are equipped with 2 vCPUs and 2 GB of memory.
dotnet --info is not available, as neither a runtime nor an SDK is installed; we're using the dotnet tool installer during the build. I have no idea how to debug this. I'd like to provide more info, but need assistance to do so.