dotnet / runtime

`dotnet build` intermittently crashes with segfault on Ubuntu 18.04 #48411

Open marcwittke opened 3 years ago

marcwittke commented 3 years ago

Now and then our build agent produces broken builds. The error message reads: ##[error]Error: The process '/home/agent/agent/_work/_tool/dotnet/dotnet' failed with exit code null

The project is a .NET Core 3.1 Web API solution with about 30 projects and no unmanaged code at all.

The root cause is a segfault, as seen in dmesg:

$ dmesg | grep dotnet

[17426.781072] dotnet[36429]: segfault at 18 ip 00007f9d65e87892 sp 00007f9d5e083bb0 error 4 in libpthread-2.27.so[7f9d65e7b000+1a000]
[1418646.055501] dotnet[36089]: segfault at 18 ip 00007f345cea9892 sp 00007f33b9703eb0 error 4 in libpthread-2.27.so[7f345ce9d000+1a000]
[2246615.917135] dotnet[87465]: segfault at 18 ip 00007fd998396382 sp 00007fd98fd373a0 error 4 in libpthread-2.27.so[7fd99838a000+1a000]
[2362725.938722] dotnet[21158]: segfault at 18 ip 00007fe8ee98a892 sp 00007fe8e637ee00 error 4 in libpthread-2.27.so[7fe8ee97e000+1a000]
[2432991.847286] dotnet[48481]: segfault at 18 ip 00007f7ac18e8892 sp 00007f7a46173b00 error 4 in libpthread-2.27.so[7f7ac18dc000+1a000]
[2704555.425939] dotnet[88757]: segfault at 18 ip 00007fe0bc6bb892 sp 00007fe0b48b4ae0 error 4 in libpthread-2.27.so[7fe0bc6af000+1a000]
[2846996.143322] dotnet[107654]: segfault at 18 ip 00007fad287ea892 sp 00007facad075b00 error 4 in libpthread-2.27.so[7fad287de000+1a000]
[2853616.129105] dotnet[15803]: segfault at 18 ip 00007f72657db892 sp 00007f725d1cfb00 error 4 in libpthread-2.27.so[7f72657cf000+1a000]
[3496394.984178] dotnet[59923]: segfault at 18 ip 00007f5d8ffe7892 sp 00007f5d889e1b00 error 4 in libpthread-2.27.so[7f5d8ffdb000+1a000]
[3630179.291391] dotnet[98248]: segfault at 18 ip 00007f8d8079a892 sp 00007f8d78993e00 error 4 in libpthread-2.27.so[7f8d8078e000+1a000]
[3633549.092183] dotnet[101217]: segfault at 18 ip 00007f617d49a892 sp 00007f60d9ce7e00 error 4 in libpthread-2.27.so[7f617d48e000+1a000]
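
Each of these lines carries the faulting instruction pointer (ip) and the load base of libpthread-2.27.so, so subtracting one from the other gives the offset inside the library, which is easier to compare across runs than the raw addresses. The following is a minimal sketch of that arithmetic, assuming the usual x86-64 Linux `segfault at ... ip ... error N in lib[base+size]` line format; the helper names `fault_offset` and `decode_error` are hypothetical and not part of any tool mentioned here.

```python
import re

# Matches kernel log lines such as:
#   dotnet[36429]: segfault at 18 ip 00007f9d65e87892 sp 00007f9d5e083bb0
#   error 4 in libpthread-2.27.so[7f9d65e7b000+1a000]
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (library, offset of ip inside the library), or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return m["lib"], int(m["ip"], 16) - int(m["base"], 16)


def decode_error(code):
    """Decode the low three bits of the x86 page-fault error code."""
    return {
        "page_present": bool(code & 1),  # 0 means the page was not mapped
        "write": bool(code & 2),         # 0 means the access was a read
        "user_mode": bool(code & 4),     # 1 means the fault came from user space
    }


line = (
    "[17426.781072] dotnet[36429]: segfault at 18 ip 00007f9d65e87892 "
    "sp 00007f9d5e083bb0 error 4 in libpthread-2.27.so[7f9d65e7b000+1a000]"
)
lib, offset = fault_offset(line)
print(lib, hex(offset))  # libpthread-2.27.so 0xc892
print(decode_error(4))   # user-mode read of an unmapped page
```

Applied to the eleven lines above, ten resolve to offset 0xc892 and the third resolves to 0xc382. "error 4" together with fault address 0x18 is consistent with a user-mode read through a pointer that is NULL plus a small field offset, although the log lines alone cannot confirm that.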

Environment info:

NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Build agents are equipped with 2 vCPUs and 2 GB of memory.

dotnet --info is not available, as there is no runtime nor SDK installed. We're using the dotnet tool installer during build:

Tool to install: .NET Core sdk version 3.1.x.
Found version 3.1.405 in channel 3.1 for user specified version spec: 3.1.x
Version: 3.1.405 was found in cache.
Creating global tool path and pre-pending to PATH.

I have no idea how to debug this. I'd like to provide more info, but need assistance to do so.

NecatiMeral commented 3 years ago

I'm experiencing this on a self-hosted Azure DevOps build agent, which fails randomly on dotnet commands on .NET 5.0:

[4562346.461844] .NET ThreadPool[870598]: segfault at 18 ip 00007f813e20c892 sp 00007f81281e5000 error 4 in libpthread-2.27.so[7f813e200000+1a000]
[4586429.064024] .NET ThreadPool[1032434]: segfault at 18 ip 00007f6a7b94f892 sp 00007f69ca7f8ba0 error 4 in libpthread-2.27.so[7f6a7b943000+1a000]
[4588177.547456] .NET ThreadPool[1063988]: segfault at 18 ip 00007f06d8288892 sp 00007f062cfaf9e0 error 4 in libpthread-2.27.so[7f06d827c000+1a000]

Dotnet gets installed on the agent by the installer task:

2021-01-26T15:08:21.4116924Z Found version 5.0.100 in channel "5.0" for user specified version spec: 5.0.100
2021-01-26T15:08:21.5900281Z Getting URL to download .NET Core sdk version: 5.0.100.
2021-01-26T15:08:21.5937280Z Detecting OS platform to find correct download package for the OS.
2021-01-26T15:08:21.5958925Z [command]/azp/agent/_work/_tasks/UseDotNet_b0ce7256-7898-45d3-9cb5-176b752bfea6/2.169.2/externals/get-os-distro.sh
2021-01-26T15:08:21.5960531Z Primary:linux-x64
2021-01-26T15:08:21.5961709Z Legacy:ubuntu.18.04-x64
2021-01-26T15:08:21.5963010Z Detected platform (Primary): linux-x64
2021-01-26T15:08:21.5964368Z Detected platform (Legacy): ubuntu.18.04-x64
2021-01-26T15:08:21.5967575Z Version: 5.0.100 was found in cache.
2021-01-26T15:08:21.5981248Z Creating global tool path and pre-pending to PATH.

wli3 commented 3 years ago

> dotnet --info is not available, as there is no runtime nor SDK installed. We're using the dotnet tool installer during build:

That is a bit odd. @marcwittke, could you run dotnet --info as part of the build, after the SDK is installed on the build agent?

wli3 commented 3 years ago

@vitek-karas does it ring a bell?

marcwittke commented 3 years ago

sure:

.NET Core SDK (reflecting any global.json):
 Version:   3.1.405
 Commit:    65f9d75b1c

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  18.04
 OS Platform: Linux
 RID:         ubuntu.18.04-x64
 Base Path:   /home/agent/agent/_work/_tool/dotnet/sdk/3.1.405/

Host (useful for support):
  Version: 3.1.11
  Commit:  f5eceb8105

.NET Core SDKs installed:
  2.1.805 [/home/agent/agent/_work/_tool/dotnet/sdk]
  3.1.100 [/home/agent/agent/_work/_tool/dotnet/sdk]
  3.1.404 [/home/agent/agent/_work/_tool/dotnet/sdk]
  3.1.405 [/home/agent/agent/_work/_tool/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.17 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.17 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.0 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.10 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.11 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.17 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.0 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.10 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.11 [/home/agent/agent/_work/_tool/dotnet/shared/Microsoft.NETCore.App]

well, a cleanup wouldn't be bad... Is it safe to delete the _tool folder?

vitek-karas commented 3 years ago

@wli3 Nope - I don't remember anything like this. Maybe @janvorli would know, or at least know who to send this to. A crash dump would be ideal, but I don't know how to get one on Linux in an automated job.
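
One possible way to collect a dump from an unattended job is the runtime's documented crash-dump environment variables. The sketch below only builds the environment and launches a command with it. It assumes the `COMPlus_` prefix that applies to .NET Core 3.1 (newer runtimes also accept `DOTNET_`), the helper names `crash_dump_env` and `run_with_dumps` are hypothetical, and whether a dump is actually written for a crash this late in process exit is untested here.

```python
import os
import subprocess


def crash_dump_env(base_env=None):
    """Copy an environment and add the runtime's crash-dump switches."""
    env = dict(os.environ if base_env is None else base_env)
    # Ask the runtime to invoke createdump when the process crashes.
    env["COMPlus_DbgEnableMiniDump"] = "1"
    # Dump type per the .NET docs: 1 = mini, 2 = heap, 3 = triage, 4 = full.
    env["COMPlus_DbgMiniDumpType"] = "2"
    return env


def run_with_dumps(cmd):
    """Run cmd with dumps enabled and return its exit status.

    On POSIX a negative return code -N means the child was killed by
    signal N, for example -11 for SIGSEGV.
    """
    return subprocess.run(cmd, env=crash_dump_env()).returncode


# Example call (not executed here):
#   run_with_dumps(["dotnet", "build", "--configuration", "release"])
```

On a 2 GB agent the dump type is worth choosing with care, since a full dump can be large.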

adam230594 commented 3 years ago

I'm experiencing this on a self-hosted Azure DevOps build agent, which fails randomly on dotnet commands on .NET Core 3.1 projects:

/usr/bin/dotnet build /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister.UnitTests/SFA.DAS.EpaoRegister.UnitTests.csproj -dl:CentralLogger,"/azp/agent/_work/_tasks/DotNetCoreCLI_5541a522-603c-47ad-91fc-a4b1d163081b/2.181.0/dotnet-build-helpers/Microsoft.TeamFoundation.DistributedTask.MSBuild.Logger.dll"*ForwardingLogger,"/azp/agent/_work/_tasks/DotNetCoreCLI_5541a522-603c-47ad-91fc-a4b1d163081b/2.181.0/dotnet-build-helpers/Microsoft.TeamFoundation.DistributedTask.MSBuild.Logger.dll" --configuration release --no-restore
Microsoft (R) Build Engine version 16.7.2+b60ddb6f4 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.

  SFA.DAS.SharedOuterApi -> /azp/agent/_work/1/s/src/SFA.DAS.SharedOuterApi/bin/release/netcoreapp3.1/SFA.DAS.SharedOuterApi.dll
  SFA.DAS.EpaoRegister -> /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister/bin/release/netcoreapp3.1/SFA.DAS.EpaoRegister.dll
  SFA.DAS.EpaoRegister.UnitTests -> /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister.UnitTests/bin/release/netcoreapp3.1/SFA.DAS.EpaoRegister.UnitTests.dll

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:01.15
/usr/bin/dotnet build /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister/SFA.DAS.EpaoRegister.csproj -dl:CentralLogger,"/azp/agent/_work/_tasks/DotNetCoreCLI_5541a522-603c-47ad-91fc-a4b1d163081b/2.181.0/dotnet-build-helpers/Microsoft.TeamFoundation.DistributedTask.MSBuild.Logger.dll"*ForwardingLogger,"/azp/agent/_work/_tasks/DotNetCoreCLI_5541a522-603c-47ad-91fc-a4b1d163081b/2.181.0/dotnet-build-helpers/Microsoft.TeamFoundation.DistributedTask.MSBuild.Logger.dll" --configuration release --no-restore
Microsoft (R) Build Engine version 16.7.2+b60ddb6f4 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.

  SFA.DAS.SharedOuterApi -> /azp/agent/_work/1/s/src/SFA.DAS.SharedOuterApi/bin/release/netcoreapp3.1/SFA.DAS.SharedOuterApi.dll
  SFA.DAS.EpaoRegister -> /azp/agent/_work/1/s/src/SFA.DAS.EpaoRegister/bin/release/netcoreapp3.1/SFA.DAS.EpaoRegister.dll

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:00.78
##[error]Error: The process '/usr/bin/dotnet' failed with exit code null

dotnet --info

root@azure-pipelines-build-agent-75ddfbcc4d-4ntn5:/azp# dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   3.1.405
 Commit:    3fae16e62e

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  18.04
 OS Platform: Linux
 RID:         ubuntu.18.04-x64
 Base Path:   /usr/share/dotnet/sdk/3.1.405/

Host (useful for support):
  Version: 3.1.11
  Commit:  f5eceb8105

.NET Core SDKs installed:
  2.2.207 [/usr/share/dotnet/sdk]
  3.1.405 [/usr/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.2.8 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.2.8 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.11 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.2.8 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.11 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

The build succeeds, but because the process returns with exit code null, the build pipeline step fails.

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 3 years ago

Tagging subscribers to this area: @vitek-karas, @agocke See info in area-owners.md if you want to be subscribed.

Issue Details
Author: marcwittke
Assignees: -
Labels: `area-Host`, `untriaged`
Milestone: -

BrennanConroy commented 3 years ago

AspNetCore is hitting an issue that looks very similar to this. We run some tests then call Environment.Exit(0); and are hitting a segfault.

We have a crash dump at https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-aspnetcore-refs-heads-main-bd6750238a114336b0/Microsoft.AspNetCore.Localization.Tests--net6.0/core.1000.9653?sv=2019-07-07&se=2021-04-13T17%3A14%3A17Z&sr=c&sp=rl&sig=2cdAaIh4bXj5NtvyeG%2FSxJtayROazUADEGUgmDPsOJM%3D and https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-aspnetcore-refs-heads-main-5f556472c38b49a59c/Microsoft.AspNetCore.Mvc.Abstractions.Test--net6.0/core.1000.22884?sv=2019-07-07&se=2021-04-06T23%3A42%3A17Z&sr=c&sp=rl&sig=PiaeRVjWTySpvofLo3Yofn6EAf3RZMRV89VI1uoFXLA%3D

Both show the thread that segfaulted at an address that looks like it is in the address space of the libpthread-2.27.so module.

The dumps will be around for a week or two.

janvorli commented 3 years ago

I'll take a look at the dumps.

janvorli commented 3 years ago

@BrennanConroy what is the distro that the dumps came from?

BrennanConroy commented 3 years ago

Helix queue ubuntu.1804.amd64.open

For the first link: Runtime 6.0.0-preview.3.21167.1 Sdk 6.0.100-preview.3.21168.19

janvorli commented 3 years ago

What I can see in the dump is that the main thread has already exited and the crashing secondary thread is attempting to run some OpenSSL code and a lock address inside of libcrypto passed to CRYPTO_THREAD_write_lock is set to NULL. This sounds like the same issue as https://github.com/dotnet/runtime/issues/34231. Only that this time, it doesn't stem from the ERR_reason_error_string like in that issue, but from the following:

(lldb) clrstack -f
OS Thread Id: 0x25be (1)
        Child SP               IP Call Site
00007FA9E95108C0 00007FA9F1249892 libpthread.so.0!__pthread_rwlock_wrlock + 18
00007FA9E9510900 00007FA975A91989 libcrypto.so.1.1!CRYPTO_THREAD_write_lock + 9
00007FA9E9510910 00007FA975A53013 libcrypto.so.1.1!RAND_get_rand_method + 51
00007FA9E9510930 00007FA975A5333E libcrypto.so.1.1!RAND_priv_bytes + 14
00007FA9E9510950 00007FA9759759BD libcrypto.so.1.1!___lldb_unnamed_symbol375$$libcrypto.so.1.1 + 413
00007FA9E95109C0 00007FA975975B96 libcrypto.so.1.1!___lldb_unnamed_symbol376$$libcrypto.so.1.1 + 166
00007FA9E9510A10 00007FA975A0095B libcrypto.so.1.1!___lldb_unnamed_symbol984$$libcrypto.so.1.1 + 91
00007FA9E9510A50 00007FA9759BF41A libcrypto.so.1.1!___lldb_unnamed_symbol795$$libcrypto.so.1.1 + 906
00007FA9E9510AC0 00007FA9759BFD5D libcrypto.so.1.1!___lldb_unnamed_symbol796$$libcrypto.so.1.1 + 1229
00007FA9E9510BA0 00007FA9759BEDA4 libcrypto.so.1.1!EC_POINTs_mul + 324
00007FA9E9510C00 00007FA9759BEE10 libcrypto.so.1.1!EC_POINT_mul + 64
00007FA9E9510C40 00007FA9759C24DF libcrypto.so.1.1!___lldb_unnamed_symbol811$$libcrypto.so.1.1 + 175
00007FA9E9510CA0 00007FA9759BCD49 libcrypto.so.1.1!ECDH_compute_key + 89
00007FA9E9510D00 00007FA9759C18BC libcrypto.so.1.1!___lldb_unnamed_symbol802$$libcrypto.so.1.1 + 76
00007FA9E9510D20 00007FA9759C1A35 libcrypto.so.1.1!___lldb_unnamed_symbol803$$libcrypto.so.1.1 + 245
00007FA9E9510D80 00007FA975DA9317 libssl.so.1.1!___lldb_unnamed_symbol195$$libssl.so.1.1 + 343
00007FA9E9510DC0 00007FA975DCB304 libssl.so.1.1!___lldb_unnamed_symbol509$$libssl.so.1.1 + 1028
00007FA9E9510E10 00007FA975DC9157 libssl.so.1.1!___lldb_unnamed_symbol488$$libssl.so.1.1 + 1383
00007FA9E9510EE0 00007FA975DB54C4 libssl.so.1.1!SSL_do_handshake + 84
00007FA9E9510EE0 00007FA975DB54C4 libssl.so.1.1!SSL_do_handshake + 84
00007FA9E9510F20 00007FA97A6BB20E
00007FA9E9510F30                  [InlinedCallFrame: 00007fa9e9510f30] System.Net.Security.dll!Interop+Ssl.SslDoHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle)
00007FA9E9510F30                  [InlinedCallFrame: 00007fa9e9510f30] System.Net.Security.dll!Interop+Ssl.SslDoHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle)
00007FA9E9510F20 00007FA97A6BB20E System.Diagnostics.Process.dll!ILStubClass.IL_STUB_PInvoke(Microsoft.Win32.SafeHandles.SafeSslHandle) + 142
00007FA9E9510FC0 00007FA978D39EF2 System.Net.Security.dll!Interop+OpenSsl.DoSslHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle, System.ReadOnlySpan`1<Byte>, Byte[] ByRef, Int32 ByRef) + 130
00007FA9E9511020 00007FA978D39168 System.Net.Security.dll!System.Net.Security.SslStreamPal.HandshakeInternal(System.Net.Security.SafeFreeCredentials, System.Net.Security.SafeDeleteSslContext ByRef, System.ReadOnlySpan`1<Byte>, Byte[] ByRef, System.Net.Security.SslAuthenticationOptions) + 168
00007FA9E95110D0 00007FA978D3791A System.Net.Security.dll!System.Net.Security.SecureChannel.GenerateToken(System.ReadOnlySpan`1<Byte>, Byte[] ByRef) + 138
00007FA9E9511140 00007FA978D3770E System.Net.Security.dll!System.Net.Security.SecureChannel.NextMessage(System.ReadOnlySpan`1<Byte>) + 62
00007FA9E9511190 00007FA978D3ABA7 System.Net.Security.dll!System.Net.Security.SslStream.ProcessBlob(Int32) + 327
00007FA9E9511200 00007FA978D63E66 System.Net.Security.dll!System.Net.Security.SslStream+<ReceiveBlobAsync>d__172`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]].MoveNext() + 2230
00007FA9E95113D0 00007FA97A6BF2C0 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.__Canon, System.Private.CoreLib],[System.Net.Security.SslStream+<ReceiveBlobAsync>d__172`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].ExecutionContextCallback(System.Object) + 128 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 287]
00007FA9E9511410 00007FA97A6D2DF5 System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 149 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 208]
00007FA9E9511460 00007FA97A6BF0E0 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.__Canon, System.Private.CoreLib],[System.Net.Security.SslStream+<ReceiveBlobAsync>d__172`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].MoveNext(System.Threading.Thread) + 288 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 336]
00007FA9E95114E0 00007FA97A6BEF99 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.__Canon, System.Private.CoreLib],[System.Net.Security.SslStream+<ReceiveBlobAsync>d__172`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].MoveNext() + 25 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 302]
00007FA9E9511500 00007FA97A6D2FC6 System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean) + 214 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/TaskContinuation.cs @ 805]
00007FA9E9511540 00007FA97A6D2554 System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(System.Object) + 212 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 3472]
00007FA9E95115F0 00007FA9763E4970 System.Private.CoreLib.dll!System.Threading.Tasks.Task`1[[System.Int32, System.Private.CoreLib]].TrySetResult(Int32) + 144 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Future.cs @ 404]
00007FA9E9511620 00007FA9763E8BB6 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Int32, System.Private.CoreLib]].SetExistingTaskResult(System.Threading.Tasks.Task`1<Int32>, Int32) + 86 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 443]
00007FA9E9511650 00007FA9763E8D24 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[[System.Int32, System.Private.CoreLib]].SetResult(Int32) + 116 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncValueTaskMethodBuilderT.cs @ 67]
00007FA9E9511680 00007FA978D68248 System.Net.Security.dll!System.Net.Security.SslStream+<<FillHandshakeBufferAsync>g__InternalFillHandshakeBufferAsync|181_0>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]].MoveNext() + 488
00007FA9E9511730 00007FA97A6BEF5E System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[System.Net.Security.SslStream+<<FillHandshakeBufferAsync>g__InternalFillHandshakeBufferAsync|181_0>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].ExecutionContextCallback(System.Object) + 62 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 287]
00007FA9E9511750 00007FA97A6D2DF5 System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 149 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 208]
00007FA9E95117A0 00007FA97A6BEE29 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[System.Net.Security.SslStream+<<FillHandshakeBufferAsync>g__InternalFillHandshakeBufferAsync|181_0>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].MoveNext(System.Threading.Thread) + 217 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 336]
00007FA9E95117F0 00007FA97A6BED29 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[System.Net.Security.SslStream+<<FillHandshakeBufferAsync>g__InternalFillHandshakeBufferAsync|181_0>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]], System.Net.Security]].MoveNext() + 25 [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 302]
00007FA9E9511810 00007FA9762BA852 System.Private.CoreLib.dll!System.Threading.ThreadPool+<>c.<.cctor>b__82_0(System.Object) + 34 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 1055]
00007FA9E9511820 00007FA97A943A29 System.Net.Sockets.dll!System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation(System.Action`1<System.Object>, System.Object, Boolean, Boolean) + 361
00007FA9E9511870 00007FA97A9437E3 System.Net.Sockets.dll!System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.OnCompleted(System.Net.Sockets.SocketAsyncEventArgs) + 179
00007FA9E95118D0 00007FA97A95A6C3 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.OnCompletedInternal() + 83
00007FA9E95118F0 00007FA97A9446BE System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.FinishOperationAsyncSuccess(Int32, System.Net.Sockets.SocketFlags) + 46
00007FA9E9511910 00007FA97A945BB6 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.TransferCompletionCallbackCore(Int32, Byte[], Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError) + 54
00007FA9E9511940 00007FA97A945AF4 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+BufferMemoryReceiveOperation.InvokeCallback(Boolean) + 132
00007FA9E9511990 00007FA97A964B2B System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].ProcessAsyncOperation(System.__Canon) + 91
00007FA9E95119C0 00007FA97A945917 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+ReadOperation.System.Threading.IThreadPoolWorkItem.Execute() + 39
00007FA9E95119D0 00007FA97A944588 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext.HandleEvents(SocketEvents) + 120
00007FA9E9511A00 00007FA97A9444B1 System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute() + 129
00007FA9E9511A40 00007FA97A6D6EAC System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch() + 364 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 769]
00007FA9E9511AC0 00007FA9762CF8C8 System.Private.CoreLib.dll!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart() + 264 [/_/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs @ 58]
00007FA9E9511B80 00007FA9762B6028 System.Private.CoreLib.dll!System.Threading.Thread.StartCallback() + 104 [/_/src/coreclr/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 105]
00007FA9E9511BA0 00007FA9EFF60487 libcoreclr.so!___lldb_unnamed_symbol9589$$libcoreclr.so + 124
00007FA9E9511BC0 00007FA9EFDBF1CE libcoreclr.so!___lldb_unnamed_symbol4452$$libcoreclr.so + 254
00007FA9E9511C50 00007FA9EFDD0372 libcoreclr.so!___lldb_unnamed_symbol4638$$libcoreclr.so + 146
00007FA9E9511CA0 00007FA9EFD8680A libcoreclr.so!___lldb_unnamed_symbol3792$$libcoreclr.so + 330
00007FA9E9511CF0                  [DebuggerU2MCatchHandlerFrame: 00007fa9e9511cf0]
00007FA9E9511DC0 00007FA9EFD86E0D libcoreclr.so!___lldb_unnamed_symbol3793$$libcoreclr.so + 45
00007FA9E9511DF0 00007FA9EFDD044C libcoreclr.so!___lldb_unnamed_symbol4639$$libcoreclr.so + 188
00007FA9E9511E50 00007FA9F00F3B0E libcoreclr.so!___lldb_unnamed_symbol15450$$libcoreclr.so + 590
00007FA9E9511F00 00007FA9F12446DB libpthread.so.0!start_thread + 219
00007FA9E9511FC0 00007FA9F042A71F libc.so.6!__clone + 63

cc: @bartonjs

bartonjs commented 3 years ago

Given that Ubuntu 18.04 has explicitly removed support for NO_ATEXIT, I worry we'll end up just finding one intermittent problem after another. The previous fix assumed that everything other than the string table was graceful about post-exit calls, but apparently calls into the RNG hit a failure while trying to reinitialize it.

Feels like our choices are:

danmoseley commented 3 years ago

Is it feasible/useful to offer a change to OpenSSL?

Although perhaps this is a problem others might have to solve when interopping with a different native library that has similar expectations.

bartonjs commented 3 years ago

> Is it feasible/useful to offer a change to OpenSSL?

OpenSSL supports the scenario, and we opt into it (OPENSSL_INIT_NO_ATEXIT):

https://github.com/dotnet/runtime/blob/400311b032fe5d05b49fb6a813e4a5a60604d8dd/src/libraries/Native/Unix/System.Security.Cryptography.Native/openssl.c#L1287-L1301

The Ubuntu 18.04 build.... somewhere that I found before that I didn't write down and am having trouble finding again... explicitly removes support for that option.

danmoseley commented 3 years ago

Ah got it. And later versions - 20.04 etc?

danmoseley commented 3 years ago

> Change the shim to guard every function with an if-shutting-down-exit while using something like interlocked increment/decrement to notify the atexit handler that we can release the library for further shutdown

This seems like the only reasonable possibility. It seems like the next critical thing to know is whether this also affects 20.04+. That would make it more important to fix since presumably 20.04 or later is an option for many 18.04 customers.
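
To make the quoted idea concrete, here is a small model of such a guard. It is only a sketch under stated assumptions: the real shim is C code and would presumably use atomic increment/decrement, whereas this model uses a `threading.Condition`; the names `ShimGate` and `guarded` are hypothetical; and returning `None` after shutdown stands in for whatever "if-shutting-down-exit" would do in the real shim.

```python
import threading


class ShimGate:
    """Counts in-flight calls and lets shutdown wait until they drain."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._shutting_down = False

    def enter(self):
        """Try to start a call; refuse once shutdown has begun."""
        with self._cond:
            if self._shutting_down:
                return False
            self._in_flight += 1
            return True

    def leave(self):
        """Finish a call and wake the shutdown handler when none remain."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def begin_shutdown(self):
        """Block new calls, then wait for the in-flight ones to return."""
        with self._cond:
            self._shutting_down = True
            while self._in_flight:
                self._cond.wait()


def guarded(gate, func, *args):
    """Run func only if the gate is still open; otherwise do nothing."""
    if not gate.enter():
        return None
    try:
        return func(*args)
    finally:
        gate.leave()


gate = ShimGate()
print(guarded(gate, pow, 2, 10))  # 1024
gate.begin_shutdown()             # would run from the atexit handler
print(guarded(gate, pow, 2, 10))  # None: the call is refused after shutdown
```

The property being modeled is that the shutdown handler does not let the library be torn down while a guarded call is still inside it, and that no new call enters afterwards.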

@bartonjs do we know how to find that out? Here's what I have on my 20.04 machine with apt-get upgrade run:

dan@LAPTOP-P6UJDVTA:/usr$ file ./lib/x86_64-linux-gnu/libcrypto.so.1.1
./lib/x86_64-linux-gnu/libcrypto.so.1.1: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=d30abd770d1215fff0f9a0fa9f12b1de5b50da29, stripped
dan@LAPTOP-P6UJDVTA:/usr$ file ./lib/x86_64-linux-gnu/libssl.so.1.1
./lib/x86_64-linux-gnu/libssl.so.1.1: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=4ef02cf97dd73cb0a88495e6dbf584dd6aa5aa22, stripped

From the above info I'm unsure how to determine that.

danmoseley commented 3 years ago

Those SHAs aren't in the OpenSSL repo, and it's not clear where in https://launchpad.net/ubuntu to find the sources Ubuntu used.

Anyway I don't know what to look for.

bartonjs commented 3 years ago

https://packages.ubuntu.com/source/focal/openssl says that Focal is based on OpenSSL 1.1.1f (plus servicing patches), and in 1.1.1f the source looked like

https://github.com/openssl/openssl/blob/36eadf1f84daa965041cce410b4ff32cbda4ef08/crypto/init.c#L620-L656

So if you do something like

$ lldb /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
(lldb) target create "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1"
Current executable set to '/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1' (x86_64).
(lldb) dis -n OPENSSL_init_crypto
libcrypto.so.1.1`OPENSSL_init_crypto:
libcrypto.so.1.1[0x176bc0] <+0>:    pushq  %rbp
libcrypto.so.1.1[0x176bc1] <+1>:    pushq  %rbx
libcrypto.so.1.1[0x176bc2] <+2>:    movq   %rdi, %rbx
libcrypto.so.1.1[0x176bc5] <+5>:    subq   $0x8, %rsp
libcrypto.so.1.1[0x176bc9] <+9>:    movl   0x352c59(%rip), %eax
libcrypto.so.1.1[0x176bcf] <+15>:   testl  %eax, %eax
libcrypto.so.1.1[0x176bd1] <+17>:   je     0x176c18                  ; <+88>
libcrypto.so.1.1[0x176bd3] <+19>:   testl  $0x40000, %edi            ; imm = 0x40000
libcrypto.so.1.1[0x176bd9] <+25>:   je     0x176bf0                  ; <+48>
libcrypto.so.1.1[0x176bdb] <+27>:   xorl   %ebp, %ebp
libcrypto.so.1.1[0x176bdd] <+29>:   addq   $0x8, %rsp
libcrypto.so.1.1[0x176be1] <+33>:   movl   %ebp, %eax
libcrypto.so.1.1[0x176be3] <+35>:   popq   %rbx
libcrypto.so.1.1[0x176be4] <+36>:   popq   %rbp
libcrypto.so.1.1[0x176be5] <+37>:   retq
libcrypto.so.1.1[0x176be6] <+38>:   nopw   %cs:(%rax,%rax)
libcrypto.so.1.1[0x176bf0] <+48>:   leaq   0xc23cc(%rip), %rcx
libcrypto.so.1.1[0x176bf7] <+55>:   movl   $0x252, %r8d              ; imm = 0x252
libcrypto.so.1.1[0x176bfd] <+61>:   movl   $0x46, %edx
libcrypto.so.1.1[0x176c02] <+66>:   movl   $0x74, %esi
libcrypto.so.1.1[0x176c07] <+71>:   movl   $0xf, %edi
libcrypto.so.1.1[0x176c0c] <+76>:   callq  0x1580e0                  ; ERR_put_error
libcrypto.so.1.1[0x176c11] <+81>:   jmp    0x176bdb                  ; <+27>
libcrypto.so.1.1[0x176c13] <+83>:   nopl   (%rax,%rax)
libcrypto.so.1.1[0x176c18] <+88>:   movq   %rsi, %rbp
libcrypto.so.1.1[0x176c1b] <+91>:   leaq   0x352bee(%rip), %rdi
libcrypto.so.1.1[0x176c22] <+98>:   leaq   -0x2e9(%rip), %rsi        ; ___lldb_unnamed_symbol1395$$libcrypto.so.1.1
libcrypto.so.1.1[0x176c29] <+105>:  callq  0x1df9f0                  ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x176c2e] <+110>:  testl  %eax, %eax
libcrypto.so.1.1[0x176c30] <+112>:  je     0x176bdb                  ; <+27>
libcrypto.so.1.1[0x176c32] <+114>:  movl   0x352bd0(%rip), %eax
libcrypto.so.1.1[0x176c38] <+120>:  testl  %eax, %eax
libcrypto.so.1.1[0x176c3a] <+122>:  je     0x176bdb                  ; <+27>
libcrypto.so.1.1[0x176c3c] <+124>:  testl  $0x40000, %ebx            ; imm = 0x40000
libcrypto.so.1.1[0x176c42] <+130>:  je     0x176d70                  ; <+432>
libcrypto.so.1.1[0x176c48] <+136>:  testb  $0x1, %bl
libcrypto.so.1.1[0x176c4b] <+139>:  jne    0x176d10                  ; <+336>
libcrypto.so.1.1[0x176c51] <+145>:  testb  $0x2, %bl
libcrypto.so.1.1[0x176c54] <+148>:  jne    0x176d40                  ; <+384>
libcrypto.so.1.1[0x176c5a] <+154>:  testb  $0x10, %bl
libcrypto.so.1.1[0x176c5d] <+157>:  jne    0x176da0                  ; <+480>
libcrypto.so.1.1[0x176c63] <+163>:  testb  $0x4, %bl
libcrypto.so.1.1[0x176c66] <+166>:  jne    0x176dd0                  ; <+528>
libcrypto.so.1.1[0x176c6c] <+172>:  testb  $0x20, %bl
libcrypto.so.1.1[0x176c6f] <+175>:  jne    0x176e00                  ; <+576>
libcrypto.so.1.1[0x176c75] <+181>:  testb  $0x8, %bl
libcrypto.so.1.1[0x176c78] <+184>:  jne    0x176e2e                  ; <+622>
libcrypto.so.1.1[0x176c7e] <+190>:  testl  $0x20000, %ebx            ; imm = 0x20000

(from Ubuntu 18.04)

hopefully there'll be something that looks like it's doing a test for 0x80000. If so, the problem is just gone on 20.04.

I've previously said that Ubuntu "removed" the support. Looking again, I don't see a patch that removes the support... but I also don't see one that adds it. The OPENSSL_INIT_NO_ATEXIT support was backported for OpenSSL 1.1.1b. It looks like Ubuntu 18.04 is 1.1.1 (RTM) plus servicing, and their servicing did something other than "catch up to 1.1.1-stable".

danmoseley commented 3 years ago
Ubuntu 20.04 output

```
dan@LAPTOP-P6UJDVTA:/usr$ cat /etc/os-release | grep VERSION
VERSION="20.04.2 LTS (Focal Fossa)"
VERSION_ID="20.04"
VERSION_CODENAME=focal
```

```asm
(lldb) target create "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1"
Current executable set to '/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1' (x86_64).
(lldb) dis -n OPENSSL_init_crypto
libcrypto.so.1.1`OPENSSL_init_crypto:
libcrypto.so.1.1[0x177590] <+0>: endbr64
libcrypto.so.1.1[0x177594] <+4>: movl 0x15d296(%rip), %eax
libcrypto.so.1.1[0x17759a] <+10>: pushq %r12
libcrypto.so.1.1[0x17759c] <+12>: pushq %rbp
libcrypto.so.1.1[0x17759d] <+13>: pushq %rbx
libcrypto.so.1.1[0x17759e] <+14>: movq %rdi, %rbx
libcrypto.so.1.1[0x1775a1] <+17>: testl %eax, %eax
libcrypto.so.1.1[0x1775a3] <+19>: je 0x1775c0 ; <+48>
libcrypto.so.1.1[0x1775a5] <+21>: testl $0x40000, %edi ; imm = 0x40000
libcrypto.so.1.1[0x1775ab] <+27>: je 0x1776e8 ; <+344>
libcrypto.so.1.1[0x1775b1] <+33>: xorl %r12d, %r12d
libcrypto.so.1.1[0x1775b4] <+36>: movl %r12d, %eax
libcrypto.so.1.1[0x1775b7] <+39>: popq %rbx
libcrypto.so.1.1[0x1775b8] <+40>: popq %rbp
libcrypto.so.1.1[0x1775b9] <+41>: popq %r12
libcrypto.so.1.1[0x1775bb] <+43>: retq
libcrypto.so.1.1[0x1775bc] <+44>: nopl (%rax)
libcrypto.so.1.1[0x1775c0] <+48>: movq %rsi, %rbp
libcrypto.so.1.1[0x1775c3] <+51>: leaq 0x15d24e(%rip), %rdi
libcrypto.so.1.1[0x1775ca] <+58>: leaq -0x2d1(%rip), %rsi ; ___lldb_unnamed_symbol1509$$libcrypto.so.1.1
libcrypto.so.1.1[0x1775d1] <+65>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x1775d6] <+70>: testl %eax, %eax
libcrypto.so.1.1[0x1775d8] <+72>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1775da] <+74>: movl 0x15d230(%rip), %eax
libcrypto.so.1.1[0x1775e0] <+80>: testl %eax, %eax
libcrypto.so.1.1[0x1775e2] <+82>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1775e4] <+84>: movl $0x1, %r12d
libcrypto.so.1.1[0x1775ea] <+90>: testl $0x40000, %ebx ; imm = 0x40000
libcrypto.so.1.1[0x1775f0] <+96>: jne 0x1775b4 ; <+36>
libcrypto.so.1.1[0x1775f2] <+98>: testl $0x80000, %ebx ; imm = 0x80000
libcrypto.so.1.1[0x1775f8] <+104>: je 0x177718 ; <+392>
libcrypto.so.1.1[0x1775fe] <+110>: leaq -0x565(%rip), %rsi ; ___lldb_unnamed_symbol1491$$libcrypto.so.1.1
libcrypto.so.1.1[0x177605] <+117>: leaq 0x15d200(%rip), %rdi
libcrypto.so.1.1[0x17760c] <+124>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x177611] <+129>: testl %eax, %eax
libcrypto.so.1.1[0x177613] <+131>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177615] <+133>: movl 0x15d1ec(%rip), %r12d
libcrypto.so.1.1[0x17761c] <+140>: testl %r12d, %r12d
libcrypto.so.1.1[0x17761f] <+143>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177621] <+145>: leaq -0x578(%rip), %rsi ; ___lldb_unnamed_symbol1492$$libcrypto.so.1.1
libcrypto.so.1.1[0x177628] <+152>: leaq 0x15d1d5(%rip), %rdi
libcrypto.so.1.1[0x17762f] <+159>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x177634] <+164>: testl %eax, %eax
libcrypto.so.1.1[0x177636] <+166>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x17763c] <+172>: movl 0x15d1bd(%rip), %r11d
libcrypto.so.1.1[0x177643] <+179>: testl %r11d, %r11d
libcrypto.so.1.1[0x177646] <+182>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x17764c] <+188>: testb $0x1, %bl
libcrypto.so.1.1[0x17764f] <+191>: jne 0x177738 ; <+424>
libcrypto.so.1.1[0x177655] <+197>: testb $0x2, %bl
libcrypto.so.1.1[0x177658] <+200>: jne 0x177768 ; <+472>
libcrypto.so.1.1[0x17765e] <+206>: testb $0x10, %bl
libcrypto.so.1.1[0x177661] <+209>: jne 0x177798 ; <+520>
libcrypto.so.1.1[0x177667] <+215>: testb $0x4, %bl
libcrypto.so.1.1[0x17766a] <+218>: jne 0x1777c8 ; <+568>
libcrypto.so.1.1[0x177670] <+224>: testb $0x20, %bl
libcrypto.so.1.1[0x177673] <+227>: jne 0x1777f6 ; <+614>
libcrypto.so.1.1[0x177679] <+233>: testb $0x8, %bl
libcrypto.so.1.1[0x17767c] <+236>: jne 0x177824 ; <+660>
libcrypto.so.1.1[0x177682] <+242>: testl $0x20000, %ebx ; imm = 0x20000
libcrypto.so.1.1[0x177688] <+248>: jne 0x177852 ; <+706>
libcrypto.so.1.1[0x17768e] <+254>: testb $-0x80, %bl
libcrypto.so.1.1[0x177691] <+257>: jne 0x177864 ; <+724>
libcrypto.so.1.1[0x177697] <+263>: testb $0x40, %bl
libcrypto.so.1.1[0x17769a] <+266>: jne 0x177892 ; <+770>
libcrypto.so.1.1[0x1776a0] <+272>: testb $0x1, %bh
libcrypto.so.1.1[0x1776a3] <+275>: jne 0x1778df ; <+847>
libcrypto.so.1.1[0x1776a9] <+281>: testb $0x8, %bh
libcrypto.so.1.1[0x1776ac] <+284>: jne 0x17790d ; <+893>
libcrypto.so.1.1[0x1776b2] <+290>: testb $0x2, %bh
libcrypto.so.1.1[0x1776b5] <+293>: jne 0x17793a ; <+938>
libcrypto.so.1.1[0x1776bb] <+299>: testb $0x4, %bh
libcrypto.so.1.1[0x1776be] <+302>: jne 0x177991 ; <+1025>
libcrypto.so.1.1[0x1776c4] <+308>: testb $-0x2, %bh
libcrypto.so.1.1[0x1776c7] <+311>: jne 0x1779eb ; <+1115>
libcrypto.so.1.1[0x1776cd] <+317>: testl $0x10000, %ebx ; imm = 0x10000
libcrypto.so.1.1[0x1776d3] <+323>: jne 0x1779be ; <+1070>
libcrypto.so.1.1[0x1776d9] <+329>: movl $0x1, %r12d
libcrypto.so.1.1[0x1776df] <+335>: jmp 0x1775b4 ; <+36>
libcrypto.so.1.1[0x1776e4] <+340>: nopl (%rax)
libcrypto.so.1.1[0x1776e8] <+344>: xorl %r12d, %r12d
libcrypto.so.1.1[0x1776eb] <+347>: movl $0x270, %r8d ; imm = 0x270
libcrypto.so.1.1[0x1776f1] <+353>: movl $0x46, %edx
libcrypto.so.1.1[0x1776f6] <+358>: movl $0x74, %esi
libcrypto.so.1.1[0x1776fb] <+363>: leaq 0xc86d1(%rip), %rcx
libcrypto.so.1.1[0x177702] <+370>: movl $0xf, %edi
libcrypto.so.1.1[0x177707] <+375>: callq 0x157990 ; ERR_put_error
libcrypto.so.1.1[0x17770c] <+380>: movl %r12d, %eax
libcrypto.so.1.1[0x17770f] <+383>: popq %rbx
libcrypto.so.1.1[0x177710] <+384>: popq %rbp
libcrypto.so.1.1[0x177711] <+385>: popq %r12
libcrypto.so.1.1[0x177713] <+387>: retq
libcrypto.so.1.1[0x177714] <+388>: nopl (%rax)
libcrypto.so.1.1[0x177718] <+392>: leaq -0x44f(%rip), %rsi ; ___lldb_unnamed_symbol1508$$libcrypto.so.1.1
libcrypto.so.1.1[0x17771f] <+399>: leaq 0x15d0e6(%rip), %rdi
libcrypto.so.1.1[0x177726] <+406>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x17772b] <+411>: testl %eax, %eax
libcrypto.so.1.1[0x17772d] <+413>: jne 0x177615 ; <+133>
libcrypto.so.1.1[0x177733] <+419>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177738] <+424>: leaq -0x67f(%rip), %rsi ; ___lldb_unnamed_symbol1493$$libcrypto.so.1.1
libcrypto.so.1.1[0x17773f] <+431>: leaq 0x15d0b6(%rip), %rdi
libcrypto.so.1.1[0x177746] <+438>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x17774b] <+443>: testl %eax, %eax
libcrypto.so.1.1[0x17774d] <+445>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177753] <+451>: movl 0x15d09a(%rip), %r10d
libcrypto.so.1.1[0x17775a] <+458>: testl %r10d, %r10d
libcrypto.so.1.1[0x17775d] <+461>: jne 0x177655 ; <+197>
libcrypto.so.1.1[0x177763] <+467>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177768] <+472>: leaq -0x4cf(%rip), %rsi ; ___lldb_unnamed_symbol1507$$libcrypto.so.1.1
libcrypto.so.1.1[0x17776f] <+479>: leaq 0x15d086(%rip), %rdi
libcrypto.so.1.1[0x177776] <+486>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x17777b] <+491>: testl %eax, %eax
libcrypto.so.1.1[0x17777d] <+493>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177783] <+499>: movl 0x15d06a(%rip), %r9d
libcrypto.so.1.1[0x17778a] <+506>: testl %r9d, %r9d
libcrypto.so.1.1[0x17778d] <+509>: jne 0x17765e ; <+206>
libcrypto.so.1.1[0x177793] <+515>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177798] <+520>: leaq -0x6cf(%rip), %rsi ; ___lldb_unnamed_symbol1494$$libcrypto.so.1.1
libcrypto.so.1.1[0x17779f] <+527>: leaq 0x15d04a(%rip), %rdi
libcrypto.so.1.1[0x1777a6] <+534>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x1777ab] <+539>: testl %eax, %eax
libcrypto.so.1.1[0x1777ad] <+541>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1777b3] <+547>: movl 0x15d032(%rip), %r8d
libcrypto.so.1.1[0x1777ba] <+554>: testl %r8d, %r8d
libcrypto.so.1.1[0x1777bd] <+557>: jne 0x177667 ; <+215>
libcrypto.so.1.1[0x1777c3] <+563>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1777c8] <+568>: leaq -0x54f(%rip), %rsi ; ___lldb_unnamed_symbol1506$$libcrypto.so.1.1
libcrypto.so.1.1[0x1777cf] <+575>: leaq 0x15d01a(%rip), %rdi
libcrypto.so.1.1[0x1777d6] <+582>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x1777db] <+587>: testl %eax, %eax
libcrypto.so.1.1[0x1777dd] <+589>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1777e3] <+595>: movl 0x15d003(%rip), %edi
libcrypto.so.1.1[0x1777e9] <+601>: testl %edi, %edi
libcrypto.so.1.1[0x1777eb] <+603>: jne 0x177670 ; <+224>
libcrypto.so.1.1[0x1777f1] <+609>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1777f6] <+614>: leaq -0x71d(%rip), %rsi ; ___lldb_unnamed_symbol1495$$libcrypto.so.1.1
libcrypto.so.1.1[0x1777fd] <+621>: leaq 0x15cfe4(%rip), %rdi
libcrypto.so.1.1[0x177804] <+628>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x177809] <+633>: testl %eax, %eax
libcrypto.so.1.1[0x17780b] <+635>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177811] <+641>: movl 0x15cfcd(%rip), %esi
libcrypto.so.1.1[0x177817] <+647>: testl %esi, %esi
libcrypto.so.1.1[0x177819] <+649>: jne 0x177679 ; <+233>
libcrypto.so.1.1[0x17781f] <+655>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177824] <+660>: leaq -0x5cb(%rip), %rsi ; ___lldb_unnamed_symbol1505$$libcrypto.so.1.1
libcrypto.so.1.1[0x17782b] <+667>: leaq 0x15cfb6(%rip), %rdi
libcrypto.so.1.1[0x177832] <+674>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x177837] <+679>: testl %eax, %eax
libcrypto.so.1.1[0x177839] <+681>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x17783f] <+687>: movl 0x15cf9f(%rip), %ecx
libcrypto.so.1.1[0x177845] <+693>: testl %ecx, %ecx
libcrypto.so.1.1[0x177847] <+695>: jne 0x177682 ; <+242>
libcrypto.so.1.1[0x17784d] <+701>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177852] <+706>: callq 0x1e2850 ; ___lldb_unnamed_symbol1948$$libcrypto.so.1.1
libcrypto.so.1.1[0x177857] <+711>: testl %eax, %eax
libcrypto.so.1.1[0x177859] <+713>: jne 0x17768e ; <+254>
libcrypto.so.1.1[0x17785f] <+719>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177864] <+724>: leaq -0x62b(%rip), %rsi ; ___lldb_unnamed_symbol1504$$libcrypto.so.1.1
libcrypto.so.1.1[0x17786b] <+731>: leaq 0x15cf6e(%rip), %rdi
libcrypto.so.1.1[0x177872] <+738>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x177877] <+743>: testl %eax, %eax
libcrypto.so.1.1[0x177879] <+745>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x17787f] <+751>: movl 0x15cf4b(%rip), %edx
libcrypto.so.1.1[0x177885] <+757>: testl %edx, %edx
libcrypto.so.1.1[0x177887] <+759>: jne 0x177697 ; <+263>
libcrypto.so.1.1[0x17788d] <+765>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177892] <+770>: movq 0x15cf87(%rip), %rdi
libcrypto.so.1.1[0x177899] <+777>: callq 0x1e2700 ; CRYPTO_THREAD_write_lock
libcrypto.so.1.1[0x17789e] <+782>: leaq -0x685(%rip), %rsi ; ___lldb_unnamed_symbol1503$$libcrypto.so.1.1
libcrypto.so.1.1[0x1778a5] <+789>: leaq 0x15cf34(%rip), %rdi
libcrypto.so.1.1[0x1778ac] <+796>: movq %rbp, 0x15cf25(%rip)
libcrypto.so.1.1[0x1778b3] <+803>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x1778b8] <+808>: movl %eax, %r12d
libcrypto.so.1.1[0x1778bb] <+811>: testl %eax, %eax
libcrypto.so.1.1[0x1778bd] <+813>: jne 0x177967 ; <+983>
libcrypto.so.1.1[0x1778c3] <+819>: movq 0x15cf56(%rip), %rdi
libcrypto.so.1.1[0x1778ca] <+826>: movq $0x0, 0x15cf03(%rip)
libcrypto.so.1.1[0x1778d5] <+837>: callq 0x1e2720 ; CRYPTO_THREAD_unlock
libcrypto.so.1.1[0x1778da] <+842>: jmp 0x1775b4 ; <+36>
libcrypto.so.1.1[0x1778df] <+847>: leaq -0x6f6(%rip), %rsi ; ___lldb_unnamed_symbol1502$$libcrypto.so.1.1
libcrypto.so.1.1[0x1778e6] <+854>: leaq 0x15cedf(%rip), %rdi
libcrypto.so.1.1[0x1778ed] <+861>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x1778f2] <+866>: testl %eax, %eax
libcrypto.so.1.1[0x1778f4] <+868>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1778fa] <+874>: movl 0x15cec4(%rip), %eax
libcrypto.so.1.1[0x177900] <+880>: testl %eax, %eax
libcrypto.so.1.1[0x177902] <+882>: jne 0x1776a9 ; <+281>
libcrypto.so.1.1[0x177908] <+888>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x17790d] <+893>: leaq -0x744(%rip), %rsi ; ___lldb_unnamed_symbol1501$$libcrypto.so.1.1
libcrypto.so.1.1[0x177914] <+900>: leaq 0x15cea5(%rip), %rdi
libcrypto.so.1.1[0x17791b] <+907>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x177920] <+912>: testl %eax, %eax
libcrypto.so.1.1[0x177922] <+914>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177928] <+920>: cmpl $0x0, 0x15ce8d(%rip)
libcrypto.so.1.1[0x17792f] <+927>: jne 0x1776b2 ; <+290>
libcrypto.so.1.1[0x177935] <+933>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x17793a] <+938>: leaq -0x791(%rip), %rsi ; ___lldb_unnamed_symbol1500$$libcrypto.so.1.1
libcrypto.so.1.1[0x177941] <+945>: leaq 0x15ce70(%rip), %rdi
libcrypto.so.1.1[0x177948] <+952>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x17794d] <+957>: testl %eax, %eax
libcrypto.so.1.1[0x17794f] <+959>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177955] <+965>: cmpl $0x0, 0x15ce58(%rip)
libcrypto.so.1.1[0x17795c] <+972>: jne 0x1776bb ; <+299>
libcrypto.so.1.1[0x177962] <+978>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177967] <+983>: movl 0x15ce63(%rip), %ebp
libcrypto.so.1.1[0x17796d] <+989>: movq 0x15ceac(%rip), %rdi
libcrypto.so.1.1[0x177974] <+996>: movq $0x0, 0x15ce59(%rip)
libcrypto.so.1.1[0x17797f] <+1007>: callq 0x1e2720 ; CRYPTO_THREAD_unlock
libcrypto.so.1.1[0x177984] <+1012>: testl %ebp, %ebp
libcrypto.so.1.1[0x177986] <+1014>: jg 0x1776a0 ; <+272>
libcrypto.so.1.1[0x17798c] <+1020>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x177991] <+1025>: leaq -0x808(%rip), %rsi ; ___lldb_unnamed_symbol1499$$libcrypto.so.1.1
libcrypto.so.1.1[0x177998] <+1032>: leaq 0x15ce11(%rip), %rdi
libcrypto.so.1.1[0x17799f] <+1039>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x1779a4] <+1044>: testl %eax, %eax
libcrypto.so.1.1[0x1779a6] <+1046>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1779ac] <+1052>: cmpl $0x0, 0x15cdf9(%rip)
libcrypto.so.1.1[0x1779b3] <+1059>: jne 0x1776c4 ; <+308>
libcrypto.so.1.1[0x1779b9] <+1065>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1779be] <+1070>: leaq -0x8d5(%rip), %rsi ; ___lldb_unnamed_symbol1496$$libcrypto.so.1.1
libcrypto.so.1.1[0x1779c5] <+1077>: leaq 0x15cddc(%rip), %rdi
libcrypto.so.1.1[0x1779cc] <+1084>: callq 0x1e2780 ; CRYPTO_THREAD_run_once
libcrypto.so.1.1[0x1779d1] <+1089>: testl %eax, %eax
libcrypto.so.1.1[0x1779d3] <+1091>: je 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1779d9] <+1097>: cmpl $0x0, 0x15cdc0(%rip)
libcrypto.so.1.1[0x1779e0] <+1104>: jne 0x1776d9 ; <+329>
libcrypto.so.1.1[0x1779e6] <+1110>: jmp 0x1775b1 ; <+33>
libcrypto.so.1.1[0x1779eb] <+1115>: callq 0x1535f0 ; ENGINE_register_all_complete
libcrypto.so.1.1[0x1779f0] <+1120>: jmp 0x1776cd ; <+317>
(lldb)
```

@bartonjs I see a test against 0x80000 ...

bartonjs commented 3 years ago

::kermit arms:: Yaaaaaaay!

Looks like we won't have a problem on 20.04. Hopefully that's enough to avoid needing to add our own locking/refcounting/whatever to literally every shim method.

danmoseley commented 3 years ago

@marcwittke is it possible for you to try on Ubuntu 20.04 or later? We think that will fix it. It is an issue in the libcrypto on 18.04.

marcwittke commented 3 years ago

I think so. We have two agents running right now on 18.04. I'll update one of them to 20.04 and let's see. Since it's intermittent, I think in about a week I can tell you whether it helped or not.

NecatiMeral commented 3 years ago

We've upgraded our build agents to Ubuntu 20.04 a week ago (after @danmoseley's reply) and so far we haven't experienced this error anymore.

danmoseley commented 3 years ago

Great, when @marcwittke can confirm also, we can close this.

marcwittke commented 3 years ago

Seems to be fixed in Ubuntu 20.04; we have had no more segfaults on our upgraded build agent.

BrennanConroy commented 3 years ago

So to be clear, 18.04 is still buggy and 20.04 is fixed?

bartonjs commented 3 years ago

Yeah. The problem is that the build of OpenSSL on Ubuntu 18.04 doesn't respect the OPENSSL_INIT_NO_ATEXIT flag, so it starts tearing down OpenSSL locks and statics when the main thread exits, but .NET Background Threads can still be calling into OpenSSL.

The 20.04 build has NO_ATEXIT support.
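
To make the sequence of events concrete, here is a toy model of the behaviour described above. It is not OpenSSL's implementation: the class, its methods, and the "crash" string are all invented for illustration, and the flag value is the one from the disassembly discussion earlier in this thread.

```python
NO_ATEXIT = 0x80000  # flag value assumed from the disassembly discussion


class ToyCrypto:
    """Toy stand-in for a native library that owns one global lock."""

    def __init__(self, honors_no_atexit: bool):
        self.honors_no_atexit = honors_no_atexit
        self.lock = object()     # stands in for a static rwlock
        self.exit_handlers = []  # stands in for the process atexit list

    def init(self, flags: int) -> None:
        # A build that knows the flag skips registering the cleanup handler.
        skip = self.honors_no_atexit and bool(flags & NO_ATEXIT)
        if not skip:
            self.exit_handlers.append(self._cleanup)

    def _cleanup(self) -> None:
        self.lock = None  # teardown frees the lock

    def process_exit(self) -> None:
        for handler in self.exit_handlers:
            handler()

    def rand_bytes(self) -> str:
        if self.lock is None:
            return "crash: lock used after teardown"
        return "ok"


def background_call_after_exit(honors_no_atexit: bool) -> str:
    lib = ToyCrypto(honors_no_atexit)
    lib.init(NO_ATEXIT)
    lib.process_exit()       # main thread exits, handlers run
    return lib.rand_bytes()  # a background thread is still calling in
```

With `honors_no_atexit=False` (the 18.04 situation in this model) the late call hits the torn-down lock; with `True` (the 20.04 situation) no handler was registered, so the lock is still there.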

danmoseley commented 2 years ago

I'm not sure we'd arrived at consensus that we wouldn't take a fix here ...18.04 is supported until 2028 and we'll presumably support it in .NET 7. This is also causing our automated tests to crash periodically.

I'll leave this open for other customers to comment on the impact. But the recommendation above remains to move to 20.04 if affected.

hoyosjs commented 2 years ago

As seen in https://github.com/dotnet/sdk/pull/22872#issuecomment-988636583

This seems to be https://github.com/dotnet/runtime/issues/48411 which happens on 18.04 as seen here.

The stack:

00 00007f0e`d15f7c10 00007f0f`305b5959     libpthread_2_27!pthread_rwlock_wrlock+0x12
01 00007f0e`d15f7c50 00007f0f`30577013     libcrypto_so_1!CRYPTO_THREAD_write_lock+0x9
02 00007f0e`d15f7c60 00007f0f`305772f0     libcrypto_so_1!RAND_get_rand_method+0x33
03 00007f0e`d15f7c80 00007f0f`3053449f     libcrypto_so_1!RAND_bytes+0x10
04 00007f0e`d15f7ca0 00007f0f`30542a97     libcrypto_so_1!EVP_MD_CTX_ctrl+0x132f
05 00007f0e`d15f7cd0 00007f0f`a31a5804     libcrypto_so_1!EVP_CIPHER_CTX_ctrl+0x17
06 00007f0e`d15f7ce0 00007f0f`a319761a     libssl_so_1!SSL_in_before+0x13bd4
07 00007f0e`d15f7e30 00007f0f`a3192006     libssl_so_1!SSL_in_before+0x59ea
08 00007f0e`d15f7e40 00007f0f`a317e4e4     libssl_so_1!SSL_in_before+0x3d6
09 00007f0e`d15f7f10 00007f0f`334590a0     libssl_so_1!SSL_do_handshake+0x54
0a 00007f0e`d15f7f50 00007f0f`334571b6     Interop+Ssl.<SslDoHandshake>g____PInvoke__|26_0(IntPtr)+0x40
0b 00007f0e`d15f7ff0 00007f0f`3345bdec     System_Net_Security!Interop+Ssl.SslDoHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle)+0x56 [/_/src/libraries/System.Net.Security/src/Microsoft.Interop.DllImportGenerator/Microsoft.Interop.DllImportGenerator/GeneratedDllImports.g.cs @ 3487] 
0c 00007f0e`d15f8030 00007f0f`3347a3e5     System_Net_Security!Interop+OpenSsl.DoSslHandshake(Microsoft.Win32.SafeHandles.SafeSslHandle, System.ReadOnlySpan`1<Byte>, Byte[] ByRef, Int32 ByRef)+0x8c [/_/src/libraries/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.OpenSsl.cs @ 338] 
0d 00007f0e`d15f80a0 00007f0f`33467c75     System_Net_Security!System.Net.Security.SslStreamPal.HandshakeInternal(System.Net.Security.SafeFreeCredentials, System.Net.Security.SafeDeleteSslContext ByRef, System.ReadOnlySpan`1<Byte>, Byte[] ByRef, System.Net.Security.SslAuthenticationOptions)+0xb5 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SslStreamPal.Unix.cs @ 161] 
0e 00007f0e`d15f8170 00007f0f`33467989     System_Net_Security!System.Net.Security.SecureChannel.GenerateToken(System.ReadOnlySpan`1<Byte>, Byte[] ByRef)+0x155 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SecureChannel.cs @ 803] 
0f 00007f0e`d15f8210 00007f0f`3346ced7     System_Net_Security!System.Net.Security.SecureChannel.NextMessage(System.ReadOnlySpan`1<Byte>)+0x39 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SecureChannel.cs @ 725] 
10 00007f0e`d15f8280 00007f0f`3347e742     System_Net_Security!System.Net.Security.SslStream.ProcessBlob(Int32)+0x157 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs @ 593] 
11 00007f0e`d15f8310 00000000`00000000     System_Net_Security!System.Net.Security.SslStream+<ReceiveBlobAsync>d__174`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]].MoveNext()+0x9a2 [/_/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs @ 555] 

pthread_rwlock_wrlock has the following disassembly from entry to faulting point:

libpthread_2_27!pthread_rwlock_wrlock:
00007f0f`ac2fc880 4157            push    r15
00007f0f`ac2fc882 4156            push    r14
00007f0f`ac2fc884 4155            push    r13
00007f0f`ac2fc886 4154            push    r12
00007f0f`ac2fc888 55              push    rbp
00007f0f`ac2fc889 53              push    rbx
00007f0f`ac2fc88a 4889fb          mov     rbx,rdi
00007f0f`ac2fc88d 4883ec08        sub     rsp,8
00007f0f`ac2fc891 90              nop
00007f0f`ac2fc892 8b5718          mov     edx,dword ptr [rdi+18h]

The segv is from reading RDI + 0x18 = 0x18. RBX and RDX are indeed 0. RDI in SysV is the first parameter passed, which is pthread_rwlock_t*. That's passed in from https://github.com/openssl/openssl/blob/b1553c89285cb05a28d185423bc3df9b505db92a/crypto/threads_pthread.c#L75-L86; called from RAND_get_rand_method with a C-static lock, rand_meth_lock, which doesn't support reinitialization in 18.04.
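
The address arithmetic is easy to sanity-check. In this sketch the struct is a hypothetical layout, chosen only so that one 32-bit field sits at offset 0x18; it is not copied from glibc's headers, and the field names are made up.

```python
import ctypes


class ToyRwlock(ctypes.Structure):
    # Hypothetical layout: six 32-bit words, then the word the faulting
    # instruction reads. Not taken from glibc headers.
    _fields_ = [(f"word{i}", ctypes.c_uint32) for i in range(6)] + [
        ("read_at_fault", ctypes.c_uint32),
    ]


def faulting_address(lock_pointer: int) -> int:
    """Address touched by `mov edx, dword ptr [rdi+18h]` for a given rdi."""
    return lock_pointer + ToyRwlock.read_at_fault.offset


if __name__ == "__main__":
    # A NULL lock pointer gives 0x18, matching "segfault at 18" in dmesg.
    print(hex(faulting_address(0)))
```

Any field read through a NULL struct pointer faults at the field's offset, which is why every dmesg line at the top of this issue reports the same small address.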

bartonjs commented 2 years ago

> I'm not sure we'd arrived at consensus that we wouldn't take a fix here

The only complete fix we could take would be to run literally every shim function to OpenSSL under the same mutex we use for loading exception strings, to work around applications doing work on background threads after the main thread has exited (because these crashes are only after exit() has been called / main() has exited).

The biggest offender seems to be SSL_do_handshake; so we /might/ be able to start the game of whack-a-mole by making TLS handshakes mutexed; but I don't think that the networking team would like that. (We could probably change our mutex to a rwlock so we don't utterly kill parallelism with TLS handshakes, but it's still not free)
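
For readers wondering what the "mutex every shim call" idea would look like, here is a minimal sketch of a shared/exclusive gate in the spirit of the rwlock variant. It is not how the .NET shim is written; `ShutdownGate`, `call`, and `close` are hypothetical names, and returning `None` after shutdown is an arbitrary choice for the sketch.

```python
import threading


class ShutdownGate:
    """Many callers may be inside at once; close() waits for them, then refuses new ones."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0
        self._closed = False

    def call(self, shim, *args):
        """Run `shim(*args)` unless shutdown has started; returns None once closed."""
        with self._cond:
            if self._closed:
                return None
            self._active += 1
        try:
            return shim(*args)
        finally:
            with self._cond:
                self._active -= 1
                self._cond.notify_all()

    def close(self):
        """Refuse new calls, then block until in-flight calls have finished."""
        with self._cond:
            self._closed = True
            while self._active:
                self._cond.wait()
```

Every call still pays for two lock acquisitions, which is the "still not free" cost mentioned above, and calling `close()` from inside a shim would wait on itself forever.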

bartonjs commented 2 years ago

I've also not tried working with Canonical to get them to just patch in the support for OPENSSL_INIT_NO_ATEXIT. @richlander do you have any contacts there?

richlander commented 2 years ago

I do. Hey @wiswaud -- can you get us a contact at Canonical who can help us with some OpenSSL issues on Ubuntu 18.04?

richlander commented 2 years ago

I was given an official Canonical account to report issues via their tracker. That was quick.

@bartonjs Can you write a succinct description of the issue that I can copy/paste into the Canonical tracker?

bartonjs commented 2 years ago

@richlander How's this?

Bionic's OpenSSL 1.1.1 package (https://launchpad.net/ubuntu/bionic/+source/openssl) is the only version of openssl 1.1.1 on any distro that we've encountered that does not have support for the OPENSSL_INIT_NO_ATEXIT functionality from 1.1.1b (https://github.com/openssl/openssl/commit/c2b3db245452f185948b4f767f7e1051b6bd59a7).

The threading model in .NET has the possibility that background threads are still running when exit() is called, which can cause SIGSEGV if a background thread interacts with OpenSSL after/while it has unloaded. For that reason, we always initialize OpenSSL 1.1.1 with the OPENSSL_INIT_NO_ATEXIT flag (which, of all the distros we run on, has no effect only on Bionic).

We feel that the stability of applications on Ubuntu 18.04 would be improved if the functionality of OPENSSL_INIT_NO_ATEXIT was merged into the bionic openssl 1.1.1 package, even if the constant isn't published into the header for the dev package.

richlander commented 2 years ago

Perfect! Thanks much.

janvorli commented 2 years ago

I have been hitting this crash recently on my main devbox, which is Ubuntu 18.04. However, it only started happening recently; as of about a month ago the build was stable. So maybe something has changed in msbuild that makes this occur much more frequently, or something like that.

richlander commented 2 years ago

If you have a good repro, that would be useful. I am talking with Canonical now.

janvorli commented 2 years ago

Unfortunately I don't. It crashes on average once in a day or two when running ./build.sh script.

richlander commented 2 years ago

That's OK. Let's see if we can get a fix and then maybe deploy some early fixes.

krwq commented 1 year ago

Simple repro code for testing:

using System;
using System.Linq; // needed for bytes.Select in ToHex
using System.Runtime.InteropServices;
using System.Security.Cryptography;

atexit(AtExitHandler);

byte[] data = new byte[] { 0, 1, 2, 3, 4, 5 };
byte[] hashValue;

using (SHA256 sha256 = SHA256.Create())
{
    hashValue = sha256.ComputeHash(data);
}

Console.WriteLine($"hash: {ToHex(hashValue)}");

[DllImport("libc", EntryPoint = "__cxa_atexit", CallingConvention = CallingConvention.Cdecl)]
static extern int atexit(Action a);

static void AtExitHandler()
{
    byte[] randomBytes = new byte[16];
    RandomNumberGenerator.Fill(randomBytes);
    Console.WriteLine($"random: {ToHex(randomBytes)}");
}

static string ToHex(byte[] bytes)
{
    return string.Join("", bytes.Select((b) => b.ToString("X2")));
}

In case the exported name of atexit differs on your system, this prints the correct one:

nm -D `ldd \`which echo\` | grep libc | cut '--delimiter= ' -f 3` | grep 'atexit\>' | cut '--delimiter= ' -f 3
# for me prints: __cxa_atexit

We should probably build a small .so that wraps atexit, since atexit is only source-compatible and in some places __cxa_atexit is documented as taking two extra args, but this is meant to be a simple demonstration of the issue. Adjust as needed.

richlander commented 1 year ago

We are in the late stages of getting Canonical to publish a fix in Ubuntu 18.04 via their ESM program. I believe the easiest way to access that is via Ubuntu Pro.

richlander commented 1 year ago

The fix has been released in libssl package version 1.1.1-1ubuntu2.1~18.04.23+esm1.

Here are my repro steps to acquire that package: https://gist.github.com/richlander/47333cbf90ee0ee3f51bcb0dbbb3a76f?permalink_comment_id=4676592#gistcomment-4676592