dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

dotnet seems to abort when libmsquic exists but cannot be loaded #82316

Closed wfurt closed 1 year ago

wfurt commented 1 year ago

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-81973-merge-7e0ae34d93e042d592/System.Net.Quic.Functional.Tests/1/console.1beb6b05.log?helixlogtype=result

docker image mcr.microsoft.com/dotnet-buildtools/prereqs:cbl-mariner-2.0-helix-amd64-staging

/root/helix/work/correlation/dotnet exec --runtimeconfig System.Net.Quic.Functional.Tests.runtimeconfig.json --depsfile System.Net.Quic.Functional.Tests.deps.json xunit.console.dll System.Net.Quic.Functional.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Net.Quic.Functional.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Net.Quic.Functional.Tests (found 109 of 115 test cases)
  Starting:    System.Net.Quic.Functional.Tests (parallel test collections = on, max threads = 2)
    System.Net.Quic.Tests.MsQuicPlatformDetectionTests.SupportedWindowsPlatforms_IsSupportedIsTrue [SKIP]
      Condition(s) not met: "IsWindows"
    System.Net.Quic.Tests.MsQuicPlatformDetectionTests.UnsupportedPlatforms_ThrowsPlatformNotSupportedException [SKIP]
      Condition(s) not met: "IsQuicUnsupported"
/root/helix/work/correlation/dotnet: symbol lookup error: /lib64/libmsquic.so.2: undefined symbol: EVP_chacha20_poly1305, version OPENSSL_1_1_0
/root/helix/work/workitem/e
----- end Sat Feb 11 01:03:45 UTC 2023 ----- exit code 127 ----------------------------------------------------------
ulimit -c value: unlimited
./RunTests.sh: line 190: dmesg: command not found

I would expect the new tests to fail, but the test run did not even finish. The underlying MsQuic issue is tracked here: https://github.com/microsoft/msquic/issues/3422

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Issue Details
Author: wfurt
Assignees: -
Labels: `os-linux`, `area-System.Net.Quic`
Milestone: -

ManickaP commented 1 year ago

IMHO it should not crash, as we try to load the library and return IsSupported = false in case we cannot. I looked and couldn't find any dump from this failure, so we need to test this locally and understand it; we might have overlooked something. How can we do that, @wfurt?

wfurt commented 1 year ago

I execute the test in a CentOS 7 or Mariner container on my dev machine (e.g. build on Ubuntu 20 and map the build tree to the container using --volume).

wfurt commented 1 year ago

This seems to be an issue with NativeLibrary.TryLoad. It happily gives back a handle even if there are missing dependencies, and it blows up later when we try to use the API table. It feels like what we would need is the equivalent of RTLD_NOW from dlopen.

cc: @jkotas @janvorli for any more thoughts.
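For illustration, the eager-binding idea above can be sketched directly against dlopen in C. This is only a minimal sketch under stated assumptions: `can_load_eagerly` is a hypothetical helper name, not part of .NET or MsQuic, and the probe simply asks the dynamic linker to resolve every symbol up front so that a missing symbol shows up as a dlopen failure instead of a later process abort.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not a .NET or MsQuic API).
 * Returns 1 if `path` can be loaded with all symbols resolved eagerly,
 * 0 otherwise. On failure, the dlerror() text is written to stderr. */
int can_load_eagerly(const char *path)
{
    dlerror(); /* clear any stale error state */

    /* RTLD_NOW: resolve all undefined symbols before dlopen returns,
     * rather than lazily on first call as with RTLD_LAZY. */
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                path, err != NULL ? err : "unknown error");
        return 0;
    }

    /* The probe only checks loadability; release the handle again. */
    dlclose(handle);
    return 1;
}
```

A caller could run `can_load_eagerly("libmsquic.so.2")` before deciding whether to report QUIC as supported. On older glibc versions the program may need to be linked with `-ldl`.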

janvorli commented 1 year ago

I would prefer not changing to using RTLD_NOW in general, as it seems to me that it can break valid scenarios. The RTLD_LAZY causes a problem for the case when something is incorrectly installed or incorrectly referencing missing dependencies. But there are scenarios where RTLD_LAZY is needed to ensure proper behavior. I have found some of these scenarios described here: http://www.qnx.com/developers/docs/qnxcar2/index.jsp?topic=%2Fcom.qnx.doc.neutrino.prog%2Ftopic%2Fdevel_RTLD_LAZY.html. As a side effect, it also speeds up library loading in case of libraries with a lot of symbols.

What would seem good to do, though, is to add a new overload of NativeLibrary.TryLoad with an argument for passing in the RTLD_xxx flags. People could then decide what is appropriate for their scenarios.

jkotas commented 1 year ago

If you need custom dlopen flags, you can P/Invoke dlopen yourself. We intentionally did not provide a managed API that allows you to pass custom flags. A lot of these flags are platform-specific.

I do not think it is a good idea to start loading libmsquic with RTLD_NOW. RTLD_NOW is bad for startup performance.

The root cause of this issue was a bug in the libmsquic library that you have fixed. Bugs in the native libraries can cause the process to crash.

ManickaP commented 1 year ago

We could work around this with try/catch around the GetExport or inside TryOpenMsQuic if that's what is throwing.

jkotas commented 1 year ago

Both GetExport and TryOpenMsQuic completed successfully. The crash happened later when executing libmsquic code.

wfurt commented 1 year ago

It mostly came as a surprise. It mostly worked as expected in the past, as in cases when libmsquic depends on libcrypto.so.1.1 but only libcrypto.so.3 is available. The behavior is unpleasant, as we report IsSupported = true and then fail with an error the user cannot catch (and it seems like msquic will have more dependencies in the next version).

But I think we can probably live with that, as it would be rare. msquic is trying to make a single binary work with various distributions - just like .NET, but it lacks the PAL capabilities we have. I'm not sure I would call it a bug, as flavors of OpenSSL can differ across distributions. And yes, I specifically submitted changes to msquic so we can run on the Linux distributions .NET supports.

While we have DOTNET_SYSTEM_NET_HTTP_SOCKETSHTTPHANDLER_HTTP3SUPPORT, perhaps we can also think about a bypass switch for Quic itself. The one above is, for example, not applicable to Kestrel, and it would be nice IMHO to have some mechanism in place to disable Quic operations if somebody bumps into it.

CarnaViire commented 1 year ago

Triage: there's nothing reasonable we can do. This should not be visible to users on supported platforms. Closing.