Closed wfurt closed 1 year ago
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
Author: | wfurt |
---|---|
Assignees: | - |
Labels: | `os-linux`, `area-System.Net.Quic` |
Milestone: | - |
IMHO it should not crash, as we try to load the library and return `IsSupported = false` if we cannot.
I looked and couldn't find any dump from this failure, so we need to test this locally and understand it; we might have overlooked something. How can we do that, @wfurt?
I execute the test in a CentOS 7 or Mariner container on my dev machine (e.g. build on Ubuntu 20 and map the build tree into the container using `--volume`).
This seems to be an issue with `NativeLibrary.TryLoad`. It happily gives back a handle even if there are missing dependencies, and it blows up later when we try to use the API table. It feels like what we would need is the equivalent of `RTLD_NOW` from `dlopen`.
cc: @jkotas @janvorli for any more thoughts.
I would prefer not changing to `RTLD_NOW` in general, as it seems to me that it can break valid scenarios. `RTLD_LAZY` causes a problem when something is incorrectly installed or incorrectly references missing dependencies. But there are scenarios where `RTLD_LAZY` is needed to ensure proper behavior. I have found some of these scenarios described here: http://www.qnx.com/developers/docs/qnxcar2/index.jsp?topic=%2Fcom.qnx.doc.neutrino.prog%2Ftopic%2Fdevel_RTLD_LAZY.html.
As a side effect, `RTLD_LAZY` also speeds up library loading for libraries with a lot of symbols.
What would seem good to do, though, is to add a new version of the `TryLoad` function with an argument to pass the `RTLD_xxx` flags to `NativeLibrary.TryLoad`. People could then decide what is appropriate for their scenarios.
If you need custom `dlopen` flags, you can P/Invoke `dlopen` yourself. We intentionally did not provide a managed API that allows you to pass custom flags. A lot of these flags are platform specific.
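To illustrate what calling `dlopen` yourself with an explicit flag looks like, here is a minimal sketch. It uses Python's `ctypes` as a stand-in for a .NET P/Invoke, assumes a POSIX system where `dlopen`, `dlerror` and `dlclose` are visible in the running process, and the helper name `can_load_eagerly` is invented for this example. With `RTLD_NOW`, all undefined function symbols are resolved at load time, so an unresolvable symbol makes `dlopen` fail up front instead of failing at the first call.

```python
import ctypes
import os

# Look up dlopen/dlerror/dlclose among the symbols already loaded into this
# process (assumption: POSIX system; this does not work on Windows).
_proc = ctypes.CDLL(None)

_dlopen = _proc.dlopen
_dlopen.argtypes = [ctypes.c_char_p, ctypes.c_int]
_dlopen.restype = ctypes.c_void_p

_dlerror = _proc.dlerror
_dlerror.argtypes = []
_dlerror.restype = ctypes.c_char_p

_dlclose = _proc.dlclose
_dlclose.argtypes = [ctypes.c_void_p]
_dlclose.restype = ctypes.c_int


def can_load_eagerly(path):
    """Hypothetical helper: probe a library with RTLD_NOW.

    Returns (True, None) if dlopen succeeds, or (False, message) with the
    dlerror() text if it fails.
    """
    handle = _dlopen(os.fsencode(path), os.RTLD_NOW)
    if not handle:
        err = _dlerror()
        message = err.decode(errors="replace") if err else "unknown dlopen error"
        return False, message
    # This was only a probe, so release the handle again.
    _dlclose(handle)
    return True, None
```

A caller could run such a probe before reporting that a feature is supported. As noted in this thread, eager binding costs startup time, so this is a trade-off rather than a recommended default.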
I do not think it is a good idea to start loading libmsquic with `RTLD_NOW`. `RTLD_NOW` is bad for startup performance.
The root cause of this issue was a bug in the libmsquic library that you have fixed. Bugs in native libraries can cause the process to crash.
We could work around this with `try/catch` around the `GetExport` or inside `TryOpenMsQuic` if that's what is throwing.
Both `GetExport` and `TryOpenMsQuic` completed successfully. The crash happened later when executing libmsquic code.
It mostly came as a surprise. It mostly worked as expected in the past, like cases when `libmsquic` depends on `libcrypto.so.1.1` but only `libcrypto.so.3` is available. The behavior is unpleasant, as we report `IsSupported = true` and then fail with an error the user cannot catch (and it seems like msquic will have more dependencies in the next version).
But I think we can probably live with that, as it would be rare. msquic is trying to make a single binary work with various distributions, just like .NET, but it lacks the PAL capabilities we have. I'm not sure I would call it a bug, as flavors of OpenSSL can differ across distributions. And yes, I specifically submitted changes to msquic so we can run on the Linux distributions .NET supports.
While we have `DOTNET_SYSTEM_NET_HTTP_SOCKETSHTTPHANDLER_HTTP3SUPPORT`, perhaps we can also think about a bypass switch for `Quic` itself. The one above is, for example, not applicable to Kestrel, and it would be nice IMHO to have some mechanism in place to disable `Quic` operations if somebody bumps into it.
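The bypass idea can be sketched as a check that runs before any native probing. This is only an illustration in Python: the switch name `EXAMPLE_QUIC_DISABLED` and the function `quic_is_supported` are invented for this sketch and are not existing .NET settings or APIs.

```python
import os

# Hypothetical switch name, invented for this sketch; not a real .NET setting.
DISABLE_SWITCH = "EXAMPLE_QUIC_DISABLED"


def quic_is_supported(probe, environ=os.environ):
    """Report support only if the switch is off and the native probe succeeds.

    `probe` is any zero-argument callable that tries to load the native
    library and returns a truthy value on success.
    """
    value = environ.get(DISABLE_SWITCH, "").strip().lower()
    if value in ("1", "true"):
        # Bypass requested: do not touch the native library at all.
        return False
    return bool(probe())
```

The point of checking the switch first is that a user who hits a crashing native library can turn the feature off without the library ever being loaded.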
Triage: there's nothing reasonable we can do. This should not be visible to users on supported platforms. Closing
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-81973-merge-7e0ae34d93e042d592/System.Net.Quic.Functional.Tests/1/console.1beb6b05.log?helixlogtype=result
docker image mcr.microsoft.com/dotnet-buildtools/prereqs:cbl-mariner-2.0-helix-amd64-staging
I would expect new tests to fail, but the test run did not even finish. The underlying MsQuic issue is tracked here: https://github.com/microsoft/msquic/issues/3422