dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.85k stars 4.62k forks source link

LTTNG-UST 2.13 changes soname breaking loading of libcoreclrtraceptprovider #57784

Open hoyosjs opened 3 years ago

hoyosjs commented 3 years ago

There's an ABI breaking change between LTTNG-ust 2.12 and 2.13, so they've reved their soname from liblttng-ust.so.0 to liblttng-ust.so.1. We link against the former. Given some distros will start supporting newer versions, customers will likely end up upgrading packages and realize they can't load libcoreclrtraceptprovider.so. We should investigate avenues to support multiple versions as needed using compilation tricks like the ones used in System.Security for openssl. Other possibility is to only support 2.13 and that brings us come code sanity advantages - this would only happen in .NET 7+ realistically.

@brianrob @dotnet/dotnet-diag

ghost commented 3 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details
There's an ABI breaking change between LTTNG-ust 2.12 and 2.13, so they've reved their soname from `liblttng-ust.so.0` to `liblttng-ust.so.1`. We link against the former. Given some distros will start supporting newer versions, customers will likely end up upgrading packages and realize they can't load `libcoreclrtraceptprovider.so`. We should investigate avenues to support multiple versions as needed using compilation tricks like the ones used in System.Security for openssl. Other possibility is to only support 2.13 and that brings us come code sanity advantages - this would only happen in .NET 7+ realistically. @brianrob @dotnet/dotnet-diag
Author: hoyosjs
Assignees: -
Labels: `area-Diagnostics-coreclr`
Milestone: -
am11 commented 3 years ago

using compilation tricks like the ones used in System.Security for openssl

Note, same (shim) tricks are used for libnuma.so, which is closer to where coreclrtraceptprovider lives. 🙂

hoyosjs commented 3 years ago

Thanks. The shim approach is tempting. We sort of codegen the trace points, so this one will be interesting if we decide to support both.

am11 commented 2 years ago

After some digging, it turned out lttng/tracepoint.h header (the one we use) does things quite differently than libssl, libnuma ones. It differs in the following manner:

Here is the expansion of macros for our generated tpdotnetruntimemonoprofiler.h:

# on Ubuntu x64
$ ./build.sh clr
$ clang-9 -E \
    artifacts/obj/coreclr/Linux.x64.Debug/pal/src/eventprovider/lttngprovider/lttng/tpdotnetruntimemonoprofiler.h \
    -Isrc/coreclr/pal/inc/rt -Isrc/coreclr/pal/inc > dump

the dump looks like this: http://sprunge.us/J5rsv3 (and file tpdotnetruntimemonoprofiler.h: http://sprunge.us/LPmIuJ). Search for __tracepoints__init or dlopen("liblttng. It is quite noisy, so I made a copy, removed #include "palrt.h" etc. and kept only one probe, here is the shorter version of this expansion: http://sprunge.us/BK0q1V, at the very end of this file is the code of interest.

I will defer to @janvorli as to the best approach for lttngshim. One option is that we install both versions on the official build machine, and compile both v0 and v1 implementations in our code with our pseudo init; which will probe for .so.1, and if not found, .so.0, then select the implementation based on the result. This approach will prevent us from stamping out symbols of liblttng-ust{-tracepoint}.so and we will still be using their init mechanism (which seem to take care of "shared access" scenarios).

janvorli commented 2 years ago

@am11 thank you for the analysis! I'll take a look into the possibilities of shimming lttng and consider your suggestion.

am11 commented 2 years ago

Thanks. I actually found about dlopen/dlsym pattern in their code when I commented out these lines (to see what breaks) https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/coreclr/pal/src/eventprovider/lttngprovider/CMakeLists.txt#L63-L68 just to find out it had no effect on the build result (still succeeds) other than we (unnecessarily) link to liblttng-ust{-tracepoint}.so.0. :slightly_smiling_face: All public APIs in their header are already shim-like macro (without multi-version support):

janvorli commented 2 years ago

It seems that we will need to create and distribute two flavors of the libcoreclrtraceptprovider.so, one compiled with one version of lttng and the other with the other one. Then we would try to load one and if we fail, try to load the other. It seems that's essentially what @am11 has suggested, right @am11?

hoyosjs commented 2 years ago

Yes, and that also seems to be what needs to happen from my understanding.

am11 commented 2 years ago

@janvorli, yes, that's what I was thinking. It will be cleaner, as it will save us from using macro post-expansion calls. The quickest way to get there is to duplicate python code (as macro names are changed in 2.13).

Part II - python free implementation

However, in a separate project, we can replace the python scripts with lttngshim which will dynamically define those macros (at build time) and we can expand it without needing to build code. It is more work as python scripts read the XML (ETL manifest) file and generates multiple header and code files. If we think replacing python is a good idea, then I was imagining for:

brianrob commented 2 years ago

One thing to add here: Using src\coreclr\vm\ClrEtwAll.man as the one source of truth means that all other tools that consume events only have to look at one source of truth. For example, TraceEvent-based tools use the manifest to generate C# code that gets used to parse the events.

am11 commented 2 years ago

One thing to add here: Using src\coreclr\vm\ClrEtwAll.man as the one source of truth means that all other tools that consume events only have to look at one source of truth

I understand the rationale, and using XmlPeek -based parsing to dynamically generate the header is an option. However, in varied parts of runtime code base, we do manually keep lists in-sync which infrequently change, a few examples: https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/coreclr/debug/inc/dbgtargetcontext.h#L29 https://github.com/dotnet/runtime/blob/462ab6e7a65dec97944e408b1959e71276f83988/src/libraries/Common/src/Interop/Unix/System.Native/Interop.UnixFileSystemTypes.cs#L15 https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/Native/Unix/System.Globalization.Native/pal_calendarData.h#L13

brianrob commented 2 years ago

Agree that we have precedent for duplicate lists that have to be kept in sync, I'd just rather not add another one. :)

hoyosjs commented 2 years ago

Yes, but eventing is one of those things that often doesn't get updated much, but it gets updated by more than their known maintainers. Maintaining it outside the manifest is brittle. That's the main reason the python scripts have changed. It could easily be a little C# app that we stick in clr.prereqs and reads the manifest; we could stub them for bring up or generate them in a separate machine if needed if all we are trying to do is get rid of the Python dependency. There's prior art for this with the DAC table generation.

am11 commented 2 years ago

but it gets updated by more than their known maintainers

Looking at git history of ClrEtwAll.man, it doesn't seem that way.

It could easily be a little C# app

Why would the C# app better than a simple msbuild XmlPeek like https://github.com/dotnet/runtime/blob/b201a16e1a642f9532c8ea4e42d23af8f4484a36/src/libraries/System.Runtime.InteropServices/tests/TestAssets/NativeExports/NativeExports.csproj#L49?

janvorli commented 2 years ago

Both C# app and msbuild would be problematic for bring-up of new platforms where we need to be able to build runtime native parts without pre-existing .NET runtime. So it seems it would be preferable to use a shell script instead if we wanted to drop python.

hoyosjs commented 2 years ago

@janvorli for bring up, do you consider LTTNG crucial? Eventing in general is important for breadcrumbs, but I don't see why on bring up the source can't be checked in for the time being or stubbed out until it's deemed usable. We didn't want bash - we'd need bash + cmd. That's why it's python right now. I don't see all that much issue in python at the moment, but I know @am11 and @jkotas have a different opinion around having it as a dependency.

am11 commented 2 years ago

I was thinking about a header containing a list of macros with event definitions (similar toecalllist.h), checked into the source control.

That header file containing definitions could be generated/update with msbuild task when msbuild is available. On new platforms, where msbuild is not available and we pass -skipmanaged etc. to build-runtime.sh (i.e. not a common scenario or something used by any CI leg), chances of someone adding a new event that updates ETL manifest in the same commit where they are porting runtime on a new platform does not sound likely.

Regarding the shell script option, it would require an additional dependency on xml parsing tool which I was avoiding.

janvorli commented 2 years ago

for bring up, do you consider LTTNG crucial?

No, however we would need to enable building runtime without eventing. However, it would be nicer to be able to build all native parts even on a new platform. In the past, we've done bringup using an existing .NET SDK as a basis where we've replaced all native binaries be ones built on the new platform. So having all of them makes things complete. Also, I am not sure how would the source build that is expected to be used by distros handle this case.

jkotas commented 2 years ago

I was thinking about a header containing a list of macros with event definition

This looks similar to what I have suggested in https://github.com/dotnet/runtime/pull/39052#issuecomment-656709021

Also, I am not sure how would the source build that is expected to be used by distros handle this case.

I believe that source build assumes pre-built SDK. It does not bootstrap from nothing.

janvorli commented 2 years ago

This looks similar to what I have suggested in #39052 (comment)

Ah, I've forgotten about this comment, that sounds like the best solution to take care of my concern.

omajid commented 2 years ago

FWIW, Fedora 36 has picked up this version of lttng-ust and I am seeing crashes (in crossgen2!) there while trying to build .NET 6. Here's the backtrace from lldb:

* thread #1, name = 'crossgen2', stop reason = signal SIGSEGV: invalid address (fault address: 0x10)
    frame #0: 0x00007ffff71dcb82 liblttng-ust.so.1`check_event_provider + 162
liblttng-ust.so.1`check_event_provider:
->  0x7ffff71dcb82 <+162>: movq   0x10(%rax), %rdi
    0x7ffff71dcb86 <+166>: callq  0x7ffff71dca40            ; check_type_provider
    0x7ffff71dcb8b <+171>: movl   %eax, %r12d
    0x7ffff71dcb8e <+174>: testb  %al, %al
(lldb) bt
* thread #1, name = 'crossgen2', stop reason = signal SIGSEGV: invalid address (fault address: 0x10)
  * frame #0: 0x00007ffff71dcb82 liblttng-ust.so.1`check_event_provider + 162
    frame #1: 0x00007ffff71e24d1 liblttng-ust.so.1`lttng_ust_probe_register + 33
    frame #2: 0x00007ffff72957b5 libcoreclrtraceptprovider.so`lttng_ust__events_init__DotNETRuntime() at ust-tracepoint-event.h:1198:14

Is there a fix for this planned?

am11 commented 2 years ago

Is there a fix for this planned?

Current workaround is to disable event trace by deleting the following block before build: https://github.com/dotnet/runtime/blob/9d4ce4048c08317e320d5b826df7f12d50b7d083/src/coreclr/clrfeatures.cmake#L5-L7

Supporting both ABI versions of liblttng-ust in parallel would require more work than a few liner fix.

omajid commented 2 years ago

Disabling feature like this:

diff --git a/src/coreclr/clrfeatures.cmake b/src/coreclr/clrfeatures.cmake
index f82ff1aa4e7..dfab148dbc3 100644
--- a/src/coreclr/clrfeatures.cmake
+++ b/src/coreclr/clrfeatures.cmake
@@ -3,7 +3,7 @@ if(CLR_CMAKE_TARGET_TIZEN_LINUX)
 endif()

 if(NOT DEFINED FEATURE_EVENT_TRACE)
-  set(FEATURE_EVENT_TRACE 1)
+  set(FEATURE_EVENT_TRACE 0)
 endif(NOT DEFINED FEATURE_EVENT_TRACE)

 if(NOT DEFINED FEATURE_PERFTRACING AND FEATURE_EVENT_TRACE)

Or even

diff --git a/src/coreclr/clrfeatures.cmake b/src/coreclr/clrfeatures.cmake
index f82ff1aa4e7..54a61c808e6 100644
--- a/src/coreclr/clrfeatures.cmake
+++ b/src/coreclr/clrfeatures.cmake
@@ -3,7 +3,6 @@ if(CLR_CMAKE_TARGET_TIZEN_LINUX)
 endif()

 if(NOT DEFINED FEATURE_EVENT_TRACE)
-  set(FEATURE_EVENT_TRACE 1)
 endif(NOT DEFINED FEATURE_EVENT_TRACE)

 if(NOT DEFINED FEATURE_PERFTRACING AND FEATURE_EVENT_TRACE)

Doesn't seem to work too well out of the box. A ./build.sh produces:

  /home/omajid/devel/dotnet/runtime/src/coreclr/inc/profilepriv.h:388:34: error: unknown type name 'EventPipeProvider'
      void EventPipeEventDelivered(EventPipeProvider *provider, DWORD eventId, DWORD eventVersion, ULONG cbMetadataBlob, LPCBYTE metadataBlob, ULONG cbEventData,                                                                           
                                   ^                                                                                    /home/omajid/devel/dotnet/runtime/src/coreclr/inc/profilepriv.h:390:35: error: unknown type name 'EventPipeProvider'
      void EventPipeProviderCreated(EventPipeProvider *provider);                                                     
                                    ^
  /home/omajid/devel/dotnet/runtime/src/coreclr/vm/eetoprofinterfaceimpl.h:462:9: error: unknown type name 'EventPipeProvider'                                                                                                              
          EventPipeProvider *provider,                                                                                
          ^                                                                                                           
  /home/omajid/devel/dotnet/runtime/src/coreclr/vm/eetoprofinterfaceimpl.h:475:38: error: unknown type name 'EventPipeProvider'                                                                                                             
      HRESULT EventPipeProviderCreated(EventPipeProvider *provider);                                                  
                                       ^
  /home/omajid/devel/dotnet/runtime/src/coreclr/inc/profilepriv.h:388:34: error: unknown type name 'EventPipeProvider'
      void EventPipeEventDelivered(EventPipeProvider *provider, DWORD eventId, DWORD eventVersion, ULONG cbMetadataBlob, LPCBYTE metadataBlob, ULONG cbEventData,                                                                           
                                   ^                                                                                  
  /home/omajid/devel/dotnet/runtime/src/coreclr/inc/profilepriv.h:390:35: error: unknown type name 'EventPipeProvider'
      void EventPipeProviderCreated(EventPipeProvider *provider);                                                     
                                    ^                                           

And many similar errors.

am11 commented 2 years ago

Sounds like something has regressed since https://github.com/dotnet/runtime/pull/32746. :) This scenario is not tested in the CI (build on linux by disabling FEATURE_EVENT_TRACE). Perhaps we also need to delete: https://github.com/dotnet/runtime/blob/ad9979381706d1976e4d26e13cb88d6147264968/src/coreclr/clrdefinitions.cmake#L119-L121 and set this to false: https://github.com/dotnet/runtime/blob/ad9979381706d1976e4d26e13cb88d6147264968/src/coreclr/clr.featuredefines.props#L17

hoyosjs commented 2 years ago

Is disabling loading LTTNG support enough for now? COMPlus_LTTng=0

omajid commented 2 years ago

Is disabling loading LTTNG support enough for now? COMPlus_LTTng=0

It's good enough for me, personally. And it's enough for the source-build builds I care about. But end-users who upgrade to distributions that have lttng-ust (Fedora >= 36, Arch Linux, probably more soon) will run into crashes at runtime and will not even know about this workaround.

am11 commented 2 years ago

Unless someone is already working on it, I think we should stop pursuing "convert event generator to shim" part for now and focus on multi-version support first. It includes the following:

Conversion to shim would be nice, but it has many moving parts; even if we do the conversion first, it will probably not exempt the above steps.

tmds commented 2 years ago

Fedora 36 has picked up this version of lttng-ust and I am seeing crashes (in crossgen2!) there while trying to build .NET 6. Here's the backtrace from lldb:

I'm reproducing this issue.

  * frame #0: 0x00007f8885a47b82 liblttng-ust.so.1`check_event_provider + 162
    frame #1: 0x00007f8885a4d4d1 liblttng-ust.so.1`lttng_ust_probe_register + 33
    frame #2: 0x00007f8885b007b5 libcoreclrtraceptprovider.so`lttng_ust__events_init__DotNETRuntime() at ust-tracepoint-event.h:1198:14
    frame #3: 0x00007f888683fa2e ld-linux-x86-64.so.2`call_init(l=<unavailable>, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:70:3
    frame #4: 0x00007f888683fb1c ld-linux-x86-64.so.2`_dl_init(main_map=0x0000556bd608a290, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:117:5
    frame #5: 0x00007f88864534c5 libc.so.6`_dl_catch_exception + 229
    frame #6: 0x00007f88868437de ld-linux-x86-64.so.2`dl_open_worker at dl-open.c:821:5
    frame #7: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #8: 0x00007f8886843b5c ld-linux-x86-64.so.2`_dl_open at dl-open.c:896:17
    frame #9: 0x00007f888638294c libc.so.6`dlopen_doit + 92
    frame #10: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #11: 0x00007f8886453533 libc.so.6`_dl_catch_error + 51
    frame #12: 0x00007f888638244e libc.so.6`_dlerror_run + 142
    frame #13: 0x00007f88863829d8 libc.so.6`dlopen@GLIBC_2.2.5 + 72
    frame #14: 0x00007f8885fd6893 libcoreclr.so`PAL_InitializeTracing() at tracepointprovider.cpp:116:9
    frame #15: 0x00007f888683fa2e ld-linux-x86-64.so.2`call_init(l=<unavailable>, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:70:3
    frame #16: 0x00007f888683fb1c ld-linux-x86-64.so.2`_dl_init(main_map=0x0000556bd6060050, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:117:5
    frame #17: 0x00007f88864534c5 libc.so.6`_dl_catch_exception + 229
    frame #18: 0x00007f88868437de ld-linux-x86-64.so.2`dl_open_worker at dl-open.c:821:5
    frame #19: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #20: 0x00007f8886843b5c ld-linux-x86-64.so.2`_dl_open at dl-open.c:896:17
    frame #21: 0x00007f888638294c libc.so.6`dlopen_doit + 92
    frame #22: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #23: 0x00007f8886453533 libc.so.6`_dl_catch_error + 51
    frame #24: 0x00007f888638244e libc.so.6`_dlerror_run + 142
    frame #25: 0x00007f88863829d8 libc.so.6`dlopen@GLIBC_2.2.5 + 72
    frame #26: 0x00007f8886274ead libhostpolicy.so`pal::load_library(path="/home/tmds/rpmbuild/BUILD/dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.netcore.app.crossgen2.linux-x64/6.0.0/tools/libcoreclr.so", dll=0x00007f888629e0a0) at pal.unix.cpp:230:12
...

I think the build works fine while using the prebuilts that link against liblttng-ust.so.0: It silently fails to load libcoreclrtraceptprovider.so.

In the stacktrace we see liblttng-ust.so.1 is loaded. I think these are binaries that got built against that. The SIGSEGV occurs in check_event_provider which is meant to validate the tracepoints. This may a regression in LTTng.

tmds commented 2 years ago

I can reproduce the issue by invoking corerun on a hello world console app.

The crash happens at this line: https://github.com/lttng/lttng-ust/blob/4c155a06d838e1ab5d385abd1d73ae56e71b7d5e/src/lib/lttng-ust/lttng-probes.c#L153.

The field is null.

(gdb) p *tp_class
$3 = {struct_size = 48, fields = 0x7ffff73ab2e0 <lttng_ust__event_fields___DotNETRuntime___GCStart>, nr_fields = 2, 
  probe_callback = 0x7ffff7364820 <lttng_ust__event_probe__DotNETRuntime___GCStart(void*, unsigned int, unsigned int)>, 
  signature = 0x7ffff738e720 <__tp_event_signature___DotNETRuntime___GCStart> "const unsigned int, Count, const unsigned int, Reason", probe_desc = 0x7ffff73a1470 <lttng_ust__probe_desc___DotNETRuntime>}
(gdb) p tp_class->fields[0]
$4 = (const struct lttng_ust_event_field * const) 0x0
(gdb) p tp_class->fields[1]
$5 = (const struct lttng_ust_event_field * const) 0x0

GCStart has 2 fields as we see in definition below, but the array has null pointers for them.

#define GCStart_TRACEPOINT_ARGS \
TP_ARGS( \
        const unsigned int, Count, \
        const unsigned int, Reason \
)
TRACEPOINT_EVENT_CLASS(
    DotNETRuntime,
    GCStart,
    GCStart_TRACEPOINT_ARGS,
     TP_FIELDS(
        ctf_integer(unsigned int, Count, Count)
        ctf_integer(unsigned int, Reason, Reason)
    )
)
tmds commented 2 years ago

The field is null.

These fields get initialized dynamically.

static const struct lttng_ust_event_field * const lttng_ust__event_fields___DotNETRuntime___GCStart[] = { new (const struct lttng_ust_event_field) { .struct_size = sizeof(struct lttng_ust_event_field), .name = "Count", .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) { .parent = { .type = lttng_ust_type_integer, }, .struct_size = sizeof(struct lttng_ust_type_integer), .size = sizeof(unsigned int) * 8, .alignment = 1 * 8, .signedness = (std::is_signed<unsigned int>::value), .reverse_byte_order = 1234 != 1234, .base = 10, }), .nowrite = 0, .nofilter = 0, }, new (const struct lttng_ust_event_field) { .struct_size = sizeof(struct lttng_ust_event_field), .name = "Reason", .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) { .parent = { .type = lttng_ust_type_integer, }, .struct_size = sizeof(struct lttng_ust_type_integer), .size = sizeof(unsigned int) * 8, .alignment = 1 * 8, .signedness = (std::is_signed<unsigned int>::value), .reverse_byte_order = 1234 != 1234, .base = 10, }), .nowrite = 0, .nofilter = 0, }, new (const struct lttng_ust_event_field) { .struct_size = sizeof(struct lttng_ust_event_field), .name = "dummy", .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) { .parent = { .type = lttng_ust_type_integer, }, .struct_size = sizeof(struct lttng_ust_type_integer), .size = sizeof(int) * 8, .alignment = 1 * 8, .signedness = (std::is_signed<int>::value), .reverse_byte_order = 1234 != 1234, .base = 10, }), .nowrite = 0, .nofilter = 0, }, }; static const struct lttng_ust_tracepoint_class lttng_ust__event_class___DotNETRuntime___GCStart = { .struct_size = sizeof(struct lttng_ust_tracepoint_class), .fields = lttng_ust__event_fields___DotNETRuntime___GCStart, .nr_fields = (sizeof(lttng_ust__event_fields___DotNETRuntime___GCStart) / sizeof((lttng_ust__event_fields___DotNETRuntime___GCStart)[0])) - 1, .probe_callback = (void (*)(void)) &lttng_ust__event_probe__DotNETRuntime___GCStart, .signature = __tp_event_signature___DotNETRuntime___GCStart, .probe_desc = &lttng_ust__probe_desc___DotNETRuntime, };

It seems they have not been initialized (yet):

(gdb) p lttng_ust__event_fields___DotNETRuntime___GCStart
$1 = {0x0, 0x0, 0x0}
am11 commented 2 years ago

I think the build works fine

binary built against liblttng-ust.so.0 can not work against .so.1 without running into validation errors; per the release notes of ABI breaking change.

tmds commented 2 years ago

binary built against liblttng-ust.so.0 can not work against .so.1 without running into validation errors; per the release notes of ABI breaking change.

Yes, but it doesn't get used.

Because libcoreclrtraceptprovider.so is linked against liblttng-ust.so.0, and that is not available, the library silently fails to load here:

https://github.com/dotnet/runtime/blob/b0159122f3570decd8ced6228681a210e2711de6/src/coreclr/pal/src/misc/tracepointprovider.cpp#L120-L122

am11 commented 2 years ago

Yes, in addition to updating PAL (adding version probing at run-time), I think we would also need two versions of libcoreclrtraceptprovider .so.0 and .so.0 to cover the supported matrix. The version selection part is yet to be implemented. It is the same situation as we once had with openssl, where we used to require 1.0 until later version support came along.

lttng is a bit tricky as we use their macro based APIs and their headers expand those macros to dlopen (with hardcoded SO version) calls at build time, so we would need both set of headers on the build machine.

That's why I think best way is to have those headers in our source tree under src/native/external/lttng (where we were thinking to eventually move rest of the similar vendor code in the repo).

tmds commented 2 years ago

This issue is about adding support for both versions. I've created a separate issue for the crash we see on Fedora 36: https://github.com/dotnet/runtime/issues/62398.

am11 commented 2 years ago

This issue is about adding support for both versions.

Reason for adding this support is because .NET apps started to crash on systems with newer lttng library. If #62398 is about a temporary fix like if NOT dlopen(liblttng-ust.so.0) then do what DOTNET_LTTng=0 does to disable eventing, then it makes sense we can probably backport that to 6.0 until the multi-version support arrives.

tmds commented 2 years ago

If #62398 is about

That is about .NET crashing when built against liblttng-ust.so.1 on Fedora 36.

.NET apps started to crash

From what I see, .NET built against liblttng-ust.so.0 doesn't crash on systems that have liblttng-ust.so.1.

$ dnf list installed lttng-ust
Installed Packages
lttng-ust.x86_64                                        2.13.0-1.fc36                                         @rawhide
$ ls /usr/lib64/liblttng-ust.so.1
/usr/lib64/liblttng-ust.so.1
$ which dotnet
~/dotnet-60/dotnet
$ ldd ~/dotnet-60/shared/Microsoft.NETCore.App/6.0.0/libcoreclrtraceptprovider.so
    linux-vdso.so.1 (0x00007ffe6f571000)
    liblttng-ust.so.0 => not found
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f6e3571d000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f6e3563d000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6e35622000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f6e35419000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f6e359fb000)
$ dotnet new console -o /tmp/console
The template "Console App" was created successfully.

Processing post-creation actions...
Running 'dotnet restore' on /tmp/console/console.csproj...
  Determining projects to restore...
  Restored /tmp/console/console.csproj (in 129 ms).
Restore succeeded.
$ dotnet run --project /tmp/console
Hello, World!
am11 commented 2 years ago

@tmds, point is liblttng-ust.so.1 is not supported, needs work, we would need to support both versions..

if NOT dlopen(liblttng-ust.so.0) then do what DOTNET_LTTng=0 does to disable eventing

this should fix the crash you are seeing. Besides, the crash you are seeing is same as #60407 which was closed as a duplicate.

tmds commented 2 years ago

For source-built .NET we care more about https://github.com/dotnet/runtime/issues/62398 than we do about this issue. And the fix for https://github.com/dotnet/runtime/issues/62398 should be backported to 6.0, while this feature might not.

am11 commented 2 years ago

Source build has patching mechanism in place, you can disable lttng feature at build time as discussed above.

Livius90 commented 1 month ago

Is it still not resolved after 3 years?