dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.91k stars 4.64k forks source link

EventListenerThreadPool test failing in nativeaot outerloop runs #105556

Closed MichalStrehovsky closed 2 weeks ago

MichalStrehovsky commented 1 month ago
BEGIN EXECUTION
/datadisks/disk1/work/A6FD09A7/p/nativeaottest.sh /datadisks/disk1/work/A6FD09A7/w/A12B08E5/e/tracing/eventlistener/EventListenerThreadPool/ EventListenerThreadPool.dll ''
Test Failed: Did not see all of the expected events.
ThreadPoolIOPack: 0
ThreadPoolIOEnqueue: 0
ThreadPoolIODequeue: 0
Xunit.Sdk.EqualException: Assert.Equal() Failure: Values differ
Expected: 100
Actual:   -1
   at Xunit.Assert.Equal[T](T, T, IEqualityComparer`1) + 0x1a1
   at __GeneratedMainWrapper.Main() + 0x38
Expected: 100
Actual: 101
END EXECUTION - FAILED

This is consistently failing in the Pri0 outerloop runs on all architectures.

Last good run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=745467&view=results First bad run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=746036&view=results

From this I suspected https://github.com/dotnet/runtime/pull/103675 and a revert PR at #105543 confirmed the suspicion (notice Pri0 legs are green in the revert PR). Note that PR reverts two commits because there was a merge conflict I didn't bother resolving, but the other commit is from yesterday and this is failing for a week.

Cc @eduardo-vp @kouvel

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

MichalStrehovsky commented 1 month ago

Disabling the test in #105557 to get clean(er) outerloops again.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

davmason commented 3 weeks ago

I helped @eduardo-vp looked in to this failure, it doesn't seem like it is specific to the event he added.

@LakshanF is this scenario (using an EventListener to listen to native runtime events) expected to work under NativeAOT?

When I debugged the failing test under NativeAOT I saw it fail in this callstack:

    EventListenerThreadPool.exe!S_P_CoreLib_System_Diagnostics_Tracing_EventPipePayloadDecoder__DecodePayload() Line 18 Unknown
    EventListenerThreadPool.exe!S_P_CoreLib_System_Diagnostics_Tracing_NativeRuntimeEventSource__ProcessEvent() Line 60 Unknown
    EventListenerThreadPool.exe!S_P_CoreLib_System_Diagnostics_Tracing_EventPipeEventDispatcher__DispatchEventsToEventListeners() Line 169  Unknown
    EventListenerThreadPool.exe!S_P_CoreLib_System_Diagnostics_Tracing_EventPipeEventDispatcher___c__DisplayClass12_0___StartDispatchTask_b__0() Line 139   Unknown
    EventListenerThreadPool.exe!S_P_CoreLib_System_Threading_ExecutionContext__RunInternal() Line 179   Unknown
    EventListenerThreadPool.exe!S_P_CoreLib_System_Threading_Tasks_Task__ExecuteWithThreadLocal() Line 2342 Unknown
    EventListenerThreadPool.exe!S_P_CoreLib_System_Threading_Thread__StartThread() Line 446 Unknown
    EventListenerThreadPool.exe!S_P_CoreLib_System_Threading_Thread__ThreadEntryPoint() Line 226    Unknown
    kernel32.dll!BaseThreadInitThunk()  Unknown
    ntdll.dll!00007ffc2e58af28()    Unknown

The failure is because m_eventData is not populated for NativeRuntimeEventSource and we use m_eventData as a template for how to parse the appropriate events coming from the runtime.

I was able to debug through and see the code for creating the metadata seems to be running as expected, except we don't actually have the metadata in memory after. The code in question is EventSource.CreateManifestAndDescriptors. We call AddEventDescriptor for the events here, but the change doesn't seem to actually happen to the original array after returning, even though it is passed by ref.

@MichalStrehovsky are you able to help determine if this is a bad codegen issue or some other? Maybe the debugger is not showing the right memory?

MichalStrehovsky commented 2 weeks ago

Since this reproes on all architectures and all OSes, I'd consider a codegen issue less likely.

We run many of the tracing tests in the src/tests/tracing suite so my assumption is that this works in general but then again I would not be able to tell a difference between an event source and a metric, so take it with a grain of salt. @LakshanF knows most of native AOT specific aspects of this and any known issues.

jkotas commented 2 weeks ago

I was able to debug through and see the code for creating the metadata seems to be running as expected

The code is not creating the metadata as expected.

The problem is that we are missing the metadata for the native-only events in NativeAOT CoreLib. #106641 has the fix.