Closed tannergooding closed 1 month ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti See info in area-owners.md if you want to be subscribed.
Author: | tannergooding |
---|---|
Assignees: | - |
Labels: | `area-System.Diagnostics.Tracing`, `untriaged`, `blocking-outerloop`, `Known Build Error` |
Milestone: | - |
Test output indicates that the test is crashing with a seg fault and createdump fails either because the process is gone or there is a signing/entitlement issue. @hoyosjs @mikem8361
tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh [FAIL]
[createdump] Invalid process id: task_for_pid(15770) FAILED (os/kern) failure (5)
[createdump] This failure may be because createdump or the application is not properly signed and entitled.
[createdump] Failure took 0ms
[createdump] waitpid() returned successfully (wstatus 0000ff00)
/private/tmp/helix/working/A8400938/w/B79B09BE/e/tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh: line 441: 15770 Segmentation fault: 11 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"
It's an entitlement issue. The regex is looking for a string that can hit on any osx coreclr crash. Adjusting the string since the issue is also not OSX specific.
Can we disable the test until it's fixed? UPD: https://github.com/dotnet/runtime/pull/81094
This should be fixed on https://github.com/dotnet/runtime/pull/86180. Please feel free to re-activate if the problem persists.
The test was re-enabled by #104099 and started crashing intermittently again. Disabled again by #104131. Please make sure to re-enable the test when resolving this issue.
Example of a crash
/root/helix/work/correlation/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false -p System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true eventsourceerror.dll '' 0.0s: ==TEST STARTING== 1.1s: Started sending sentinel events... 1.1s: Connecting to EventPipe... 1.7s: Creating EventPipeEventSource... 1.7s: EventPipeEventSource created 1.7s: Dynamic.All callback registered 1.7s: Running optional trace validator 1.7s: Finished running optional trace validator 1.7s: Starting stream processing... 2.1s: Saw sentinel event 2.1s: Starting event generating action... 2.1s: Stopped sending sentinel events 2.1s: Stopping event generating action 2.1s: Sending StopTracing command... 2.1s: Saw new provider 'IllegalTypesEventSource' 2.1s: Saw new provider 'Microsoft-DotNETCore-EventPipe' 2.2s: Saw new provider 'Microsoft-Windows-DotNETRuntimeRundown' [createdump] Gathering state for process 362 corerun [createdump] Crashing thread 016c signal 11 (000b) [createdump] Writing crash report to file /home/helixbot/dotnetbuild/dumps/coredump.362.dmp.crashreport.json [createdump] Crash report successfully written [createdump] Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.362.dmp [createdump] Written 128049152 bytes (31262 pages) to core file [createdump] Target process is alive [createdump] Dump successfully written in 190ms Expected: 100 Actual: 139 END EXECUTION - FAILED
Stacktrace of the above crash (read of invalid GC handle or GC handle table corruption):
(gdb) bt
#0 0xf768b656 in Object::GetGCSafeMethodTable (this=0x7) at /__w/1/s/src/coreclr/vm/object.h:449
#1 Object::ValidateInner (this=0x7, bDeep=1, bVerifyNextHeader=1, bVerifySyncBlock=1)
at /__w/1/s/src/coreclr/vm/object.cpp:549
#2 0xf768c86c in Object::Validate (this=0x7, bDeep=1, bVerifyNextHeader=1, bVerifySyncBlock=1)
at /__w/1/s/src/coreclr/vm/object.cpp:529
#3 OBJECTREF::OBJECTREF (this=0xf6afeab0, pObject=0x7) at /__w/1/s/src/coreclr/vm/object.cpp:1157
#4 0xf777104c in ObjectFromHandle (handle=0xf6c1119c) at /__w/1/s/src/coreclr/vm/gchandleutilities.h:42
#5 MarshalNative::GCHandleInternalGet (handle=0xf6c1119c) at /__w/1/s/src/coreclr/vm/marshalnative.cpp:395
culprit.txt coredump.362.dmp.crashreport.json
The dump grabbed from the sigsegv crash wasn't useful
0:002> k
# Child-SP RetAddr Call Site00 00000000 00000000 0x72
0:002> .lastevent
Last event: 16a.16c: Signal SIGSEGV (Segmentation fault) code SEGV_MAPERR (Address not mapped to object) at 0x7 (first/second chance not available)
debugger time: Tue Jul 16 16:13:54.909 2024 (UTC - 4:00)
0:002> .exr -1
ExceptionAddress: 00000007
ExceptionCode: e0534947 (Linux signal)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000b
Parameter[1]: 00000001
0:002> .ecxr
Unable to get exception context, HRESULT 0x8000FFFF
From the crashreport, the crashing thread 0x16c corresponds to
Thread Id: 0x16c
Child SP IP Call Site
0xf7e19b20 0xf7b27614 libc.so.6!
0xf7e19b28 0xf7b919c0 libc.so.6!__wait4_time64
0xf7e19b90 0xf7a26254 libcoreclr.so!PROCCreateCrashDump(std::vector<char const*, std::allocator<char const*> >&, char*, int, bool) at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:2308:22
0xf7e19bd8 0xf7a27054 libcoreclr.so!PROCCreateCrashDumpIfEnabled at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:2526:9
0xf7e19c28 0xf79dbaf6 libcoreclr.so!invoke_previous_action(sigaction*, int, siginfo_t*, void*, bool) at /__w/1/s/src/coreclr/pal/src/exception/signal.cpp:422:13
0xf7e19c60 0xf79db080 libcoreclr.so!sigsegv_handler(int, siginfo_t*, void*) at /__w/1/s/src/coreclr/pal/src/exception/signal.cpp:0:5
0xf7e19c90 0xf7b371b0 libc.so.6!
which corresponds to waiting for the createdump child process to finish, which according to the logs, returned successfully waitpid() returned successfully (wstatus 00000000) WEXITSTATUS 0 WTERMSIG 0
.
From the first native thread, it looks like this crash is occuring during https://github.com/dotnet/runtime/blob/66ae90f3b7ec4f13fffcd71913eca0b45777e58c/src/tests/tracing/eventpipe/common/IpcTraceTest.cs#L307 which calls WaitAllCore
and WaitAllCoreBlocking
below
Thread Id: 0x16a
Child SP IP Call Site
0xff9a56e8 0xf7b27614 libc.so.6!
0xff9a56f0 0xf7b6347e libc.so.6!
0xff9a5728 0xf7b6355c libc.so.6!
0xff9a5738 0xf7b6570a libc.so.6!pthread_cond_wait
0xff9a57b8 0xf7a18d20 libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:489:17
0xff9a5818 0xf7a18430 libcoreclr.so!CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:0:0
0xff9a5868 0xf7a1fb92 libcoreclr.so!CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:0:45
0xff9a5950 0xf7a203fc libcoreclr.so!WaitForMultipleObjectsEx at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:204:5
0xff9a5998 0xf76c50b8 libcoreclr.so!Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode, void*) at /__w/1/s/src/coreclr/vm/threads.cpp:0:12
0xff9a5a28 0xf76bf7c2 libcoreclr.so!Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*)::$_0::operator()(Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*)::__EEParam*) const at /__w/1/s/src/coreclr/vm/threads.cpp:3058:23
(inlined by) Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) at /__w/1/s/src/coreclr/vm/threads.cpp:3073:5
0xff9a5aa0 0xf77db20c libcoreclr.so!CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) at /__w/1/s/src/coreclr/vm/synch.cpp:0:30
0xff9a5b00 0xf76c631a libcoreclr.so!Thread::Wait(CLREvent*, int, PendingSync*) at /__w/1/s/src/coreclr/vm/threads.cpp:3875:5
(inlined by) Thread::Block(int, PendingSync*) at /__w/1/s/src/coreclr/vm/threads.cpp:3829:13
0xff9a5b18 0xf76bdb4e libcoreclr.so!SyncBlock::Wait(int) at /__w/1/s/src/coreclr/vm/syncblk.cpp:0:0
0xff9a5bc8 0xf799ac16 libcoreclr.so!Monitor_Wait at /__w/1/s/src/coreclr/classlibnative/bcltype/objectnative.cpp:0:0
0xff9a5c28 0xef04c1f8 0xef04c1f8
0xff9a5c88 0xef05e692 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
0xff9a5ce8 0xef07f324 System.Threading.Tasks.Task.WaitAllBlockingCore(System.Collections.Generic.List`1<System.Threading.Tasks.Task>, Int32, System.Threading.CancellationToken)
0xff9a5d20 0xef07edea System.Threading.Tasks.Task.WaitAllCore(System.ReadOnlySpan`1<System.Threading.Tasks.Task>, Int32, System.Threading.CancellationToken)
0xff9a5d98 0xf624b9ca Tracing.Tests.Common.IpcTraceTest.Validate(Boolean)
0xff9a5e30 0xf62488c4 Tracing.Tests.Common.IpcTraceTest.RunAndValidateEventCounts(System.Collections.Generic.Dictionary`2<System.String,Tracing.Tests.Common.ExpectedEventCount>, System.Action, System.Collections.Generic.List`1<Microsoft.Diagnostics.NETCore.Client.EventPipeProvider>, Int32, System.Func`2<Microsoft.Diagnostics.Tracing.EventPipeEventSource,System.Func`1<Int32>>, Boolean)
0xff9a5e68 0xf6248234 Tracing.Tests.EventSourceError.EventSourceError.TestEntryPoint()
0xff9a5e88 0xf6248042 __GeneratedMainWrapper.Main()
Theres still a discrepancy between IpcTraceTest.cs in dotnet/diagnostics and the version in dotnet/runtime. Is the version in dotnet/diagnostics more resilient, or any other obvious low-hanging improvements we could make to IpcTraceTest.cs in dotnet/runtime to catch/prevent these crashes? @noahfalk @davmason
Re-enabled eventsourceerror on non-arm32 unix platforms in https://github.com/dotnet/runtime/pull/105033
This crash reproduced in #105127 on osx-x64.
2.1s: Saw new provider 'Microsoft-Windows-DotNETRuntimeRundown'
[createdump] Gathering state for process 18485
[createdump] Crashing thread 333de signal 11 (000b)
[createdump] Writing crash report to file /cores/coredump.18485.dmp.crashreport.json
[createdump] Crash report successfully written
[createdump] Writing minidump with heap to file /cores/coredump.18485.dmp
[createdump] Written 391995656 bytes (95702 pages) to core file
[createdump] Target process is alive
4.7s: Stopping stream processing
4.7s: Dropped 0 events
[createdump] Dump successfully written in 2562ms
Expected: 100
Actual: 139
END EXECUTION - FAILED
It is the same problem as in https://github.com/dotnet/runtime/issues/80666#issuecomment-2196110727 (read of invalid GC handle or GC handle table corruption).
failed in outerloop: runtime-coreclr outerloop/20240721.3
Failed configs:
R2R-CG2 osx x64 Checked no_tiered_compilation @ OSX.1200.Amd64.Open
coreclr osx x64 Checked no_tiered_compilation @ OSX.1200.Amd64.Open
- tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.cmd
Based solely on the console logs from the recent failures on OSX, it looks like the readerTask does complete, so it's just stopTask that is causing the crash, https://github.com/dotnet/runtime/blob/cb4d8a2f08c12e85361fec50a75582734217b603/src/tests/tracing/eventpipe/common/IpcTraceTest.cs#L301.
@davmason , do you happen to recall any recent modifications to the DiagnosticsClient EventPipeSession that could be causing crashes for specifically the eventsourceerror test?
do you happen to recall any recent modifications to the DiagnosticsClient EventPipeSession that could be causing crashes for specifically the eventsourceerror test?
This issue was opened 1.5 years ago. Based on the symptoms, it is an intermittent crash that can come and go due to unrelated changes in timing, etc.
The crash happens at this line: https://github.com/dotnet/runtime/blob/3ec6286e99524bb1a351e91d35c53e65dc053418/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs#L65
We are crashing because the GCHandle was freed already. We are freeing the GCHandle here: https://github.com/dotnet/runtime/blob/3ec6286e99524bb1a351e91d35c53e65dc053418/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs#L97 .
I do not see anything in the implementation of EventPipeInternal.DeleteProvider to guarantee there are no callback in flight when it returns. In fact, https://github.com/dotnet/runtime/blob/3ec6286e99524bb1a351e91d35c53e65dc053418/src/native/eventpipe/ep.c#L1317-L1322 explicit defers deleting the provider until the end of the tracing session. I think we may need to defer deleting of the GCHandle in a similar way.
@mdh1418 Does this make sense - could you please take it from here?
Thanks for pinpointing the crash @jkotas!
After some offline discussions with @noahfalk , I'm seeing that ETW has the behavior of blocking during EventProvider.Dispose() until all already in-flight callbacks are finished, so if any callbacks happen to be initiated after the EventListenersLock in EventProvider.Dispose(), then this p/invoke will block here at Interop.Advapi32.EventUnregister until the ETW command releases the ETW lock as described in the comment.
Would parity between ETW and EventPipe in that aspect be better than deferring deleting the GC handle, as in having EventPipeEventProvider
s Disposal blocked by in flight callbacks? Perhaps for EventPipe, we could invoke Unregister under the EventListenersLock since the original reason for not doing so was because of deadlocking with the ETW lock + EventListenersLock. (edit: since having managed locks do different behaviors for ETW/EventPipe would bring inconsistency).
I do not have an opinion. If you think that synchronous blocking would be better, it is fine with me.
EventPipe, we could invoke Unregister under the EventListenersLock
I think that the synchronous blocking would have to be in the unmanaged event pipe code. I do not see how you can use locks on the managed side to fix this.
Yeah, I want to do it unmanaged code because I want to match behavior between EtwEventProvider.Unregister() and EventPipeProvider.Unregister(). Both of them invoke a p/invoke to do the unregister work and I am recommending we give those p/invokes identical blocking behavior.
The attempt to introduce a new lock to coordinate in-flight callbacks with provider disposal broke concurrent callbacks. Still looking to block provider disposal with a weaker blocking behavior
Signal wait/set blocking implementation: https://github.com/dotnet/runtime/pull/106040
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=137937 Build error leg or test failing: Build / coreclr Pri1 Runtime Tests Run osx x64 checked / Send tests to Helix (Unix) Pull request: https://github.com/dotnet/runtime/pull/80626
Error Message
Fill the error message using known issues guidance.
Known issue validation
Build: :mag_right: Result validation: :warning: Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the "Build: :mag_right:" line. Validation performed at: 7/21/2024 2:00:18 AM UTC
Report
Summary