Open runfoapp[bot] opened 3 years ago
Runfo Tracking Issue: tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.cmd Intermittent Failure in CI

Build | Definition | Kind | Run Name | Console | Core Dump | Test Results | Run Client
---|---|---|---|---|---|---|---
1269199 | runtime | PR 55604 | CoreCLR windows x86 Checked no_tiered_compilation @ Windows.10.Amd64.Open | console.log | core dump | | runclient.py

**Build Result Summary**

Day Hit Count | Week Hit Count | Month Hit Count
---|---|---
0 | 0 | 1
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
@josalem
Did some preliminary triage on this I wanted to document:
All the failures are on win-x64 and present with the following failure in the logs:
```
4.9s: ::== Running test: TEST_SuspendDefaultPort
5.0s: running sub-process: C:\h\w\C3B00A3D\p\CoreRun.exe C:\h\w\C3B00A3D\w\A43B08DC\e\tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.dll 0
5.0s: subprocess started: True
5.0s: subprocess PID: 5428
5.0s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
5.3s: Starting EventPipeSession over standard connection
5.3s: Started EventPipeSession over standard connection with session id: 0x1b3a7d660c0
5.3s: Runtime HAS NOT resumed (expects: true): True
5.3s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=4; CommandId=1; Reserved=0 }; }
5.5s: Creating EventPipeEventSource
5.6s: Received: { Header={ Magic=System.Byte[]; Size=24; CommandSet=255; CommandId=0; Reserved=0 }; Payload=[ 0x00 0x00 0x00 0x00 ]}
5.6s: Waiting for EEStartupStarted event
5.8s: Created EventPipeEventSource
5.8s: Starting processing
```
The first failure is from October 29th, 2020 which is a full 2 months after that test was added and the day after the new managed thread pool was merged in.
I've had 3 instances of `TEST_SuspendDefaultPort` running in a loop ~~for several hours~~ overnight with processor affinity set to 0x1. I haven't been able to reproduce the issue locally.
This test is specifically checking the functionality of setting `DOTNET_SuspendDefaultPort=1`, which causes the default listen Diagnostic Port to be set to suspend mode. This suspends runtime startup in `EEStartupHelper` and waits for a ResumeRuntime command to come in on the default Port. The test sets up an `EventPipeEventSource` that watches for the `EEStartupStarted` event (the first event fired in the runtime) and sets a `ManualResetEvent`.

The logs indicate that the ResumeRuntime command was sent and the runtime returned the OK response, but the MRE hasn't been set, meaning we haven't seen the event yet.
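The synchronization the test relies on can be sketched as follows. This is a minimal Python analog (the actual test is C# and uses `EventPipeEventSource` with a `ManualResetEvent`); the names `on_event` and `startup_seen` are illustrative, not from the test code:

```python
import threading

startup_seen = threading.Event()  # stand-in for the C# ManualResetEvent

def on_event(event_name):
    # The real test registers a callback on EventPipeEventSource;
    # this stand-in only checks the event name.
    if event_name == "EEStartupStarted":
        startup_seen.set()

# Simulate the event arriving from the trace stream.
on_event("EEStartupStarted")

# The failure mode in this issue is that this wait never completes:
# ResumeRuntime returns OK, but the event is never dispatched.
assert startup_seen.wait(timeout=5)
```

The key point is that the test can only pass if the reader actually dispatches the event to its callback; receiving the bytes is not enough.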
That means there are two possible errors happening:

- The `EEStartupStarted` event isn't being fired
- The runtime isn't actually acting on the `ResumeRuntime` command

There are dumps of both the parent and child processes from these failures, but neither seems to be in good condition. When I open them in windbg I see suspicious stacks:
I think there may be issues with the DAC in the artifacts from the build I got the dumps from. Opening them in VS gives a better view, but the main thread stack is truncated:
This is the child process, which should have sent the `EEStartupStarted` event. The parallel stacks view in VS implies the runtime did in fact resume. The main thread is not in `EEStartupHelper`, which is where it would be stuck if the runtime didn't resume. Instead, it is in `ReadFile` with what appears to be a truncated managed stack under it. My hunch is that this is the top of the call to `Console.ReadLine` from the test code: https://github.com/dotnet/runtime/blob/6f29f0be4f699d6aacde916e3ce15ff0f90c48ea/src/tests/tracing/eventpipe/diagnosticport/diagnosticport.cs#L384
The evidence points to the `EEStartupStarted` event sometimes not firing. I'm not sure how that could be the case just yet, but I'm inclined to think the managed thread pool might have changed things somehow. This test started failing intermittently right after that was merged in. I'll focus further investigation on that hypothesis in the hopes that I can reproduce it locally.
I've left the test running in three consoles restricted to one core for >6 days and the issue has not reproduced locally. Since we are getting core dumps from the failure in CI, I'll look at turning on StressLog in these tests to see if I can get more information.
Managed to validate with a different dump that the test is indeed waiting inside the `Console.ReadLine()` from https://github.com/dotnet/runtime/blob/a7c31798739670fae93e962d1158439a83a118cf/src/tests/tracing/eventpipe/diagnosticport/diagnosticport.cs#L384
```
0:000> !clrstack
OS Thread Id: 0x1c5c (0)
        Child SP               IP Call Site
000000959357CFA8 00007ffe324c5b04 [InlinedCallFrame: 000000959357cfa8] Interop+Kernel32.ReadFile(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357CFA8 00007ffda4a5e16a [InlinedCallFrame: 000000959357cfa8] Interop+Kernel32.ReadFile(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357CF70 00007ffda4a5e16a ILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357D0B0 00007ffda4a61d1a System.ConsolePal+WindowsConsoleStream.ReadFileNative(IntPtr, System.Span`1, Boolean, Int32 ByRef, Boolean) [/_/src/libraries/System.Console/src/System/ConsolePal.Windows.cs @ 1165]
000000959357D170 00007ffda4a61bc7 System.ConsolePal+WindowsConsoleStream.Read(System.Span`1) [/_/src/libraries/System.Console/src/System/ConsolePal.Windows.cs @ 1123]
000000959357D1F0 00007ffda4a61a90 System.IO.ConsoleStream.Read(Byte[], Int32, Int32) [/_/src/libraries/System.Console/src/System/IO/ConsoleStream.cs @ 34]
000000959357D260 00007ffe027d534c System.IO.StreamReader.ReadBuffer()
000000959357D2C0 00007ffe027d581d System.IO.StreamReader.ReadLine()
000000959357D310 00007ffda4a6185f System.IO.SyncTextReader.ReadLine() [/_/src/libraries/System.Console/src/System/IO/SyncTextReader.cs @ 76]
000000959357D370 00007ffda4a5c26d diagnosticport.dll!Unknown
000000959357D430 00007ffda4a59982 System.Private.CoreLib.dll!Unknown
000000959357D490 00007ffda4a598f9 diagnosticport.dll!Unknown
000000959357D4E0 00007ffda4a59861 diagnosticport.dll!Unknown
```
Since we're in managed code, that means the runtime resumed already. This implies that the EEStartupStarted event was either not fired or not received.
Added `blocking-clean-ci` as I just hit it on a PR.
@josalem it seems like you're actively investigating; if you think the fix will still take some time (more than a couple of hours), can we disable the test in the meantime?
I'm hesitant to turn it off since it only seems to happen in CI, but we can if it is blocking things. I can disable it in the morning. I'm actively looking into this one.
I went digging through the session state for the app and found it in a weird state:
Ignore the `m_numEventsWritten` value; it isn't modified anywhere in the code, so it should be 0.

The odd thing is that `m_numEventsStored` is set to 2, indicating that 2 events were written and neither required a new buffer to be allocated. Presumably there was a buffer there, because when I dig further, I find that the `EventPipeSessionState` shows an empty `BufferList`, but the `m_LastReadSequenceNumber` is 2:
Looking into the buffer manager logic, we should only ever deallocate buffers if it is empty or the session ended.
Looking at a couple of instances of the issue, the symptom appears to be the same. The runtime is successfully resumed and the `EEStartupStart` event is fired, but the `EventPipeEventSource` in the test harness never sees it. This only happens sometimes on Windows x64 Checked and only since October 29th, 2 months after the test went in.
CC @sywhang
@safern I will follow up with a PR to disable this test now. @josalem lmk if you need help investigating it.
Jotting down a few notes:

- `EventPipeEventSource` knows the writer app's PID, and `EventPipeEventSource..ctor()` returned, meaning that the trace header was at least received. It is difficult to determine what else has been received. The `EventCache` is empty, but the `StackCache` isn't. The reader also saw the `MetadataBlock` containing the metadata for the event we expect to see. The internal stream reader seems to be partway through reading an `EventBlock` at the time of the dump. It is blocked in a read on the Named Pipe. The runtime shouldn't be sending partial `EventBlock`s, so this might indicate an angle to investigate. It is possible that the `EventBlock` being read contains the event we are blocked waiting on.
- `EventPipeBuffer*` logic tripping and not performing certain actions like flushing, tagging events as oldest, or sending sequence points. Any of those 3 failing to happen could cause the reader to not dispatch an event even if it was received.

A little more progress.
The `EventPipeEventSource` is in the middle of reading an `EventBlock` from the stream. According to the internal variables for the reader, it has already read the entirety of the `EventBlock` into the buffer. Now it is attempting to do a 6 byte read to get to an aligned address.
I looked in the internal buffer for the reader and decoded the `EventBlock`. It contains the event the test is waiting for:
```
0:007> db 000001920011a068 L31
00000192`0011a068  14 00 01 00 4d 47 e6 dc-04 00 00 00 66 97 34 dd  ....MG......f.4.
00000192`0011a078  04 00 00 00 cf 01 00 dc-38 ff ff ff ff 0f dc 38  ........8......8
00000192`0011a088  01 cd 8e 99 e7 4d 02 01-00 01 02 99 a0 b9 02 01  .....M..........
00000192`0011a098  00
```

```
Header Size: 0x0014 (0n20)
Flags: 0x0001 (0n1) (Variable Length Encoded Compressed Headers)
MinTimestamp: 0x00000004dce6474d (0n20_885_948_237)
MaxTimestamp: 0x00000004dd349766 (0n20_891_080_550)

Event with compression:
  Header:
    Flags byte: 0xcf (1100`1111)
    MetadataID: 0x01 (This is EEStartupStart)
    Sequence#: 0x00 (increment by 1) => 1
    CaptureTID: 0xdc (1101 1100) + 0x38 (0011 1000) ==> 0x1c5c (0001 1100 0101 1100)
    ProcessNum: ff ff ff ff 0f => 1111`1111 1111`1111 1111`1111 1111`1111 (0n4294967295 => -1)
    ThreadID: dc 38 => 0x1c5c
    StackID: 01
    TimeStamp: cd 8e 99 e7 4d => 10011011100111001100100011101001101 => 20885948237
    ActivityID: N/A
    RelActID: N/A
    IsSorted: (0xcf & 64) => 1
    PayloadSize: 0x2
  EventPayload:
    0x0001

Event with compression:
  Header:
    Flag: 0x01
    MetadataID: 0x02 (Didn't validate, but this is probably EEStartupEnd)
    SequenceNum: previous + 1 => 2
    CaptureTID: same as previous => 0x1c5c
    ProcessNum: same as previous => -1
    TID: same as previous => 0x1c5c
    StackId: same as previous => 0x01
    TimeStamp: 99 a0 b9 02 (didn't convert)
    ActivityID: N/A
    RelActID: N/A
    PayloadSize: 0x01
  EventPayload:
    0x00
```
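For reference, the multi-byte fields in these compressed headers decode as ULEB128-style varints (7 payload bits per byte, low bits first, high bit as continuation). A small Python sketch (my own helper, not code from TraceEvent or the runtime) reproduces the values decoded above:

```python
def read_varint(data):
    """Decode a ULEB128-style variable-length integer from a byte sequence."""
    value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift  # 7 payload bits per byte, low first
        if not (byte & 0x80):            # high bit clear => last byte
            break
        shift += 7
    return value

# CaptureTID bytes from the dump: dc 38 -> 0x1c5c
print(hex(read_varint([0xDC, 0x38])))                     # 0x1c5c
# ProcessNum bytes: ff ff ff ff 0f -> 0xffffffff (-1 as 32 bits)
print(hex(read_varint([0xFF, 0xFF, 0xFF, 0xFF, 0x0F])))   # 0xffffffff
```

This variable-length encoding is exactly why block sizes are non-deterministic, which matters for the alignment issue discussed below in the thread.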
This means that the reader is going to be stuck in this 6 byte read indefinitely because the writer app won't be sending any more events/data. This also means that the received events won't get dispatched, which means our test will remain waiting for them.
Why is this intermittent? `EventBlob`s are variable length encoded, meaning that a given sequence of events will have a non-deterministic size. More importantly, this means that while an `EventPipeBlock` might start with padding to be aligned, it won't necessarily end aligned. Specifically, the length of the `Block` isn't guaranteed to be 4 or 8 byte aligned, the latter of which is expected by `EventPipeEventSource`.
Why did it not start failing till Oct 29th? This one is harder to rationalize with this hypothesis. It is possible that a difference in TSC speeds or values between Intel and AMD caused the `EventBlock`s to be problematic lengths more often, but that would be difficult to prove. It is also possible that the test was intermittently failing before Oct 29th and the AzDO data simply wasn't reflecting that. In either case, I think the timing of this test's failures might be a red herring. I'm inclined to think it was the former of those two guesses.
I am fairly confident that this means the runtime is actually behaving correctly. The StressLogs show that the server is in a good state at the timeout and that the runtime resumed correctly. The dump shows that EventPipe correctly sent the 2 expected events and is sleeping until more events are sent. The reader app has the full `EventBlock` containing these events.

The issue is that the events never get dispatched due to the attempt to get to an 8 byte aligned offset into the internal buffer. I believe the issue is in TraceEvent and not in the runtime.
This is still a hypothesis, but my confidence in this is growing. Especially because the logic we are blocked in has the following comment on it:
```csharp
// PERF policy
// In the streaming (non-seekable) case we don't want to buffer any more data than was
// requested and needed for alignment because this might cause the thread to block waiting
// for the unneeded data to arrive. There is probably a better way to do this that can
// oportunistically buffer if the data is available but this code isn't that sophisticated
// yet.
//
// In the non-streaming (seekable) case we do want to buffer because that lets the
// reader achieve higher throughput.
int fillSize = inputStream.CanSeek ? bytes.Length : (position + minimum + (align-1)) & ~(align-1);
```
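A Python transcription of that fill-size computation shows the failure mode. The function below mirrors the C# expression; the parameter names match the snippet, but the concrete values are illustrative rather than taken from the dump:

```python
def fill_size(position, minimum, align, can_seek, buffer_len):
    """Mirror of the TraceEvent fillSize expression quoted above."""
    if can_seek:
        return buffer_len
    # Round (position + minimum) up to the next multiple of `align`.
    return (position + minimum + (align - 1)) & ~(align - 1)

# Non-seekable stream: if the reader sits 2 bytes past an 8-byte
# boundary (position 0x1A), reaching the next aligned offset requires
# 6 more bytes -- bytes the writer may never send, so the read blocks.
extra = fill_size(position=0x1A, minimum=0, align=8,
                  can_seek=False, buffer_len=4096) - 0x1A
print(extra)  # 6
```

So even though the events are fully buffered, the reader won't dispatch them until those alignment padding bytes arrive, and in this test they never do.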
I think I could validate this by splicing together a trace by hand that has an `EventBlock`(s) that causes `EventPipeEventSource` to get into this state.
CC @brianrob @sywhang @noahfalk @tommcdon
Confirmed my hypothesis about the cause of the hang. I doctored a nettrace stream to have an EventBlock with a size not divisible by 8. Feeding this to `EventPipeEventSource` causes it to hang at the spot in my previous comment, waiting for more data before doing the event dispatch. The reason this behavior started abruptly is still unexplained, but seems unimportant. Regardless of the reason, the fix will need to happen in TraceEvent.
Failures since the bandaid fix all appear unrelated and are due to PRs testing linking (which fails the test, since it relies on `EventSource`).
The new failure is not a recurrence of the original issue. It is a failure in a PR with changes to floating point number parsing.