dotnet / diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.

tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.cmd Intermittent Failure in CI #2044

Open runfoapp[bot] opened 3 years ago

runfoapp[bot] commented 3 years ago
Runfo Tracking Issue: tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.cmd Intermittent Failure in CI

| Build | Definition | Kind | Run Name | Console | Core Dump | Test Results | Run Client |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1269199 | runtime | PR 55604 | CoreCLR windows x86 Checked no_tiered_compilation @ Windows.10.Amd64.Open | console.log | core dump | | runclient.py |

Build Result Summary

| Day Hit Count | Week Hit Count | Month Hit Count |
| --- | --- | --- |
| 0 | 0 | 1 |
Dotnet-GitSync-Bot commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 3 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

tommcdon commented 3 years ago

@josalem

josalem commented 3 years ago

Did some preliminary triage on this I wanted to document:

All the failures are on win-x64 and present with the following failure in the logs:

          4.9s: ::== Running test: TEST_SuspendDefaultPort
          5.0s: running sub-process: C:\h\w\C3B00A3D\p\CoreRun.exe C:\h\w\C3B00A3D\w\A43B08DC\e\tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.dll 0
          5.0s: subprocess started: True
          5.0s: subprocess PID: 5428
          5.0s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
          5.3s: Starting EventPipeSession over standard connection
          5.3s: Started EventPipeSession over standard connection with session id: 0x1b3a7d660c0
          5.3s: Runtime HAS NOT resumed (expects: true): True
          5.3s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=4; CommandId=1; Reserved=0 }; }
          5.5s: Creating EventPipeEventSource
          5.6s: Received: { Header={ Magic=System.Byte[]; Size=24; CommandSet=255; CommandId=0; Reserved=0 }; Payload=[ 0x00 0x00 0x00 0x00  ]}
          5.6s: Waiting for EEStartupStarted event
          5.8s: Created EventPipeEventSource
          5.8s: Starting processing

The first failure is from October 29th, 2020, which is a full two months after that test was added and the day after the new managed thread pool was merged in.

I've had 3 instances of TEST_SuspendDefaultPort running in a loop ~~for several hours~~ overnight with processor affinity set to 0x1. I haven't been able to reproduce the issue locally.

This test is specifically checking the functionality of setting DOTNET_SuspendDefaultPort=1, which causes the default listen Diagnostic Port to be set to suspend mode. This suspends runtime startup in EEStartupHelper and waits for a ResumeRuntime command to come on the default Port. The test sets up an EventPipeEventSource that watches for the EEStartupStarted event (the first event fired in the runtime) and sets a ManualResetEvent.
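For context, here is a rough sketch of that harness pattern. This is not the real test code (the actual implementation is in diagnosticport.cs in dotnet/runtime); the provider, keyword, and timeout values are placeholders chosen for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Threading;
using Microsoft.Diagnostics.NETCore.Client;
using Microsoft.Diagnostics.Tracing;

class SuspendDefaultPortSketch
{
    static void Main(string[] args)
    {
        // Launch the child with the suspend knob described above so it blocks in
        // EEStartupHelper until it receives a ResumeRuntime command.
        var psi = new ProcessStartInfo(args[0]);            // placeholder: path to the child app
        psi.Environment["DOTNET_SuspendDefaultPort"] = "1";
        using Process child = Process.Start(psi);

        var client = new DiagnosticsClient(child.Id);

        // Start an EventPipe session over the default (standard) connection before resuming.
        using EventPipeSession session = client.StartEventPipeSession(new[]
        {
            // Provider/keyword are illustrative; the real test subscribes to the
            // provider that carries the EEStartupStart event.
            new EventPipeProvider("Microsoft-Windows-DotNETRuntimePrivate", EventLevel.Verbose, -1)
        });

        var sawStartup = new ManualResetEvent(false);
        var source = new EventPipeEventSource(session.EventStream);
        source.Dynamic.All += e =>
        {
            if (e.EventName.Contains("EEStartupStart"))
                sawStartup.Set();
        };
        new Thread(() => source.Process()).Start();

        // Resume the runtime; the child should now fire EEStartupStart and continue.
        client.ResumeRuntime();

        // The failure mode in this issue: the event is written by the child but
        // never dispatched by the source, so this wait times out.
        bool ok = sawStartup.WaitOne(TimeSpan.FromSeconds(30));
        Console.WriteLine($"Saw EEStartupStart: {ok}");
        // Teardown (signalling the child to exit) omitted.
    }
}
```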

The logs indicate that the ResumeRuntime command was sent and the runtime returned the OK response, but the MRE hasn't been set, meaning we haven't seen the event yet.

That means there are two possible errors happening:

  1. The EEStartupStarted event isn't being fired
  2. The runtime didn't actually resume after the ResumeRuntime command

There are dumps of both the parent and child processes from these failures, but neither seems to be in good condition.

When I open them in windbg, I see suspicious stacks: [screenshot]

I think there may be issues with the DAC in the artifacts from the build I got the dumps from.

Opening them in VS gives a better view, but the main thread stack is truncated: [screenshot]

This is the child process, which should have sent the EEStartupStarted event. The parallel stacks view in VS implies the runtime did in fact resume. The main thread is not in EEStartupHelper, which is where it would be stuck if the runtime didn't resume. Instead, it is in ReadFile with what appears to be a truncated managed stack under it. My hunch is this is the top of the call to Console.ReadLine from the test code: https://github.com/dotnet/runtime/blob/6f29f0be4f699d6aacde916e3ce15ff0f90c48ea/src/tests/tracing/eventpipe/diagnosticport/diagnosticport.cs#L384
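For reference, a minimal sketch of what that child-process pattern looks like (placeholder logic, not the code at the link above):

```csharp
using System;

class ChildSketch
{
    static int Main()
    {
        // By the time managed code runs here, the runtime has already resumed
        // and fired EEStartupStart; the subprocess then just parks on stdin
        // until the parent signals it to exit, which matches the ReadFile frame
        // at the top of the dumped stack.
        Console.WriteLine("ready");   // parent reads this to know the child is up
        Console.ReadLine();           // blocks in Kernel32!ReadFile
        return 100;                   // conventional pass exit code in runtime tests
    }
}
```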

The evidence points to the EEStartupStarted event sometimes not firing. I'm not sure how that could be the case just yet, but I'm inclined to think the managed threadpool might have changed things somehow. This test started failing intermittently right after that was merged in. I'll focus further investigation on that hypothesis with the hopes that I can reproduce it locally.

josalem commented 3 years ago

I've left the test running in three consoles restricted to one core for >6 days and the issue has not reproduced locally. Since we are getting core dumps from the failure in CI, I'll look at turning on StressLog in these tests to see if I can get more information.
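For reference, a sketch of turning StressLog on for the child process, assuming the standard CoreCLR stress-log knobs (COMPlus_StressLog and friends); the size, facility, and level values below are illustrative only, not what the CI change ended up using:

```csharp
using System.Diagnostics;

// Launch the test subprocess with StressLog enabled so a later dump carries
// the runtime's in-memory log. Command line and values are illustrative.
var psi = new ProcessStartInfo("CoreRun.exe", "diagnosticport.dll 0");
psi.Environment["COMPlus_StressLog"] = "1";
psi.Environment["COMPlus_StressLogSize"] = "0x1000000";  // per-thread buffer size
psi.Environment["COMPlus_LogFacility"] = "0xFFFFFFFF";   // illustrative: all facilities
psi.Environment["COMPlus_LogLevel"] = "0xA";             // illustrative verbosity
using var child = Process.Start(psi);
```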

josalem commented 3 years ago

Managed to validate with a different dump that the test is indeed waiting inside the Console.ReadLine() from https://github.com/dotnet/runtime/blob/a7c31798739670fae93e962d1158439a83a118cf/src/tests/tracing/eventpipe/diagnosticport/diagnosticport.cs#L384

0:000> !clrstack
OS Thread Id: 0x1c5c (0)
        Child SP               IP Call Site
000000959357CFA8 00007ffe324c5b04 [InlinedCallFrame: 000000959357cfa8] Interop+Kernel32.ReadFile(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357CFA8 00007ffda4a5e16a [InlinedCallFrame: 000000959357cfa8] Interop+Kernel32.ReadFile(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357CF70 00007ffda4a5e16a ILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357D0B0 00007ffda4a61d1a System.ConsolePal+WindowsConsoleStream.ReadFileNative(IntPtr, System.Span`1, Boolean, Int32 ByRef, Boolean) [/_/src/libraries/System.Console/src/System/ConsolePal.Windows.cs @ 1165]
000000959357D170 00007ffda4a61bc7 System.ConsolePal+WindowsConsoleStream.Read(System.Span`1) [/_/src/libraries/System.Console/src/System/ConsolePal.Windows.cs @ 1123]
000000959357D1F0 00007ffda4a61a90 System.IO.ConsoleStream.Read(Byte[], Int32, Int32) [/_/src/libraries/System.Console/src/System/IO/ConsoleStream.cs @ 34]
000000959357D260 00007ffe027d534c System.IO.StreamReader.ReadBuffer()
000000959357D2C0 00007ffe027d581d System.IO.StreamReader.ReadLine()
000000959357D310 00007ffda4a6185f System.IO.SyncTextReader.ReadLine() [/_/src/libraries/System.Console/src/System/IO/SyncTextReader.cs @ 76]
000000959357D370 00007ffda4a5c26d diagnosticport.dll!Unknown
000000959357D430 00007ffda4a59982 System.Private.CoreLib.dll!Unknown
000000959357D490 00007ffda4a598f9 diagnosticport.dll!Unknown
000000959357D4E0 00007ffda4a59861 diagnosticport.dll!Unknown

Since we're in managed code, that means the runtime resumed already. This implies that the EEStartupStarted event was either not fired or not received.

safern commented 3 years ago

Added blocking-clean-ci as I just hit it on a PR.

@josalem it seems like you're actively investigating; if you think the fix will still take some time (more than a couple of hours), can we disable the test in the meantime?

josalem commented 3 years ago

I'm hesitant to turn it off since it only seems to happen in CI, but we can if it is blocking things. I can disable it in the morning. I'm actively looking into this one.


I went digging through the session state for the app and it's in a weird state: [screenshot]

Ignore the m_numEventsWritten value; it isn't modified anywhere in the code, so it should be 0.

The odd thing is that m_numEventsStored is set to 2, indicating that 2 events were written and neither required a new buffer to be allocated. Presumably a buffer existed at some point, because digging further I find that the EventPipeSessionState shows an empty BufferList while m_LastReadSequenceNumber is 2: [screenshot]

Looking into the buffer manager logic, we should only ever deallocate a buffer if it is empty or the session has ended.

Looking at a couple of instances of the issue, the symptom appears to be the same. The runtime is successfully resumed and the EEStartupStart event is fired, but the EventPipeEventSource in the test harness never sees it. This only happens sometimes on Windows x64 Checked and only since October 29th, two months after the test went in.

CC @sywhang

sywhang commented 3 years ago

@safern I will follow up with a PR to disable this test now. @josalem lmk if you need help investigating it.

josalem commented 3 years ago

Jotting down a few notes:

josalem commented 3 years ago

A little more progress.

The EventPipeEventSource is in the middle of reading an EventBlock from the stream. According to the internal variables for the reader, it has already read the entirety of the EventBlock into the buffer. Now it is attempting to do a 6 byte read to get to an aligned address.

I looked in the internal buffer for the reader and decoded the EventBlock. It contains the event the test is waiting for:

0:007> db 000001920011a068 L31
00000192`0011a068  14 00 01 00 4d 47 e6 dc-04 00 00 00 66 97 34 dd  ....MG......f.4.
00000192`0011a078  04 00 00 00 cf 01 00 dc-38 ff ff ff ff 0f dc 38  ........8......8
00000192`0011a088  01 cd 8e 99 e7 4d 02 01-00 01 02 99 a0 b9 02 01  .....M..........
00000192`0011a098  00    

Header Size:  0x0014 (0n20)
Flags:        0x0001 (0n1) (Variable Length Encoded Compressed Headers)
MinTimestamp: 0x00000004dce6474d (0n20_885_948_237)
MaxTimestamp: 0x00000004dd349766 (0n20_891_080_550)

Event with compression:
 Header:
  Flags byte:  0xcf (1100`1111)
  MetadataID:  0x01 (This is EEStartupStart)
  Sequence#:   0x00 (increment by 1) => 1
  CaptureTID:  0xdc (1101 1100) + 0x38 (0011 1000) ==> 0x1c5c (0001 1100 0101 1100)
  ProcessNum:  ff ff ff ff 0f => 1111`1111 1111`1111 1111`1111 1111`1111 (0n4294967295 => -1)
  ThreadID:    dc 38 => 0x1c5c
  StackID:     01
  TimeStamp:   cd 8e 99 e7 4d => 10011011100111001100100011101001101 => 20885948237
  ActivityID:  N/A
  RelActID:    N/A
  IsSorted:    (0xcf & 64) => 1
  PayloadSize: 0x2
 EventPayload:
  0x0001

Event with compression:
 Header:
  Flag:        0x01
  MetadataID:  0x02 (Didn't validate, but this is probably EEStartupEnd)
  SequenceNum: previous + 1 => 2
  CaptureTID:  same as previous => 0x1c5c
  ProcessNum:  same as previous => -1
  TID:         same as previous => 0x1c5c
  StackId:     same as previous => 0x01
  TimeStamp:   99 a0 b9 02 (didn't convert)
  ActivityID:  N/A
  RelActID:    N/A
  PayloadSize: 0x01
 EventPayload:
  0x00
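For reference, the multi-byte fields above (CaptureTID, ProcessNum, thread id, timestamps) are 7-bit variable-length encoded. A minimal decoder sketch of that scheme (my own helper for illustration, not code from the reader):

```csharp
using System;

class VarUIntSketch
{
    // Decode a nettrace-style varuint: 7 data bits per byte, low bits first,
    // high bit set means another byte follows.
    static ulong ReadVarUInt(byte[] buf, ref int pos)
    {
        ulong value = 0;
        int shift = 0;
        byte b;
        do
        {
            b = buf[pos++];
            value |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    static void Main()
    {
        int pos = 0;
        // "dc 38" from the CaptureTID field above => 0x1c5c
        Console.WriteLine($"0x{ReadVarUInt(new byte[] { 0xdc, 0x38 }, ref pos):x}");
        pos = 0;
        // "ff ff ff ff 0f" from the ProcessNum field above => 0xffffffff (-1 as int32)
        Console.WriteLine($"0x{ReadVarUInt(new byte[] { 0xff, 0xff, 0xff, 0xff, 0x0f }, ref pos):x}");
    }
}
```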

This means that the reader is going to be stuck in this 6 byte read indefinitely because the writer app won't be sending any more events/data. This also means that the received events won't get dispatched which means our test will remain waiting for them.

Why is this intermittent? EventBlobs are variable length encoded, meaning that a given sequence of events will have a non-deterministic size. More importantly, this means that while an EventPipeBlock might start with padding to be aligned, it won't necessarily end aligned. Specifically, the length of the Block isn't guaranteed to be 4 or 8 byte aligned, the latter of which is expected by EventPipeEventSource.

Why did it not start failing till Oct 29th? This one is harder to rationalize with this hypothesis. It is possible that a difference in TSC speeds or values between Intel and AMD caused the EventBlocks to be problematic lengths more often, but that would be difficult to prove. It is also possible that the test was intermittently failing from before Oct 29th and the AzDO data simply wasn't reflecting that. In either case, I think the timing of this test failing might be a red herring. I'm inclined to think it was the former of those two guesses.

I am fairly confident that this means the runtime is actually behaving correctly. The StressLogs show that the server is in a good state at the timeout and that the runtime resumed correctly. The dump shows that EventPipe correctly sent the 2 expected events and is sleeping until more events are sent. The reader app has the full EventBlock containing these events.

The issue is that the events never get dispatched due to the attempt to advance to an 8-byte-aligned offset in the internal buffer. I believe the issue is in TraceEvent and not in the runtime.

This is still a hypothesis, but my confidence in this is growing. Especially because the logic we are blocked in has the following comment on it:

                // PERF policy
                // In the streaming (non-seekable) case we don't want to buffer any more data than was
                // requested and needed for alignment because this might cause the thread to block waiting 
                // for the unneeded data to arrive. There is probably a better way to do this that can
                // oportunistically buffer if the data is available but this code isn't that sophisticated
                // yet.
                //
                // In the non-streaming (seekable) case we do want to buffer because that lets the
                // reader achieve higher throughput.
                int fillSize = inputStream.CanSeek ? bytes.Length : (position + minimum + (align-1)) & ~(align-1);
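To make that concrete, here is a small standalone illustration of the rounding in the snippet above. The offset is made up to reproduce the 6-byte gap mentioned earlier; it is not a value taken from the dump.

```csharp
using System;

class AlignSketch
{
    static void Main()
    {
        const int align = 8;
        int position = 42;   // illustrative: current unaligned offset in the reader's buffer
        int minimum  = 0;    // the buffered EventBlock is already complete; nothing more is needed

        // Same rounding as the TraceEvent snippet above: round (position + minimum)
        // up to the next multiple of 'align'.
        int fillSize = (position + minimum + (align - 1)) & ~(align - 1);

        Console.WriteLine($"position={position}, fillSize={fillSize}, pad bytes still expected={fillSize - position}");
        // Prints: position=42, fillSize=48, pad bytes still expected=6
        // If the writer has nothing more to send (here the child already emitted its
        // only two events), the read for those 6 pad bytes never completes and the
        // buffered events are never dispatched.
    }
}
```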

I think I could validate this by splicing together a trace by hand that has an EventBlock(s) that causes EventPipeEventSource to get in this state.

CC @brianrob @sywhang @noahfalk @tommcdon

josalem commented 3 years ago

Confirmed my hypothesis about the cause of the hang. I doctored a nettrace stream to have an EventBlock with a size not divisible by 8. Feeding this to EventPipeEventSource causes it to hang at the spot in my previous comment, waiting for more data before doing the event dispatch. The reason this behavior started abruptly is still unexplained but seems unimportant. Regardless of the reason, the fix will need to happen in TraceEvent.
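A sketch of the kind of repro harness that implies (the doctored file name and bytes are hypothetical and not reproduced here; the point is only that the reader blocks in Process() when the stream stays open but no more data arrives):

```csharp
using System;
using System.IO;
using System.IO.Pipes;
using System.Threading.Tasks;
using Microsoft.Diagnostics.Tracing;

class HangRepro
{
    static void Main()
    {
        // Hypothetical input: a hand-doctored nettrace whose last EventBlock
        // leaves the stream at an offset that is not a multiple of 8.
        byte[] doctored = File.ReadAllBytes("doctored.nettrace");

        // Serve the bytes through a pipe that is never closed, mimicking a live
        // session with no further events.
        var server = new AnonymousPipeServerStream(PipeDirection.Out);
        var reader = new AnonymousPipeClientStream(PipeDirection.In, server.GetClientHandleAsString());
        Task.Run(() => server.Write(doctored, 0, doctored.Length));

        using var source = new EventPipeEventSource(reader);
        source.Dynamic.All += e => Console.WriteLine($"dispatched: {e.EventName}");
        source.Process();   // hangs in the alignment read instead of dispatching the buffered events
    }
}
```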

josalem commented 3 years ago

Failures since the bandaid fix all appear unrelated; they come from PRs testing linking, which fails the test since it relies on EventSource.

josalem commented 3 years ago

New failure is not a recurrence of the original issue. It is a failure in a PR that changes floating point number parsing.