dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License
35.19k stars 9.93k forks source link

Quarantine `VerifyCountersFireWithCorrectValues` and `EventCountersAndMetricsValues` #57259

Open halter73 opened 1 month ago

halter73 commented 1 month ago

Failing Test(s)

Error Message

System.Threading.Tasks.TaskCanceledException : A task was canceled.

Stacktrace

```text at System.Threading.Channels.ChannelReader`1.ReadAllAsync(CancellationToken cancellationToken)+MoveNext() at System.Threading.Channels.ChannelReader`1.ReadAllAsync(CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource.GetResult() at System.Collections.Generic.AsyncEnumerableExtensions.FirstOrDefault[T](IAsyncEnumerator`1 values, Func`2 filter) in /_/src/Hosting/Hosting/test/Internal/AsyncEnumerableExtensions.cs:line 10 at Microsoft.AspNetCore.Hosting.Tests.HostingApplicationDiagnosticsTests.EventCountersAndMetricsValues() in /_/src/Hosting/Hosting/test/HostingApplicationDiagnosticsTests.cs:line 68 --- End of stack trace from previous location --- ```
```text at System.Threading.Channels.ChannelReader`1.ReadAllAsync(CancellationToken cancellationToken)+MoveNext() at System.Threading.Channels.ChannelReader`1.ReadAllAsync(CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource.GetResult() at System.Collections.Generic.AsyncEnumerableExtensions.FirstOrDefault[T](IAsyncEnumerator`1 values, Func`2 filter) in /_/src/Hosting/Hosting/test/Internal/AsyncEnumerableExtensions.cs:line 10 at Microsoft.AspNetCore.Hosting.HostingEventSourceTests.VerifyCountersFireWithCorrectValues() in /_/src/Hosting/Hosting/test/Internal/HostingEventSourceTests.cs:line 207 --- End of stack trace from previous location --- ```

Logs

```text ```

Build

https://dev.azure.com/dnceng-public/public/_build/results?buildId=771079&view=ms.vss-test-web.build-test-results-tab https://dev.azure.com/dnceng-public/public/_build/results?buildId=771156&view=ms.vss-test-web.build-test-results-tab https://dev.azure.com/dnceng-public/public/_build/results?buildId=771449&view=ms.vss-test-web.build-test-results-tab https://dev.azure.com/dnceng-public/public/_build/results?buildId=769803&view=ms.vss-test-web.build-test-results-tab

halter73 commented 1 month ago

@JamesNK Do you know of any changes that might have affected Hosting's "requests-per-second", "total-requests", "current-requests", and/or "failed-requests" counters? Both of these tests verify those counters using the TestCounterListener.

JamesNK commented 1 month ago

No, I don't. There have been SignalR telemetry changes, but they're isolated to SignalR. And the Kestrel metrics connection improvement was merged a month ago.

It looks like errors started a few days ago: image

There was a runtime update on that day. Related? https://github.com/dotnet/aspnetcore/pull/57060

JamesNK commented 1 month ago

I looked at what VerifyCountersFireWithCorrectValues does and I'm sure the problem must be lower down. That test just verifies that HostingEventSource writes some event counters. That's it. And the code there hasn't changed in a year.

@tarekgh @noahfalk Have there been any changes in event counters recently?

tarekgh commented 1 month ago

I am not touching event counters at all and not aware of any changes there. @noahfalk should know better here.

noahfalk commented 1 month ago

This change merged recently and might be related - https://github.com/dotnet/runtime/pull/105548

I suspect what happened is the test is making assumptions that both requests will land in the same measurement interval so that some counter event will get reported where Increment=2. Such an assumption has always been relying on a race condition and not guaranteed to be true, but prior to 105548 it was probably very likely. The requests just needed to finish in less than 1 second so that they wouldn't miss the first measurement interval. After change 105548 the start of the first measurement interval is also asynchronous so its now possible for the requests to run fast enough that they complete before the first measurement interval starts.

If that is the issue, a couple things that could make the test more timing resillient:

JamesNK commented 1 month ago

That sounds like it. We can refactor the test to track increments across events.

This could affect more people. Although I doubt many people take the time to unit test counters.

JamesNK commented 1 month ago

PR with improvements: https://github.com/dotnet/aspnetcore/pull/57269

noahfalk commented 1 month ago

This could affect more people. Although I doubt many people take the time to unit test counters.

Yeah, its a risk but I think a worthwhile one in order to fix the bug. I'm not aware of a good option that preserves the fix without also disrupting tests that were making implicit assumptions about the 1st callback being synchronous.