Open halter73 opened 1 month ago
@JamesNK Do you know of any changes that might have affected Hosting's "requests-per-second", "total-requests", "current-requests", and/or "failed-requests" counters? Both of these tests verify those counters using the TestCounterListener
.
No, I don't. There have been SignalR telemetry changes, but they're isolated to SignalR. And the Kestrel metrics connection improvement was merged a month ago.
It looks like errors started a few days ago:
There was a runtime update on that day. Related? https://github.com/dotnet/aspnetcore/pull/57060
I looked at what VerifyCountersFireWithCorrectValues
does and I'm sure the problem must be lower down. That test just verifies that HostingEventSource
writes some event counters. That's it. And the code there hasn't changed in a year.
@tarekgh @noahfalk Have there been any changes in event counters recently?
I am not touching event counters at all and not aware of any changes there. @noahfalk should know better here.
This change merged recently and might be related - https://github.com/dotnet/runtime/pull/105548
I suspect what happened is the test is making assumptions that both requests will land in the same measurement interval so that some counter event will get reported where Increment=2. Such an assumption has always been relying on a race condition and not guaranteed to be true, but prior to 105548 it was probably very likely. The requests just needed to finish in less than 1 second so that they wouldn't miss the first measurement interval. After change 105548 the start of the first measurement interval is also asynchronous so its now possible for the requests to run fast enough that they complete before the first measurement interval starts.
If that is the issue, a couple things that could make the test more timing resillient:
That sounds like it. We can refactor the test to track increments across events.
This could affect more people. Although I doubt many people take the time to unit test counters.
PR with improvements: https://github.com/dotnet/aspnetcore/pull/57269
This could affect more people. Although I doubt many people take the time to unit test counters.
Yeah, its a risk but I think a worthwhile one in order to fix the bug. I'm not aware of a good option that preserves the fix without also disrupting tests that were making implicit assumptions about the 1st callback being synchronous.
Failing Test(s)
Microsoft.AspNetCore.Hosting.Tests.HostingApplicationDiagnosticsTests.EventCountersAndMetricsValues
Microsoft.AspNetCore.Hosting.HostingEventSourceTests.VerifyCountersFireWithCorrectValues
Error Message
Stacktrace
Logs
Build
https://dev.azure.com/dnceng-public/public/_build/results?buildId=771079&view=ms.vss-test-web.build-test-results-tab https://dev.azure.com/dnceng-public/public/_build/results?buildId=771156&view=ms.vss-test-web.build-test-results-tab https://dev.azure.com/dnceng-public/public/_build/results?buildId=771449&view=ms.vss-test-web.build-test-results-tab https://dev.azure.com/dnceng-public/public/_build/results?buildId=769803&view=ms.vss-test-web.build-test-results-tab