Intermittent Failing Test: Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations()

Is there an existing issue for this?

[X] I have searched the existing issues

Describe the bug

The Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() test fails under rare random conditions, but does not seem to repeat with the same random seed and culture.

Expected Behavior

The Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() test should always succeed.

Steps To Reproduce

Add the [Repeat(100000)] attribute to the test and it will fail frequently.

Exceptions (if any)

TestMaxClauseLimitations
   Source: TestMultiTermQueryRewrites.cs line 295
   Duration: 2.6 sec

  Message: 
Expected: CheckMaxClauseCount, Actual: Collect

To reproduce this test result:

Option 1:

 Apply the following assembly-level attributes:

[assembly: Lucene.Net.Util.RandomSeed("0x933174838b8c2dfd")]
[assembly: NUnit.Framework.SetCulture("fr-BL")]

Option 2:

 Use the following .runsettings file:

<RunSettings>
  <TestRunParameters>
    <Parameter name="tests:seed" value="0x933174838b8c2dfd" />
    <Parameter name="tests:culture" value="fr-BL" />
  </TestRunParameters>
</RunSettings>

Option 3:

 Create the following lucene.testsettings.json file somewhere between the test assembly and the root of your drive:

{
  "tests": {
     "seed": "0x933174838b8c2dfd",
     "culture": "fr-BL"
  }
}

  Stack Trace: 
TestMultiTermQueryRewrites.CheckMaxClauseLimitation(RewriteMethod method, String memberName) line 268
TestMultiTermQueryRewrites.TestMaxClauseLimitations() line 297
InvokeStub_TestMultiTermQueryRewrites.TestMaxClauseLimitations(Object, Object, IntPtr*)
1)    at Lucene.Net.Search.TestMultiTermQueryRewrites.CheckMaxClauseLimitation(RewriteMethod method, String memberName) in F:\Projects\lucenenet\src\Lucene.Net.Tests\Search\TestMultiTermQueryRewrites.cs:line 262
TestMultiTermQueryRewrites.TestMaxClauseLimitations() line 297
InvokeStub_TestMultiTermQueryRewrites.TestMaxClauseLimitations(Object, Object, IntPtr*)

Lucene.NET Version

4.8.0-beta00017

.NET Version

net8.0

Operating System

Windows

Anything else?

It is known to fail on Windows x64 with net8.0. It may fail also fail under other configurations.

This test should be compared with the behavior in Lucene 4.8.0. For some reason we are sometimes seeing the exception thrown from the Collect() method when we are expecting it to always be thrown from the CheckMaxClauseCount() method according to the way the test is written.

Note also that the assert message is commented out, and it should not be.

I can reproduce with enough repeats on net8.0 macOS arm64. I'm fairly certain that what we're seeing is .NET 8's Dynamic PGO inlining this trivial method at runtime. Much more further reading, but search the page for "inline" to see that it does inline trivial methods if it determines that is beneficial: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/

We could force prevent this with [MethodImpl(MethodImplOptions.NoInlining)] on CheckMaxClauseCount (and initial results in my testing shows that seems to fix it), but IMO we should not rely on that. I think we should remove this assertion, even though it exists upstream (where Java did not have anything like .NET 8's Dynamic PGO to affect this stack trace), with a LUCENENET-specific comment explaining why we're removing it:

//  Maybe remove this assert in later versions, when internal API changes:
Assert.AreEqual("CheckMaxClauseCount", new StackTrace(e, false).GetFrames()[0].GetMethod().Name); //, "Should throw BooleanQuery.TooManyClauses with a stacktrace containing checkMaxClauseCount()");

We should only care that a BooleanQuery.TooManyClausesException is thrown here, and not which specific method throws it.

Actually, it would be a bug if we didn't have [MethodImpl(MethodImplOptions.NoInlining)] on CheckMaxClauseCount. That is a requirement of all of the tests that check for method names in the stack (and always has been). These "who called me" checks exist in Lucene for a reason.

There are a few dozen of these stack trace checks in the tests. We took [MethodImpl(MethodImplOptions.NoInlining)] a bit too far, because it was difficult to distinguish between identically named methods as to which one was expected in the stack (see #931). So, we can remove some of these without causing problems. However, this is a case where we didn't go far enough.

While I would like to find a better way to check these scenarios, I am having trouble coming up with a way doesn't significantly deviate from the upstream code. Checking the stack seems to be the easiest approach. And with optimizations enabled it requires [MethodImpl(MethodImplOptions.NoInlining)] to function.

These "who called me" checks exist in Lucene for a reason.

What is that reason? If anything, the comment on it indicates a keenness to remove the assertion, rather than keep it, because it acknowledges that it's testing an internal API. Users should only be concerned that the exception was thrown as expected, not that a specific protected method in the call stack was the one that did the check.

Perhaps we could ask the Lucene team for their thoughts, but my gut tells me that if they had anything as awesome as Dynamic PGO at the time, they would lean on the side of removing this assertion rather than harming performance by forcing it not to be inlined, just to keep the assertion in the test (which has no meaningful benefit to end users).

Remember that Dynamic PGO is the runtime observing ways that it can tangibly improve performance based on real-world measured behavior, and then making those changes. By inhibiting this with NoInlining, we're choosing to harm performance just for purity with the original assertion. IMO Dynamic PGO's inlining is a feature, not a bug. And I think we should do a sweep to remove all such assertions and NoInlining that was added to make these assertions pass. Let's err on the side of letting the runtime give us better performance, than having to make sure our stack trace matches some decade+ old version of the JDK's behavior. Especially when that assertion's author was even questioning whether it should be there in the first place. We could be considered the "later version" of this code that is changing the internal API, at runtime, via Dynamic PGO.

apache / lucenenet