eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 721 forks source link

Apache Lucene CI builds sometimes fail with OpenJ9 specific issues #18400

Open mikemccand opened 10 months ago

mikemccand commented 10 months ago

Hello from Apache Lucene project, where we run CI builds with various JVMs and versions, including OpenJ9.

We sometimes see CI failures that seem to be OpenJ9 specific, and we don't have many dev resources to pursue them. You can see the recent failures in Apache mail list archives. Each test failure includes a Reproduce with: command-line that in theory will recreate the failure in your dev area.

This one looks more interesting:

The real reason for this is buried in other stack traces from
barrier-broken threads, it's this assertion:

Caused by:
java.lang.AssertionError
at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
at app//org.junit.Assert.fail(Assert.java:87)
at app//org.junit.Assert.assertTrue(Assert.java:42)
at app//org.junit.Assert.assertTrue(Assert.java:53)
at app//org.apache.lucene.index.TestIndexWriterThreadsToSegments$CheckSegmentCount.run(TestIndexWriterThreadsToSegments.java:150)
at java.base@17.0.5/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:222)
at java.base@17.0.5/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
at app//org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

I'm not sure if this is actionable by any OpenJ9 devs, but I wanted at least to establish some contact between our two projects so that together we could somehow get to the root cause on some of these issues. The root cause might be in Lucene (e.g. failing to properly detect number of bytes for an object pointer in the current JVM -- there is some JVM vendor specific logic I think) or might be something specific to OpenJ9.

We seem to be testing this OpenJ9 version: 64bit/openj9/jdk-17.0.5.

mikemccand commented 10 months ago

Quick update: it turns out some of those failures were in fact a bug in Lucene! So please ignore the failures about RamUsageEstimator. See small discussion here. It really is valuable to Lucene that we CI on many different JVMs ...

But I think the above exception is possibly J9 specific.

tajila commented 10 months ago

@mikemccand Thanks for letting us know. Do you have instructions on how we can reproduce the issue here

mikemccand commented 10 months ago

Hi @tajila -- normally the Lucene build failures have a nice Reproduce with: ... line which you can copy/paste into a Lucene 9.x clone. But this particular build failure is missing that for some reason.

That said, if you clone Lucene's branch_9x from https://github.com/apache/lucene, possibly roll back to revision around August 2, 2023, and run:

./gradlew -p lucene/core -Dtests.seed=F7B4CD7A5624D5EC test --tests TestIndexWriterThreadsToSegments -Dtests.jvmargs="-XX:+UseCompressedOops -Xgcpolicy:balanced"  

It will try to reproduce. However, I just tried that, with OpenJ9 openjdk full version "17.0.8.1+1" (downloaded from the IBM developerworks site) and it fails to reproduce for me (the test is using threads which often leads to unreproducible seeds).

When this happens (failure to reproduce the test on one run) we can sometimes add another parameter to run test many times, -Ptests.iters=N, and it may reproduce the failure. I just did that:

./gradlew -p lucene/core -Dtests.seed=F7B4CD7A5624D5EC test --tests TestIndexWriterThreadsToSegments -Dtests.jvmargs="-XX:+UseCompressedOops -Xgcpolicy:balanced" -Ptests.iters=100    

And the failure reproduced!:

org.apache.lucene.index.TestIndexWriterThreadsToSegments > testSegmentCountOnFlushRandom {seed=[F7B4CD7A5624D5EC:FB994C7CD25C034A]} FAILED                                                                          
    java.lang.AssertionError                                                                                                                                                                                        
        at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)                                                                                                                                                  
        at org.junit.Assert.fail(Assert.java:87)                                                                                                                                                                    
        at org.junit.Assert.assertTrue(Assert.java:42)                                                                                                                                                              
        at org.junit.Assert.assertTrue(Assert.java:53)                                                                                                                                                              
        at org.apache.lucene.index.TestIndexWriterThreadsToSegments$CheckSegmentCount.run(TestIndexWriterThreadsToSegments.java:150)                                                                                
        at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:222)                                                                                                                              
        at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)                                                                                                                               
        at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)                                                                                                

  2> NOTE: reproduce with: gradlew test --tests TestIndexWriterThreadsToSegments.testSegmentCountOnFlushRandom -Dtests.seed=F7B4CD7A5624D5EC -Dtests.locale=en-DK -Dtests.timezone=HST -Dtests.asserts=true -Dtests.file.encoding=US-ASCII                                                                                                                                                                                             
  2> NOTE: test params are: codec=CheapBastard, sim=Asserting(RandomSimilarity(queryNorm=false): {field=DFR GLZ(0.3)}), locale=en-DK, timezone=HST                                                                  
  2> NOTE: Linux 6.4.11-arch2-1 amd64/IBM Corporation 17.0.8.1 (64-bit)/cpus=32,threads=1,free=398591152,total=529530880                                                                                            
  2> NOTE: All tests run in this JVM: [TestIndexWriterThreadsToSegments]                                                                                                                                            

This doesn't repro with OpenJDK 21 (I had to remove the -Xgcpolicy:balanced.

tajila commented 10 months ago

Thanks @mikemccand I was able to reproduce this with:

openjdk version "17.0.9-internal" 2023-10-17
OpenJDK Runtime Environment (build 17.0.9-internal+0-adhoc.jenkins.BuildJDK17x86-64linuxNightly)
Eclipse OpenJ9 VM (build master-b614f6cd7f7, JRE 17 Linux amd64-64-Bit Compressed References 20231103_611 (JIT enabled, AOT enabled)
OpenJ9   - b614f6cd7f7
OMR      - 3cdf41f96fc
JCL      - aa764da8110 based on jdk-17.0.9+9)
mikemccand commented 10 months ago

I was able to reproduce this with:

Great!

Here is another possibly J9 specific failure: https://jenkins.thetaphi.de/job/Lucene-main-Linux/45480/

org.apache.lucene.index.TestIndexSorting.txt, copied below:
   >     org.apache.lucene.index.CheckIndex$CheckIndexException: vector term=[0 0 0] field=sparse_text does not exist in postings; doc=66
   >         at __randomizedtesting.SeedInfo.seed([741EFE91CE6EF9D7:2AEA93B992E44A5B]:0)
   >         at app//org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:3700)
   >         at app//org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1013)
   >         at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:721)
   >         at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:559)
   >         at app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343)
   >         at app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
   >         at app//org.apache.lucene.util.IOUtils.close(IOUtils.java:85)
   >         at app//org.apache.lucene.util.IOUtils.close(IOUtils.java:72)
   >         at app//org.apache.lucene.index.TestIndexSorting.testIndexSortWithSparseField(TestIndexSorting.java:2769)
   >         at java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   >         at java.base@17.0.5/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >         at java.base@17.0.5/java.lang.reflect.Method.invoke(Method.java:568)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   >         at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   >         at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   >         at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   >         at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   >         at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   >         at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
   >         at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
   >         at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
   >         at java.base@17.0.5/java.lang.Thread.run(Thread.java:857)
  2> NOTE: reproduce with: gradlew test --tests TestIndexSorting.testIndexSortWithSparseField -Dtests.seed=741EFE91CE6EF9D7 -Dtests.multiplier=3 -Dtests.locale=ksb-TZ -Dtests.timezone=Africa/Gaborone -Dtests.asserts=true -Dtests.file.encoding=UTF-8
  2> NOTE: leaving temporary files on disk at: /home/jenkins/workspace/Lucene-main-Linux/lucene/core/build/tmp/tests-tmp/lucene.index.TestIndexSorting_741EFE91CE6EF9D7-001
  2> NOTE: test params are: codec=Asserting(Lucene99): {field=PostingsFormat(name=LuceneVarGapFixedInterval), docs=PostingsFormat(name=Asserting), sparse_text=PostingsFormat(name=LuceneVarGapFixedInterval), norms=PostingsFormat(name=Asserting), positions=PostingsFormat(name=LuceneVarGapFixedInterval), id=PostingsFormat(name=Asserting), term_vectors=BlockTreeOrds(blocksize=128)}, docValues:{multi_valued_long=DocValuesFormat(name=Lucene90), double=DocValuesFormat(name=Asserting), foo=DocValuesFormat(name=Asserting), numeric=DocValuesFormat(name=Asserting), sparse_binary=DocValuesFormat(name=Lucene90), sort=DocValuesFormat(name=Asserting), float=DocValuesFormat(name=Lucene90), multi_valued_numeric=DocValuesFormat(name=Asserting), int=DocValuesFormat(name=Lucene90), long=DocValuesFormat(name=Lucene90), sparse_int=DocValuesFormat(name=Asserting), bar=DocValuesFormat(name=Lucene90), sorted=DocValuesFormat(name=Lucene90), multi_valued_double=DocValuesFormat(name=Asserting), sparse=DocValuesFormat(name=Lucene90), multi_valued_string=DocValuesFormat(name=Lucene90), bytes=DocValuesFormat(name=Lucene90), binary=DocValuesFormat(name=Lucene90), dense_int=DocValuesFormat(name=Lucene90), id=DocValuesFormat(name=Lucene90), multi_valued_int=DocValuesFormat(name=Asserting), multi_valued_bytes=DocValuesFormat(name=Asserting), multi_valued_float=DocValuesFormat(name=Lucene90)}, maxPointsInLeafNode=1114, maxMBSortInHeap=6.390174474628566, sim=Asserting(RandomSimilarity(queryNorm=false): {field=DFR GBZ(0.3), positions=IB SPL-L2, id=DFR I(ne)B3(800.0), term_vectors=DFR I(ne)B2}), locale=ksb-TZ, timezone=Africa/Gaborone
  2> NOTE: Linux 6.2.0-35-generic amd64/IBM Corporation 17.0.5 (64-bit)/cpus=16,threads=1,free=224949352,total=268435456
  2> NOTE: All tests run in this JVM: [TestDemo, TestAnalysisSPILoader, TestCharFilter, TestStopFilter, TestPackedTokenAttributeImpl, TestDeflateWithPresetDictCompressionMode, TestLucene90CompoundFormat, TestLucene90NormsFormatMergeInstance, TestMSBVLong, TestLucene99HnswQuantizedVectorsFormat, TestPerFieldDocValuesFormat, TestDocument, TestFieldType, TestLatLonDocValuesMultiPointPointQueries, TestLatLonMultiPointShapeQueries, TestLatLonPointShapeDVQueries, TestLongDistanceFeatureQuery, TestSortedSetDocValuesSetQuery, TestXYPointShapeDVQueries, TestCircle, TestPoint2D, TestXYCircle, Test2BDocs, Test2BSortedDocValuesFixedSorted, TestAllFilesDetectMismatchedChecksum, TestBagOfPositions, TestByteSliceReader, TestConcurrentMergeScheduler, TestDeletionPolicy, TestDocIDMerger, TestDocsWithFieldSet, TestDuelingCodecsAtNight, TestFieldUpdatesBuffer, TestFilterMergePolicy, TestIndexCommit, TestIndexSorting]
mikemccand commented 10 months ago

There's some OpenJ9 specific discussion about this build failure: https://lists.apache.org/thread/svt6bqqwdkb4kq7b9zhx630n4sj27ovq

uschindler commented 10 months ago

There's some OpenJ9 specific discussion about this build failure: https://lists.apache.org/thread/svt6bqqwdkb4kq7b9zhx630n4sj27ovq

This is unrelated to OpenJ9, it is caused by errorprone not working with forked compilers.

mikemccand commented 10 months ago

There's some OpenJ9 specific discussion about this build failure: https://lists.apache.org/thread/svt6bqqwdkb4kq7b9zhx630n4sj27ovq

This is unrelated to OpenJ9, it is caused by errorprone not working with forked compilers.

Woops thanks for the correction @uschindler.

This one looks maybe unique to OpenJ9?

https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/14098

  2> NOTE: reproduce with: gradlew test --tests TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields -Dtests.seed=4CB1759046D1E087 -Dtests.multiplier=3 -Dtests.locale=bo-IN -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=UTF-8
  2> ཟླ་༡༡ 21, 2023 5:38:35 སྔ་དྲོ་ com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
  2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #0,5,TGRP-TestLucene90DocValuesFormat]
  2> org.apache.lucene.index.MergePolicy$MergeException: java.lang.RuntimeException: Invalid vInt detected (too many bits)
  2>    at __randomizedtesting.SeedInfo.seed([4CB1759046D1E087]:0)
  2>    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
  2>    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
  2> Caused by: java.lang.RuntimeException: Invalid vInt detected (too many bits)
  2>    at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:128)
  2>    at org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$TermsDict.next(Lucene90DocValuesProducer.java:1100)
  2>    at org.apache.lucene.index.TermsEnumIndex.next(TermsEnumIndex.java:95)
  2>    at org.apache.lucene.index.OrdinalMap.<init>(OrdinalMap.java:269)
  2>    at org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:176)
  2>    at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedField(DocValuesConsumer.java:663)
  2>    at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:146)
  2>    at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:154)
  2>    at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:178)
  2>    at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
  2>    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140)
  2>    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5177)
  2>    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4710)
  2>    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6465)
  2>    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
  2>    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)
  2> 
   >     com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=875, name=Lucene Merge Thread #0, state=RUNNABLE, group=TGRP-TestLucene90DocValuesFormat]
   >         at __randomizedtesting.SeedInfo.seed([4CB1759046D1E087:7938F85897C76D67]:0)
   > 
   >         Caused by:
   >         org.apache.lucene.index.MergePolicy$MergeException: java.lang.RuntimeException: Invalid vInt detected (too many bits)
   >             at __randomizedtesting.SeedInfo.seed([4CB1759046D1E087]:0)
   >             at app//org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
   >             at app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
   > 
   >             Caused by:
   >             java.lang.RuntimeException: Invalid vInt detected (too many bits)
   >                 at app//org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:128)
   >                 at app//org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$TermsDict.next(Lucene90DocValuesProducer.java:1100)
   >                 at app//org.apache.lucene.index.TermsEnumIndex.next(TermsEnumIndex.java:95)
   >                 at app//org.apache.lucene.index.OrdinalMap.<init>(OrdinalMap.java:269)
   >                 at app//org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:176)
   >                 at app//org.apache.lucene.codecs.DocValuesConsumer.mergeSortedField(DocValuesConsumer.java:663)
   >                 at app//org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:146)
   >                 at app//org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:154)
   >                 at app//org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:178)
   >                 at app//org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
   >                 at app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140)
   >                 at app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5177)
   >                 at app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4710)
   >                 at app//org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6465)
   >                 at app//org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
   >                 at app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)
  2> NOTE: reproduce with: gradlew test --tests TestLucene90DocValuesFormat.testSortedMergeAwayAllValues -Dtests.seed=4CB1759046D1E087 -Dtests.multiplier=3 -Dtests.locale=bo-IN -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=UTF-8
  2> NOTE: leaving temporary files on disk at: /home/jenkins/workspace/Lucene-9.x-Linux/lucene/core/build/tmp/tests-tmp/lucene.codecs.lucene90.TestLucene90DocValuesFormat_4CB1759046D1E087-001
  2> NOTE: test params are: codec=CheapBastard, sim=Asserting(RandomSimilarity(queryNorm=true): {fieldname=DFR I(ne)L2, docId=IB SPL-DZ(0.3), id=LM Jelinek-Mercer(0.100000)}), locale=bo-IN, timezone=Asia/Jerusalem
  2> NOTE: Linux 6.2.0-36-generic amd64/IBM Corporation 11.0.20.1 (64-bit)/cpus=16,threads=1,free=134479232,total=470810624
  2> NOTE: All tests run in this JVM: [TestExternalCodecs, TestAnalyzerWrapper, TestCharacterUtils, TestWordlistLoader, TestSimpleAttributeImpl, TestFastCompressionMode, TestLucene90DocValuesFormat]
uschindler commented 10 months ago

This one looks maybe unique to OpenJ9?

Yes it is, This is clearly a bug, it looks like it is flipping bits in the vInt encoding due to some problems in code optimizations. It only happens with OpenJ9 and this is an ongoing issue (happens quite often).

uschindler commented 10 months ago

The above run also produced other errors. What is interesting: Once OpenJ9 gets into an invalid state it then produces more and more errors in tests as followup. The above run also broke LZ4 compression:

java.lang.AssertionError
    at __randomizedtesting.SeedInfo.seed([4CB1759046D1E087:200409C91DCA2BF1]:0)
    at app//org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:116)
    at app//org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)
    at app//org.apache.lucene.codecs.compressing.AbstractTestCompressionMode.decompress(AbstractTestCompressionMode.java:77)
    at app//org.apache.lucene.codecs.compressing.AbstractTestCompressionMode.decompress(AbstractTestCompressionMode.java:71)
    at app//org.apache.lucene.codecs.compressing.AbstractTestCompressionMode.test(AbstractTestCompressionMode.java:131)
    at app//org.apache.lucene.codecs.compressing.AbstractTestCompressionMode.test(AbstractTestCompressionMode.java:126)
    at app//org.apache.lucene.codecs.compressing.AbstractTestCompressionMode.testExtremelyLargeInput(AbstractTestCompressionMode.java:163)
uschindler commented 10 months ago

The main issue here looks like ByteArrayDataInput was miscompiled and reading from it creates some incorrect decoded integers/bytes/shorts. Aftre that downstream code like LZ4 decompression or reading VInts (that are ints which are encoded like UTF-8 sequences) then hits assertions.

singh264 commented 9 months ago

@tajila I am investigating the problem in the GitHub issue.

singh264 commented 9 months ago

I was able to reproduce this with: openjdk version "17.0.9-internal" 2023-10-17 OpenJDK Runtime Environment (build 17.0.9-internal+0-adhoc.jenkins.BuildJDK17x86-64linuxNightly) Eclipse OpenJ9 VM (build master-b614f6cd7f7, JRE 17 Linux amd64-64-Bit Compressed References 20231103_611 (JIT enabled, AOT enabled) OpenJ9 - b614f6cd7f7 OMR - 3cdf41f96fc JCL - aa764da8110 based on jdk-17.0.9+9)

@tajila how did you ensure that the jdk.jfr.consumer package was recognized during the compilation process for Lucene when reproducing the test?

root@c24919v1:~/openj9_issues_18400/lucene# $JAVA_HOME/bin/java -version
openjdk version "17.0.9-internal" 2023-10-17
OpenJDK Runtime Environment (build 17.0.9-internal+0-adhoc.root.openj9-openjdk-jdk17)
Eclipse OpenJ9 VM (build openj9-0.41.0, JRE 17 Linux amd64-64-Bit Compressed References 20231124_000000 (JIT enabled, AOT enabled)
OpenJ9   - 461bf3c70
OMR      - 5eee6ad9d
JCL      - 3699725139c based on jdk-17.0.9+9)
root@c24919v1:~/openj9_issues_18400/lucene# ./gradlew -p lucene/core -Dtests.seed=F7B4CD7A5624D5EC test --tests TestIndexWriterThreadsToSegments -Dtests.jvmargs="-XX:+UseCompressedOops -Xgcpolicy:balanced" -Ptests.iters=100 --info --scan 
Initialized native services in: /root/.gradle/native
Initialized jansi services in: /root/.gradle/native
Received JVM installation metadata from '/root/openj9_issues_18400/openj9-openjdk-jdk17/build/linux-x86_64-server-release/images/jdk': {JAVA_HOME=/root/openj9_issues_18400/openj9-openjdk-jdk17/build/linux-x86_64-server-release/images/jdk, JAVA_VERSION=17.0.9-internal, JAVA_VENDOR=Eclipse OpenJ9, RUNTIME_NAME=OpenJDK Runtime Environment, RUNTIME_VERSION=17.0.9-internal+0-adhoc.root.openj9-openjdk-jdk17, VM_NAME=Eclipse OpenJ9 VM, VM_VERSION=openj9-0.41.0, VM_VENDOR=Eclipse OpenJ9, OS_ARCH=amd64}
The client will now receive all logging from the daemon (pid: 3401592). The daemon log file: /root/.gradle/daemon/8.4/daemon-3401592.out.log
Starting 11th build in daemon [uptime: 31 mins 36.069 secs, performance: 98%]
Using 4 worker leases.
Now considering [/root/openj9_issues_18400/lucene, /root/openj9_issues_18400/lucene/dev-tools/missing-doclet, /root/openj9_issues_18400/lucene/buildSrc] as hierarchies to watch
Now considering [/root/openj9_issues_18400/lucene/buildSrc, /root/openj9_issues_18400/lucene, /root/openj9_issues_18400/lucene/dev-tools/missing-doclet] as hierarchies to watch
Watching the file system is configured to be enabled if available
File system watching is active
Starting Build
Settings evaluated using settings file '/root/openj9_issues_18400/lucene/settings.gradle'.
Now considering [/root/openj9_issues_18400/lucene/dev-tools/missing-doclet, /root/openj9_issues_18400/lucene/buildSrc, /root/openj9_issues_18400/lucene] as hierarchies to watch
Projects loaded. Root project using build file '/root/openj9_issues_18400/lucene/build.gradle'.
Included projects: [root project 'lucene-root', project ':lucene', project ':lucene:analysis', project ':lucene:analysis.tests', project ':lucene:backward-codecs', project ':lucene:benchmark', project ':lucene:classification', project ':lucene:codecs', project ':lucene:core', project ':lucene:core.tests', project ':lucene:demo', project ':lucene:distribution', project ':lucene:distribution.tests', project ':lucene:documentation', project ':lucene:expressions', project ':lucene:facet', project ':lucene:grouping', project ':lucene:highlighter', project ':lucene:join', project ':lucene:luke', project ':lucene:memory', project ':lucene:misc', project ':lucene:monitor', project ':lucene:queries', project ':lucene:queryparser', project ':lucene:replicator', project ':lucene:sandbox', project ':lucene:spatial-extras', project ':lucene:spatial-test-fixtures', project ':lucene:spatial3d', project ':lucene:suggest', project ':lucene:test-framework', project ':lucene:analysis:common', project ':lucene:analysis:icu', project ':lucene:analysis:kuromoji', project ':lucene:analysis:morfologik', project ':lucene:analysis:morfologik.tests', project ':lucene:analysis:nori', project ':lucene:analysis:opennlp', project ':lucene:analysis:phonetic', project ':lucene:analysis:smartcn', project ':lucene:analysis:stempel']

> Configure project :missing-doclet
Evaluating project ':missing-doclet' using build file '/root/openj9_issues_18400/lucene/dev-tools/missing-doclet/build.gradle'.
Registering project ':missing-doclet' in composite build. Will substitute for module 'org.apache.lucene.tools:missing-doclet'.

> Configure project :buildSrc
Evaluating project ':buildSrc' using build file '/root/openj9_issues_18400/lucene/buildSrc/build.gradle'.
Resolve mutations for :buildSrc:compileJava (Thread[Execution worker,5,main]) started.
:buildSrc:compileJava (Thread[Execution worker,5,main]) started.

> Task :buildSrc:compileJava FAILED
Caching disabled for task ':buildSrc:compileJava' because:
  Build cache is disabled
Task ':buildSrc:compileJava' is not up-to-date because:
  Task has failed previously.
The input changes require a full rebuild for incremental task ':buildSrc:compileJava'.
Watching 13 directories to track changes
Watching 12 directories to track changes
Watching 11 directories to track changes
Watching 10 directories to track changes
Full recompilation is required because no incremental change information is available. This is usually caused by clean builds or changing compiler arguments.
Compiling with toolchain '/root/openj9_issues_18400/openj9-openjdk-jdk17/build/linux-x86_64-server-release/images/jdk'.
Compiling with JDK Java compiler API.
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:30: error: package jdk.jfr.consumer does not exist
import jdk.jfr.consumer.RecordedClass;
                       ^
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:31: error: package jdk.jfr.consumer does not exist
import jdk.jfr.consumer.RecordedEvent;
                       ^
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:32: error: package jdk.jfr.consumer does not exist
import jdk.jfr.consumer.RecordedFrame;
                       ^
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:33: error: package jdk.jfr.consumer does not exist
import jdk.jfr.consumer.RecordedMethod;
                       ^
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:34: error: package jdk.jfr.consumer does not exist
import jdk.jfr.consumer.RecordedStackTrace;
                       ^
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:35: error: package jdk.jfr.consumer does not exist
import jdk.jfr.consumer.RecordedThread;
                       ^
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:36: error: package jdk.jfr.consumer does not exist
import jdk.jfr.consumer.RecordingFile;
                       ^
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:48: error: cannot find symbol
  static String frameToString(RecordedFrame frame, boolean lineNumbers) {
                              ^
  symbol:   class RecordedFrame
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:101: error: cannot find symbol
  static boolean isInteresting(String mode, RecordedEvent event) {
                                            ^
  symbol:   class RecordedEvent
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:116: error: cannot find symbol
  static boolean isGradlePollThread(RecordedThread thread) {
                                    ^
  symbol:   class RecordedThread
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:121: error: cannot find symbol
  static long getValue(RecordedEvent event) {
                       ^
  symbol:   class RecordedEvent
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:50: error: cannot find symbol
    RecordedMethod method = frame.getMethod();
    ^
  symbol:   class RecordedMethod
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:51: error: cannot find symbol
    RecordedClass clazz = method.getType();
    ^
  symbol:   class RecordedClass
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:170: error: cannot find symbol
      try (RecordingFile recording = new RecordingFile(Paths.get(file))) {
           ^
  symbol:   class RecordingFile
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:170: error: cannot find symbol
      try (RecordingFile recording = new RecordingFile(Paths.get(file))) {
                                         ^
  symbol:   class RecordingFile
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:172: error: cannot find symbol
          RecordedEvent event = recording.readEvent();
          ^
  symbol:   class RecordedEvent
  location: class ProfileResults
/root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/ProfileResults.java:176: error: cannot find symbol
          RecordedStackTrace trace = event.getStackTrace();
          ^
  symbol:   class RecordedStackTrace
  location: class ProfileResults
Note: /root/openj9_issues_18400/lucene/buildSrc/src/main/java/org/apache/lucene/gradle/Checksum.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17 errors
Watching 11 directories to track changes
Watching 12 directories to track changes
Watching 13 directories to track changes
Watching 14 directories to track changes

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':buildSrc:compileJava'.
> Compilation failed; see the compiler error output for details.

BUILD FAILED in 3s
1 actionable task: 1 executed

Publishing a build scan to scans.gradle.com requires accepting the Gradle Terms of Service defined at https://gradle.com/terms-of-service. Do you accept these terms? [yes, no] yes   

Gradle Terms of Service accepted.

Publishing build scan...
<-------------> 4% CONFIGURING [1m 3s]
https://gradle.com/s/ietctyxbc2nlm
uschindler commented 9 months ago

I was able to reproduce this with: openjdk version "17.0.9-internal" 2023-10-17 OpenJDK Runtime Environment (build 17.0.9-internal+0-adhoc.jenkins.BuildJDK17x86-64linuxNightly) Eclipse OpenJ9 VM (build master-b614f6cd7f7, JRE 17 Linux amd64-64-Bit Compressed References 20231103_611 (JIT enabled, AOT enabled) OpenJ9 - b614f6c OMR - 3cdf41f96fc JCL - aa764da8110 based on jdk-17.0.9+9)

@tajila how did you ensure that the jdk.jfr.consumer package was recognized during the compilation process for Lucene when reproducing the test?

root@c24919v1:~/openj9_issues_18400/lucene# $JAVA_HOME/bin/java -version
openjdk version "17.0.9-internal" 2023-10-17
OpenJDK Runtime Environment (build 17.0.9-internal+0-adhoc.root.openj9-openjdk-jdk17)
Eclipse OpenJ9 VM (build openj9-0.41.0, JRE 17 Linux amd64-64-Bit Compressed References 20231124_000000 (JIT enabled, AOT enabled)
OpenJ9   - 461bf3c70
OMR      - 5eee6ad9d
JCL      - 3699725139c based on jdk-17.0.9+9)

I am working on a solution to disable Java Flight Recorder Support when the module isn't there. This is just a build system issue and unrelated to the issues here.

During Lucene builds we always use Hotspot VMs, but you can run the test suite with any other JVM. Just pass RUNTIME_JAVA_HOME when invoking Gradle. JAVA_HOME should be a Hotspot JDK and is solely used to run Gradle.

uschindler commented 9 months ago

Hi, I created https://github.com/apache/lucene/pull/12845 to work around the J9 VM not having Java Flight Recorder by checking for the module:

>gradlew :lucene:core:test

> Configure project :buildSrc
Module jdk.jfr is not available; skipping compilation of Java Flight Recorder support.
uschindler commented 9 months ago

Code was merged, if you update Lucene's main branch, you should be able to test this with pure OpenJ9 (also to run Gradle).

singh264 commented 9 months ago

During Lucene builds we always use Hotspot VMs, but you can run the test suite with any other JVM. Just pass RUNTIME_JAVA_HOME when invoking Gradle. JAVA_HOME should be a Hotspot JDK and is solely used to run Gradle.

I was able to reproduce the test failure in TestIndexWriterThreadsToSegments but not the test failure in TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields.

$JAVA_HOME/bin/java -version
java version "17.0.9" 2023-10-17 LTS
Java(TM) SE Runtime Environment (build 17.0.9+11-LTS-201)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.9+11-LTS-201, mixed mode, sharing)

$RUNTIME_JAVA_HOME/bin/java -version
openjdk version "17.0.9-internal" 2023-10-17
OpenJDK Runtime Environment (build 17.0.9-internal+0-adhoc.root.openj9-openjdk-jdk17)
Eclipse OpenJ9 VM (build openj9-0.41.0, JRE 17 Linux amd64-64-Bit Compressed References 20231124_000000 (JIT enabled, AOT enabled)
OpenJ9   - 461bf3c70
OMR      - 5eee6ad9d
JCL      - 3699725139c based on jdk-17.0.9+9)

Code was merged, if you update Lucene's main branch, you should be able to test this with pure OpenJ9 (also to run Gradle).

$JAVA_HOME/bin/java -version
openjdk version "17.0.9-internal" 2023-10-17
OpenJDK Runtime Environment (build 17.0.9-internal+0-adhoc.root.openj9-openjdk-jdk17)
Eclipse OpenJ9 VM (build openj9-0.41.0, JRE 17 Linux amd64-64-Bit Compressed References 20231124_000000 (JIT enabled, AOT enabled)
OpenJ9   - 461bf3c70
OMR      - 5eee6ad9d
JCL      - 3699725139c based on jdk-17.0.9+9)
./gradlew -p lucene/core -Dtests.seed=F7B4CD7A5624D5EC test --tests TestIndexWriterThreadsToSegments -Dtests.jvmargs="-XX:+UseCompressedOops -Xgcpolicy:balanced" -Ptests.iters=100

...

> Task :lucene:core:test
WARNING: A command line option has enabled the Security Manager
WARNING: The Security Manager is deprecated and will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by java.lang.System
WARNING: Please consider reporting this to the maintainers of java.lang.System
WARNING: System::setSecurityManager will be removed in a future release

org.apache.lucene.index.TestIndexWriterThreadsToSegments > testSegmentCountOnFlushRandom {seed=[F7B4CD7A5624D5EC:E0E420713DA7EABA]} FAILED
    java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
        at org.junit.Assert.fail(Assert.java:87)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertTrue(Assert.java:53)
        at org.apache.lucene.index.TestIndexWriterThreadsToSegments$CheckSegmentCount.run(TestIndexWriterThreadsToSegments.java:150)
        at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:222)
        at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
        at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2434, name=Thread-2402, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2436, name=Thread-2404, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2435, name=Thread-2403, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2440, name=Thread-2408, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2441, name=Thread-2409, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2439, name=Thread-2407, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2433, name=Thread-2401, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2438, name=Thread-2406, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=2442, name=Thread-2410, state=RUNNABLE, group=TGRP-TestIndexWriterThreadsToSegments]

        Caused by:
        java.lang.RuntimeException: java.util.concurrent.BrokenBarrierException
            at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
            at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:239)

            Caused by:
            java.util.concurrent.BrokenBarrierException
                at java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:252)
                at java.base/java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:364)
                at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:236)
...
singh264 commented 9 months ago

@tajila what could be the next step(s) to find the root cause of the problem based on the analysis below?

It seems like creating a micro test that reproduces the issue or identifying the possible commit(s) in the lucene repo that may have caused the issue could be next steps.

openj9 jdk17

openjdk version "17.0.9-internal" 2023-10-17
OpenJDK Runtime Environment (build 17.0.9-internal+0-adhoc.root.openj9-openjdk-jdk17)
Eclipse OpenJ9 VM (build openj9-0.41.0, JRE 17 Linux amd64-64-Bit Compressed References 20231124_000000 (JIT enabled, AOT enabled)
OpenJ9   - 461bf3c70
OMR      - 5eee6ad9d
JCL      - 3699725139c based on jdk-17.0.9+9)

openj9 jdk21

openjdk version "21.0.1-internal" 2023-10-17
OpenJDK Runtime Environment (build 21.0.1-internal-adhoc.root.openj9-openjdk-jdk21)
Eclipse OpenJ9 VM (build master-1ba18acb3, JRE 21 Linux amd64-64-Bit Compressed References 20231128_000000 (JIT enabled, AOT enabled)
OpenJ9   - 1ba18acb3
OMR      - 8b19b8082
JCL      - 6f6d9413679 based on jdk-21.0.1+12)

hotspot jdk17

java version "17.0.9" 2023-10-17 LTS
Java(TM) SE Runtime Environment (build 17.0.9+11-LTS-201)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.9+11-LTS-201, mixed mode, sharing)

hotspot jdk21

java version "21.0.1" 2023-10-17 LTS
Java(TM) SE Runtime Environment (build 21.0.1+12-LTS-29)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.1+12-LTS-29, mixed mode, sharing
tajila commented 9 months ago

I would start by looking at what causes the assertion failure

at __randomizedtesting.SeedInfo.seed([F7B4CD7A5624D5EC]:0)
        at org.junit.Assert.fail(Assert.java:87)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertTrue(Assert.java:53)
        at org.apache.lucene.index.TestIndexWriterThreadsToSegments$CheckSegmentCount.run(TestIndexWriterThreadsToSegments.java:150)

Since you have access to the source you can instrument it to see where the discrepency arises from.

singh264 commented 9 months ago

The intermittent assertion failure in testSegmentCountOnFlushRandom seems to have started after https://github.com/apache/lucene/commit/bae5b338ed93b352062e6feb8321945fbed11ca5 was added in the lucene repo, and a solution to fix the intermittent assertion failure is to make the DocumentsWriterPerThreadPool.getAndLock() method synchronized as it seems to fix a race condition that occurs while multiple threads concurrently add documents to an index during the test.

uschindler commented 9 months ago

Thanks for the report. Indeed the method can be synchronized, because newWriter() is synchronized anyways, so it won't add more contention.

@mikemccand do you want to take the lead as you know IndexWriter better than I do?

I think the more serious issues here are the ones corrupting data while reading; there's no concurrency involved: https://github.com/eclipse-openj9/openj9/issues/18400#issuecomment-1820631659

singh264 commented 9 months ago

I think the more serious issues here are the ones corrupting data while reading; there's no concurrency involved: https://github.com/eclipse-openj9/openj9/issues/18400#issuecomment-1820631659

I was able to reproduce the failure in TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields in branch_9x in the lucene repo by adding the TEST_JVM_ARGS to the reproduce with: command:

./gradlew test --tests TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields -Dtests.seed=4CB1759046D1E087 -Dtests.multiplier=3 -Dtests.locale=bo-IN -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.jvmargs="-XX:+UseCompressedOops -Xgcpolicy:balanced"
$JAVA_HOME/bin/java -version
java version "11.0.20" 2023-07-18 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.20+9-LTS-256)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.20+9-LTS-256, mixed mode)

$RUNTIME_JAVA_HOME/bin/java -version
openjdk version "11.0.22-internal" 2024-01-16
OpenJDK Runtime Environment (build 11.0.22-internal+0-adhoc.root.openj9-openjdk-jdk11)
Eclipse OpenJ9 VM (build master-21f742fbc, JRE 11 Linux amd64-64-Bit Compressed References 20231208_000000 (JIT enabled, AOT enabled)
OpenJ9   - 21f742fbc
OMR      - 137778af4
JCL      - 22b36573fd based on jdk-11.0.22+6)

@tajila the cause of the failure in TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields seems to be due to the -Xgcpolicy option and the JIT:

mikemccand commented 9 months ago

@mikemccand do you want to take the lead as you know IndexWriter better than I do?

Thanks @singh264 and @uschindler -- I'll open a Lucene GitHub issue to track this on Lucene's side! Love the open source collaboration here!

singh264 commented 9 months ago

@tajila would it be appropriate to request the GC or JIT team to assist in finding the root cause of the failure in TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields, which seems to be related to the GC and JIT components based on https://github.com/eclipse-openj9/openj9/issues/18400#issuecomment-1850177955?

singh264 commented 9 months ago

A workaround for the failure in TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields is to use the default -Xgcpolicy:gencon.

singh264 commented 9 months ago

@dmitripivkine requesting your feedback on this issue. This issue indicated two lucene test failures (testSegmentCountOnFlushRandom and testSortedVariableLengthBigVsStoredFields) with the cause that seemed specific to openj9:

pshipton commented 9 months ago

@hzongaro pls take a look, https://github.com/eclipse-openj9/openj9/issues/18400#issuecomment-1850177955 indicates a JIT issue.

dmitripivkine commented 9 months ago

@dmitripivkine requesting your feedback on this issue. This issue indicated two lucene test failures (testSegmentCountOnFlushRandom and testSortedVariableLengthBigVsStoredFields) with the cause that seemed specific to openj9:

  • The cause of the failure in testSegmentCountOnFlushRandom seems to be due to a lucene code change based on this.
  • The cause of the failure in testSortedVariableLengthBigVsStoredFields seems to be due to openj9 GC and JIT components based on this.

The fact that Balanced and Metronome work with -Xint pointed to JIT most likely. Both of these GC policies required JIT to be configured differently (disabled optimizations mostly). So, I think JIT investigation is a reasonable next point. If you have a reason to think it is GC related we need system core captured at assertion point. Even with this core it might be hard to connect failure to GC activity, so detected discrepancies explanation appreciated.

singh264 commented 9 months ago

A core file was generated with:

./gradlew test --tests TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields -Dtests.seed=4CB1759046D1E087 -Dtests.multiplier=3 -Dtests.locale=bo-IN -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.jvmargs="-XX:+UseCompressedOops -Xgcpolicy:balanced -Xdump:java+system:events=throw,filter=java/lang/RuntimeException"
> !threads
    !stack 0x0001aa00   !j9vmthread 0x0001aa00  !j9thread 0x7f5600007660    tid 0x3ede42 (4120130) // (Test worker)
    !stack 0x0001e600   !j9vmthread 0x0001e600  !j9thread 0x7f5600b472a0    tid 0x3ede5c (4120156) // (JIT Compilation Thread-000 Suspended)
    !stack 0x00022300   !j9vmthread 0x00022300  !j9thread 0x7f5600b47818    tid 0x3ede5d (4120157) // (JIT Compilation Thread-001)
    !stack 0x00025f00   !j9vmthread 0x00025f00  !j9thread 0x7f5600b487b0    tid 0x3ede5e (4120158) // (JIT Compilation Thread-002 Suspended)
    !stack 0x00029c00   !j9vmthread 0x00029c00  !j9thread 0x7f5600b48d28    tid 0x3ede5f (4120159) // (JIT Compilation Thread-003 Suspended)
    !stack 0x0002d800   !j9vmthread 0x0002d800  !j9thread 0x7f5600b49cc0    tid 0x3ede60 (4120160) // (JIT Compilation Thread-004 Suspended)
    !stack 0x00031500   !j9vmthread 0x00031500  !j9thread 0x7f5600b4a238    tid 0x3ede61 (4120161) // (JIT Compilation Thread-005 Suspended)
    !stack 0x00035100   !j9vmthread 0x00035100  !j9thread 0x7f5600b4b1d0    tid 0x3ede62 (4120162) // (JIT Compilation Thread-006 Suspended)
    !stack 0x00038e00   !j9vmthread 0x00038e00  !j9thread 0x7f5600b4b748    tid 0x3ede63 (4120163) // (JIT Diagnostic Compilation Thread-007 Suspended)
    !stack 0x0003ca00   !j9vmthread 0x0003ca00  !j9thread 0x7f5600bab310    tid 0x3ede65 (4120165) // (JIT-SamplerThread)
    !stack 0x00040700   !j9vmthread 0x00040700  !j9thread 0x7f5600bab888    tid 0x3ede66 (4120166) // (IProfiler)
    !stack 0x0013c500   !j9vmthread 0x0013c500  !j9thread 0x7f5600bdeb10    tid 0x3ede67 (4120167) // (Common-Cleaner)
    !stack 0x00221300   !j9vmthread 0x00221300  !j9thread 0x7f5600cded40    tid 0x3ede69 (4120169) // (Dedicated GC Main)
    !stack 0x00224f00   !j9vmthread 0x00224f00  !j9thread 0x7f5600cdf2b8    tid 0x3ede6a (4120170) // (GC Worker)
    !stack 0x00228c00   !j9vmthread 0x00228c00  !j9thread 0x7f5600cdffd0    tid 0x3ede6b (4120171) // (GC Worker)
    !stack 0x0022c800   !j9vmthread 0x0022c800  !j9thread 0x7f5600ce0548    tid 0x3ede6c (4120172) // (GC Worker)
    !stack 0x00230500   !j9vmthread 0x00230500  !j9thread 0x7f5600ce1260    tid 0x3ede6d (4120173) // (GC Worker)
    !stack 0x00234100   !j9vmthread 0x00234100  !j9thread 0x7f5600ce17d8    tid 0x3ede6e (4120174) // (GC Worker)
    !stack 0x00237e00   !j9vmthread 0x00237e00  !j9thread 0x7f5600ce24f0    tid 0x3ede6f (4120175) // (GC Worker)
    !stack 0x0023ba00   !j9vmthread 0x0023ba00  !j9thread 0x7f5600ce2a68    tid 0x3ede70 (4120176) // (GC Worker)
    !stack 0x00260400   !j9vmthread 0x00260400  !j9thread 0x7f5600ce45f8    tid 0x3ede72 (4120178) // (Attach API wait loop)
    !stack 0x00247700   !j9vmthread 0x00247700  !j9thread 0x7f5600ce4080    tid 0x3ede73 (4120179) // (ClassByNameCache Reaper)
    !stack 0x003d2c00   !j9vmthread 0x003d2c00  !j9thread 0x7f5600f94610    tid 0x3ede78 (4120184) // (/127.0.0.1:35490 to /127.0.0.1:45649 workers)
    !stack 0x003e6900   !j9vmthread 0x003e6900  !j9thread 0x7f5600f94b88    tid 0x3ede79 (4120185) // (/127.0.0.1:35490 to /127.0.0.1:45649 workers Thread 2)
    !stack 0x003ea500   !j9vmthread 0x003ea500  !j9thread 0x7f5600fbbb60    tid 0x3ede7a (4120186) // (/127.0.0.1:35490 to /127.0.0.1:45649 workers Thread 3)
    !stack 0x004c0b00   !j9vmthread 0x004c0b00  !j9thread 0x7f5600fbc0d8    tid 0x3ede7b (4120187) // (SUITE-TestLucene90DocValuesFormat-seed#[4CB1759046D1E087])
    !stack 0x003e7f00   !j9vmthread 0x003e7f00  !j9thread 0x7f55440e5ba0    tid 0x3ede7c (4120188) // (TEST-TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields-seed#[4CB1759046D1E087])
    !stack 0x006d3e00   !j9vmthread 0x006d3e00  !j9thread 0x7f55440e6118    tid 0x3ede7e (4120190) // (Finalizer thread)
    !stack 0x006d3000   !j9vmthread 0x006d3000  !j9thread 0x7f554050af10    tid 0x3ede7f (4120191) // (Lucene Merge Thread #0) <-----

> !stack 0x006d3000
<6d3000>                            Generic special frame
<6d3000>    !j9method 0x00000000005CD458   org/apache/lucene/store/ByteArrayDataInput.readVInt()I
<6d3000>    !j9method 0x000000000073E358   org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer$TermsDict.next()Lorg/apache/lucene/util/BytesRef;
<6d3000>    !j9method 0x0000000000739488   org/apache/lucene/index/TermsEnumIndex.next()Lorg/apache/lucene/util/BytesRef;
<6d3000>    !j9method 0x000000000073F258   org/apache/lucene/index/OrdinalMap.<init>(Lorg/apache/lucene/index/IndexReader$CacheKey;[Lorg/apache/lucene/index/TermsEnum;Lorg/apache/lucene/index/OrdinalMap$SegmentMap;F)V
<6d3000>    !j9method 0x000000000073F238   org/apache/lucene/index/OrdinalMap.build(Lorg/apache/lucene/index/IndexReader$CacheKey;[Lorg/apache/lucene/index/TermsEnum;[JF)Lorg/apache/lucene/index/OrdinalMap;
<6d3000>    !j9method 0x00000000005A4090   org/apache/lucene/codecs/DocValuesConsumer.mergeSortedField(Lorg/apache/lucene/index/FieldInfo;Lorg/apache/lucene/index/MergeState;)V
<6d3000>    !j9method 0x00000000005A3FF0   org/apache/lucene/codecs/DocValuesConsumer.merge(Lorg/apache/lucene/index/MergeState;)V
<6d3000>    !j9method 0x00000000005A4220   org/apache/lucene/codecs/perfield/PerFieldDocValuesFormat$FieldsWriter.merge(Lorg/apache/lucene/index/MergeState;)V
<6d3000>    !j9method 0x000000000072C560   org/apache/lucene/index/SegmentMerger.mergeDocValues(Lorg/apache/lucene/index/SegmentWriteState;Lorg/apache/lucene/index/SegmentReadState;)V
<6d3000>    !j9method 0x0000000000743F98   org/apache/lucene/index/SegmentMerger$$Lambda$205/0x0000000000000000.merge(Lorg/apache/lucene/index/SegmentWriteState;Lorg/apache/lucene/index/SegmentReadState;)V
<6d3000>    !j9method 0x000000000072C680   org/apache/lucene/index/SegmentMerger.mergeWithLogging(Lorg/apache/lucene/index/SegmentMerger$VoidMerger;Lorg/apache/lucene/index/SegmentWriteState;Lorg/apache/lucene/index/SegmentReadState;Ljava/lang/String;I)V
<6d3000>    !j9method 0x000000000072C520   org/apache/lucene/index/SegmentMerger.merge()Lorg/apache/lucene/index/MergeState;
<6d3000>    !j9method 0x000000000048E798   org/apache/lucene/index/IndexWriter.mergeMiddle(Lorg/apache/lucene/index/MergePolicy$OneMerge;Lorg/apache/lucene/index/MergePolicy;)I
<6d3000>    !j9method 0x000000000048E618   org/apache/lucene/index/IndexWriter.merge(Lorg/apache/lucene/index/MergePolicy$OneMerge;)V
<6d3000>    !j9method 0x0000000000632738   org/apache/lucene/index/IndexWriter$IndexWriterMergeSource.merge(Lorg/apache/lucene/index/MergePolicy$OneMerge;)V
<6d3000>    !j9method 0x000000000045ED98   org/apache/lucene/index/ConcurrentMergeScheduler.doMerge(Lorg/apache/lucene/index/MergeScheduler$MergeSource;Lorg/apache/lucene/index/MergePolicy$OneMerge;)V
<6d3000>    !j9method 0x00000000006D83F8   org/apache/lucene/index/ConcurrentMergeScheduler$MergeThread.run()V
<6d3000>                            JNI call-in frame
<6d3000>                            Native method frame

> !stackslots 0x006d3000
<6d3000> *** BEGIN STACK WALK, flags = 00400001 walkThread = 7,155,712 ***
<6d3000>    ITERATE_O_SLOTS
<6d3000>    RECORD_BYTECODE_PC_OFFSET
<6d3000> Initial values: walkSP = 0x00000000006F95C8, PC = 0x0000000000000001, literals = 0x0000000000000000, A0 = 0x00000000006F95E0, j2iFrame = 0x0000000000000000, decomp = 0x0000000000000000
<6d3000> Generic special frame: bp = 0x00000000006F95E0, sp = 0x00000000006F95C8, pc = 0x0000000000000001, cp = 0x0000000000000000, arg0EA = 0x00000000006F95E0, flags = 0x0000000000000000
<6d3000> Bytecode frame: bp = 0x00000000006F95F8, sp = 0x00000000006F95E8, pc = 0x00007F5540119215, cp = 0x00000000005CD4E0, arg0EA = 0x00000000006F9610, flags = 0x0000000000000000
<6d3000>    Method: org/apache/lucene/store/ByteArrayDataInput.readVInt()I !j9method 0x00000000005CD458
<6d3000>    Bytecode index = 173
<6d3000>    Using local mapper
<6d3000>    Locals starting at 0x00000000006F9610 for 0x0000000000000003 slots
<6d3000>        I-Slot: a0[0x00000000006F9610] = 0x00000000FACADAB0 <-----
<6d3000>        I-Slot: t1[0x00000000006F9608] = 0x00000000FFFFFF8F
<6d3000>        I-Slot: t2[0x00000000006F9600] = 0x00000000F1E3C78F
<6d3000> Bytecode frame: bp = 0x00000000006F9630, sp = 0x00000000006F9618, pc = 0x00007F552C07B0EA, cp = 0x000000000073D9F0, arg0EA = 0x00000000006F9658, flags = 0x0000000000000000
...

> !j9object 0x00000000FACADAB0
!J9Object 0x00000000FACADAB0 {
    struct J9Class* clazz = !j9class 0x5CD000 // org/apache/lucene/store/ByteArrayDataInput
    Object flags = 0x00000000;
    I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
    [B bytes = !fj9object 0xfa780000 (offset = 4) (org/apache/lucene/store/ByteArrayDataInput)
    I pos = 0x00001BF3 (offset = 8) (org/apache/lucene/store/ByteArrayDataInput)
    I limit = 0x000A179E (offset = 12) (org/apache/lucene/store/ByteArrayDataInput)
}
hzongaro commented 9 months ago

@BradleyWood, may I ask you to take a look at what are potentially JIT-related failures reported here?

dmitripivkine commented 9 months ago

A core file was generated with:

@singh264 I have no access to the file, you should change permissions in the box.

Also please summarize the problem for this particular core (deeper than trivial output you have provided).

singh264 commented 9 months ago

I have no access to the file, you should change permissions in the box

The core file can now be downloaded with the google drive link, I was unable to extend the expiration date of the box link.

point to exception line in the source code describe what is expected behaviour is and what is observed instead

A RuntimeException is thrown by ByteArrayDataInput.readVInt() when running the TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields test.

  public int readVInt() {
    byte b = bytes[pos++];
    if (b >= 0) return b;
    int i = b & 0x7F;
    b = bytes[pos++];
    i |= (b & 0x7F) << 7;
    if (b >= 0) return i;
    b = bytes[pos++];
    i |= (b & 0x7F) << 14;
    if (b >= 0) return i;
    b = bytes[pos++];
    i |= (b & 0x7F) << 21;
    if (b >= 0) return i;
    b = bytes[pos++];
    // Warning: the next ands use 0x0F / 0xF0 - beware copy/paste errors:
    i |= (b & 0x0F) << 28;
    if ((b & 0xF0) == 0) return i;
    throw new RuntimeException("Invalid vInt detected (too many bits)");
  }

The expected behaviour is that readVInt() reads an integer from the byte array by reading the lower 7 bits of each byte until the most significant bit of the byte is 0 (b >= 0) and without reading more than 32 bits from the byte array. The actual behaviour is that the byte array includes 5 consecutive bytes with most significant bit that is 1, which suggests that the integer is stored in more than 32 bits in the byte array:

> !j9object 0x00000000FACADAB0
!J9Object 0x00000000FACADAB0 {
    struct J9Class* clazz = !j9class 0x5CD000 // org/apache/lucene/store/ByteArrayDataInput
    Object flags = 0x00000000;
    I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
    [B bytes = !fj9object 0xfa780000 (offset = 4) (org/apache/lucene/store/ByteArrayDataInput) <-----
    I pos = 0x00001BF3 (offset = 8) (org/apache/lucene/store/ByteArrayDataInput) // 7155
    I limit = 0x000A179E (offset = 12) (org/apache/lucene/store/ByteArrayDataInput)
}

> !j9indexableobject 0x00000000FA780000  7145 7156
!J9IndexableObject 0x00000000FA780000 {
    struct J9Class* clazz = !j9arrayclass 0x4A700   // [B
    Object flags = 0x00000000;
    U_32 size = 0x000A77C7;
    [7145] =  13, 0x0d
    [7146] =  13, 0x0d
    [7147] =  13, 0x0d
    [7148] = 143, 0x8f
    [7149] = 143, 0x8f
    [7150] = 143, 0x8f <----- read
    [7151] = 143, 0x8f <----- read
    [7152] = 143, 0x8f <----- read
    [7153] = 143, 0x8f <----- read
    [7154] = 143, 0x8f <----- read
    [7155] = 143, 0x8f
To print entire range: !j9indexableobject 0x00000000FA780000 0 686023

any ideas how test behaviour might be impacted by GC (collected/not collected objects, References processing involved etc.)

I'd like guidance on exploring these suggestions to understand how GC-related components impact the test behaviour. I will investigate the utilization of bytes array, with the default -Xgcpolicy:gencon option (where the test passes) and the -Xgcpolicy:balanced option (where the test fails) within the java code, and I will examine the -Xverbosegclog to know when the GC is invoked, especially where invalid data seems to be written to the bytes array.

dmitripivkine commented 9 months ago

One specific for Balanced (and Metronome of course) detail is object !j9indexableobject 0x00000000FA780000 0 686023 is Discontiguous Hybrid 0xa77c7 elements total:

> !j9x 0x00000000FA780000,80,4
0xFA780000 :  0004a700 00000000 000a77c7 00000000 [ .........w...... ]
0xFA780010 :  00000000 00000000 fa980000 fa780020 [ ............ .x. ]
0xFA780020 :  00ffffff 00000000 76767600 8c767676 [ .........vvvvvv. ]
0xFA780030 :  00212121 955757d8 00dfdfdf eeaeae7a [ !!!..WW.....z... ]
0xFA780040 :  ff6f6f6f cccccc40 f7f7f7f7 bef7f7f7 [ ooo.@........... ]
0xFA780050 :  ff8b8b8b 2b2b2b2d 15151515 82151515 [ ....-+++........ ]
0xFA780060 :  ff5c5c5c ff9b9b99 00ffffff 36363607 [ \\\..........666 ]
0xFA780070 :  3e3e3e3e 6c3e3e3e ffa5a5a5 70d0d092 [ >>>>>>>l.......p ]

0xFA980000 :  000000003732f200 ffffffffddd91106 [ ..27............ ]
0xFA980010 :  ffffffffac4b9c60 ffffffffdba9111b [ `.K............. ]
0xFA980020 :  ffffffffb57f9eb4 00000000603fdf79 [ ........y.?`.... ]
0xFA980030 :  ffffffff98d36ae7 00000000547952f1 [ .j.......RyT.... ]
0xFA980040 :  fffffffff5b14679 ffffffffd28071b3 [ yF.......q...... ]
0xFA980050 :  ffffffffedc0d5e3 00000000621a2912 [ .........).b.... ]
0xFA980060 :  ffffffffceaa83d3 0000000058b2e4fb [ ...........X.... ]
0xFA980070 :  000000001e51643d 000000001884cdcf [ =dQ............. ]
0xFA980080 :  000000001c213f7d 0000000051e2cd7e [ }?!.....~..Q.... ]
0xFA980090 :  000000003da39a1c ffffffffe9520a91 [ ...=......R..... ]
0xFA9800A0 :  ffffffffe97105a8 ffffffffe97bbaca [ ..q.......{..... ]
...

First 0x80000 elements are stored in the leaf region starting at 0xFA980000, reminder 0x277C7 elements are inlined to the Arraylet spine starting at 0xFA780020. There is no reason to suspect GC activity (core looks clean from GC perspective), however the problem might be at VM, JIT or test itself side somehow failed to handle Discontiguous Hybrid array format and assumes contiguous layout.

@singh264 Would you please try this experiment - increase region size from current 512k (0x80000) to 1m or even higher. I don't know can test use arrays larger than current 0xa77c7. Please use -Xgc:regionSize=1048576 for 1m or even larger (size of the region must be power of 2). If problem goes away with larger region size it is a good start point to investigate where Contiguous array is assumed wrongly.

singh264 commented 9 months ago

Would you please try this experiment - increase region size from current 512k (0x80000) to 1m or even higher. I don't know can test use arrays larger than current 0xa77c7. Please use -Xgc:regionSize=1048576 for 1m or even larger (size of the region must be power of 2)

When running TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields:

./gradlew test --tests TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields -Dtests.seed=4CB1759046D1E087 -Dtests.multiplier=3 -Dtests.locale=bo-IN -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.jvmargs="-XX:+UseCompressedOops -Xgcpolicy:balanced"
singh264 commented 8 months ago

The test TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields passes with -Xjit:exclude={org/apache/lucene/util/compress/LZ4$HighCompressionHashTable*,java/lang/invoke/*} for both -Xgcpolicy:balanced and -Xgcpolicy:metronome.

./gradlew test --tests TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields -Dtests.seed=4CB1759046D1E087 -Dtests.multiplier=3 -Dtests.locale=bo-IN -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.jvmargs="-XX:+UseCompressedOops -Xgcpolicy:balanced -Xjit:exclude={org/apache/lucene/util/compress/LZ4$HighCompressionHashTable*,java/lang/invoke/*}"
singh264 commented 8 months ago

The test TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields passes with openjdk method handles enabled in openj9 jdk11.

babsingh commented 8 months ago

fyi @0xdaryl @hzongaro @jdmpapin @nbhuiyan, earlier in our meeting, I had referred to this user issue; the failure is most likely related to OpenJ9 MHs, and it is resolved with OJDK MHs as per https://github.com/eclipse-openj9/openj9/issues/18400#issuecomment-1883940429.

singh264 commented 3 months ago

How can I know the status of OJDK MHs as the failure is most likely related to OpenJ9 MHs, and it is resolved with OJDK MHs as per https://github.com/eclipse-openj9/openj9/issues/18400#issuecomment-1883940429?

singh264 commented 3 months ago

How can I know the status of OJDK MHs as the failure is most likely related to OpenJ9 MHs, and it is resolved with OJDK MHs as per https://github.com/eclipse-openj9/openj9/issues/18400#issuecomment-1883940429?

It seems like currently OJDK MHs cannot be enabled until the performance issues are resolved.

singh264 commented 3 months ago

It seems like currently OJDK MHs cannot be enabled until the performance issues are resolved.

How can I find information about the performance issues that need to be resolved?

babsingh commented 3 months ago

For JDK8 and JDK11, OJDK MH perf is being tracked through https://github.com/eclipse-openj9/openj9/issues/12728. We are trying different solutions to address the perf issues.

In the meanwhile, workarounds mentioned in the previous comments can be used to address the failure. JDK17+ can also be used to address the failure since OJDK MHs are enabled in JDK17+.

singh264 commented 3 months ago

How can I know if it is currently feasible to utilize JDK17+ to address the failure since OJDK MHs are enabled in JDK17+?

babsingh commented 3 months ago

Your previous test in https://github.com/eclipse-openj9/openj9/issues/18400#issuecomment-1883940429 implies that the issue will be addressed with JDK17+. Nothing more needs to be done.