eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

GC crashed with an assertion failure in Scavenger.cpp #17052

Closed connglli closed 9 months ago

connglli commented 1 year ago

Java version

openjdk version "11.0.19-internal" 2023-04-18
OpenJDK Runtime Environment (build 11.0.19-internal+0-adhoc..openj9-openjdk-jdk11)
Eclipse OpenJ9 VM (build master-f14707f, JRE 11 Linux amd64-64-Bit Compressed References 20230303_000000 (JIT enabled, AOT enabled)
OpenJ9   - f14707f
OMR      - 342b647
JCL      - a8cc681 based on jdk-11.0.19+4)

Javac version

javac 11.0.19-internal

Code and summary of the problem

Code and logs: see issue17052.tar.gz. The code Test.java is a bit long because I ain't able to reduce it.

This seems to be the most reproducible problem out of the 12 as mentioned in our previous chat. I can reproduce exactly the same assertion failure (Scavenger.cpp:1669) 3 times out of 4 runs. I put logs of all these 4 runs in the link above, each in a directory named log.XXXXX. Hope the same frequency applies to you too. It makes OpenJ9 to crash in GC and should be a JIT bug since there aren't any crash if -Xint is enabled.

The test exited with code 255 and the assertion log (log.bOTRV, log.D1Uma, log.RtuV0) looks like

Corruption in Evacuate at 00000000FFD00048: calculated object size 140448003833796 larger than available 1572792, Forwarded Header at 00007FBC752B7300
21:12:28.372 0x25a500    j9mm.141    *   ** ASSERTION FAILED ** at /root/hostdir/openj9-openjdk-jdk11/omr/gc/base/standard/Scavenger.cpp:1669: ((false))
JVMDUMP039I Processing dump event "traceassert", detail "" at 2023/03/15 21:12:28 - please wait.
JVMDUMP032I JVM requested System dump using '/home/cli/ax-exp/openj9/Artemis/core.20230315.211228.900368.0001.dmp' in response to an event
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" specifies that the core dump is to be piped to an external program.  Attempting to rename either core or core.900864.  Review the manual for the external program to find where the core dump is written and ensure the program does not truncate it.

JVMPORT049I The core file created by child process with pid = 900864 was not found. Review the documentation for the /proc/sys/kernel/core_pattern program "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" to find where the core file is written and ensure that program does not truncate it.

JVMDUMP012E Error in System dump: /home/cli/ax-exp/openj9/Artemis/core.20230315.211228.900368.0001.dmp
JVMDUMP032I JVM requested Java dump using '/home/cli/ax-exp/openj9/Artemis/javacore.20230315.211228.900368.0002.txt' in response to an event
JVMDUMP010I Java dump written to /home/cli/ax-exp/openj9/Artemis/javacore.20230315.211228.900368.0002.txt
JVMDUMP032I JVM requested Snap dump using '/home/cli/ax-exp/openj9/Artemis/Snap.20230315.211228.900368.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /home/cli/ax-exp/openj9/Artemis/Snap.20230315.211228.900368.0003.trc
JVMDUMP013I Processed dump event "traceassert", detail "".

The last run (log.d3ZAp) crashed with a different assertion failure:

07:29:17.836 0x1aa00    j9mm.107    *   ** ASSERTION FAILED ** at /root/hostdir/openj9-openjdk-jdk11/omr/gc/base/MemoryPoolAddressOrderedList.cpp:1428: ((false && ((__null == nextFreeEntry) || (addrTop <= nextFreeEntry))))
JVMDUMP039I Processing dump event "traceassert", detail "" at 2023/03/28 15:29:17 - please wait.
JVMDUMP032I JVM requested System dump using '/home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/core.20230328.152917.2166141.0001.dmp' in response to an event
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" specifies that the core dump is to be piped to an external program.  Attempting to rename either core or core.2166212.  Review the manual for the external program to find where the core dump is written and ensure the program does not truncate it.

JVMPORT049I The core file created by child process with pid = 2166212 was not found. Review the documentation for the /proc/sys/kernel/core_pattern program "|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" to find where the core file is written and ensure that program does not truncate it.

JVMDUMP012E Error in System dump: /home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/core.20230328.152917.2166141.0001.dmp
JVMDUMP032I JVM requested Java dump using '/home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/javacore.20230328.152917.2166141.0002.txt' in response to an event
JVMDUMP010I Java dump written to /home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/javacore.20230328.152917.2166141.0002.txt
JVMDUMP032I JVM requested Snap dump using '/home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/Snap.20230328.152917.2166141.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /home/simon/Desktop/test-jitcomp/ax-eval/bugs2/12.openj9/mutant/red/Snap.20230328.152917.2166141.0003.trc
JVMDUMP013I Processed dump event "traceassert", detail "".
connglli commented 1 year ago

For the rest 11, see issue17052rest11.tar.gz

For convenience, I summarize them in the following table. I hope this can help you, simplifying your work (I know it's somewhat a headache, so I'm sorry). I put each test per directory named X.openj9, each following the same directory structure as the main issue mentioned above. In the table, the key perhaps_unique is the unique symptom (if it is) that I wanted to display to you. But as you may noticed, for some of them, I'm not able to reproduce for a second time. The key log.XXXXX is a summary of the failure I have encountered when trying to reproduce the corresponding test.

In addition, besides the (perhaps) unique symptom, there're chances that different tests can lead to the same crash symptom because of their non-determinism (and some tests can even trigger crashes that I've reported before like #15476). This indicate they are maybe, not unique?

3.openj9
- perhaps_unique: assertion failed ScavengerDelegate.cpp:469: ((false))
- log.1B22R: assertion failed ScavengerDelegate.cpp:469: ((false))
- log.DZo00: assertion failed Scavenger.cpp:1669: ((false))
- log.xhKX6: segmentation error ZN12MM_Scavenger14copyForVariantILb0EEEP8J9ObjectP22MM_EnvironmentStandardP18MM_ForwardedHeader+0x77e

5.openj9 
- perhaps_unique: assertion failed ScavengerRootScanner.hpp:105: ((MM_StackSlotValidator(MM_StackSlotValidator::COULD_BE_FORWARDED, *slotPtr, stackLocation, walkState).validate(_env)))
- log.vfCGQ: assertion failed ScavengerRootScanner.hpp:105: ((MM_StackSlotValidator(MM_StackSlotValidator::COULD_BE_FORWARDED, *slotPtr, stackLocation, walkState).validate(_env)))
- log.3u8ai: assertion failed ScavengerRootScanner.hpp:109: ((MM_StackSlotValidator(MM_StackSlotValidator::NOT_ON_HEAP, *slotPtr, stackLocation, walkState).validate(_env)))
- log.QDWaj: assertion failed ScavengerRootScanner.hpp:109: ((MM_StackSlotValidator(MM_StackSlotValidator::NOT_ON_HEAP, *slotPtr, stackLocation, walkState).validate(_env)))

6.openj9
- perhaps_unique: assertion failed Scavenger.cpp:1634: ((false))
- log.ES0IO: assertion failed Scavenger.cpp:1669: ((false))
- log.DHfO3: segmentation error _ZN18MM_ForwardedHeader17copyOrWaitOutlineEP8J9Object+0x30
- log.eJxJG: assertion failed ScavengerDelegate.cpp:469: ((false))

7.openj9
- perhaps_unique: segmentation error old_fast_jitLookupInterfaceMethod+0x4d
- log.wER9m: segmentation error old_fast_jitLookupInterfaceMethod+0x4d
- log.g4c0e: assertion failed ScavengerDelegate.cpp:469: ((false))
- log.AYVgj: assertion failed Scavenger.cpp:2731: ((false))

8.openj9
- perhaps_unique: assertion failed SweepPoolManagerAddressOrderedListBase.cpp:481: ((false && (objectSizeDelta <= heapFreeByteCount)))
- log.jYkx3: assertion failed SweepPoolManagerAddressOrderedListBase.cpp:481: ((false && (objectSizeDelta <= heapFreeByteCount)))
- log.26AEM: assertion failed SweepPoolManagerAddressOrderedListBase.cpp:481: ((false && (objectSizeDelta <= heapFreeByteCount)))
- log.PFDIo: segme error _ZN22GC_ObjectModelDelegate29calculateObjectDetailsForCopyEP18MM_EnvironmentBaseP18MM_ForwardedHeaderPmS4_S4_+0x34

9.openj9
- perhaps_unique: segmentation error _ZN23MM_TLHAllocationSupport7refreshEP18MM_EnvironmentBaseP22MM_AllocateDescriptionb+0x338 
- log.0gxUk: segmentation error _ZN23MM_TLHAllocationSupport7refreshEP18MM_EnvironmentBaseP22MM_AllocateDescriptionb+0x338 
- log.4f6XI: segmentation error _ZN22GC_ObjectModelDelegate29calculateObjectDetailsForCopyEP18MM_EnvironmentBaseP18MM_ForwardedHeaderPmS4_S4_+0x34
- log.nBVOa: segmentation error _ZN22GC_ObjectModelDelegate29calculateObjectDetailsForCopyEP18MM_EnvironmentBaseP18MM_ForwardedHeaderPmS4_S4_+0x34

11.openj9
- perhaps_unique: segmentation error findFieldInClass+0x99
- log.R1thu: assertion failed ScavengerRootScanner.hpp:109: ((MM_StackSlotValidator(MM_StackSlotValidator::NOT_ON_HEAP, *slotPtr, stackLocation, walkState).validate(_env)))
- log.COHB9: segmentation error _ZN23MM_ScavengerRootClearer26scavengeUnfinalizedObjectsEP22MM_EnvironmentStandard+0x300
- log.E7Gfs: assertion failed ScavengerRootClearer.cpp:198: ((false && (__null != forwardedPtr)))

13.openj9
- perhaps_unique: segmentation error haltThreadForInspection+0xa3 
- log.nUWNA: segmentation error _ZN31MM_MemoryPoolAddressOrderedList11allocateTLHEP18MM_EnvironmentBaseP22MM_AllocateDescriptionmRPvS5_+0x7f
- log.cfpHi segmentation error protectedBacktrace+0x12
- log.QUpX3 segmentation error protectedBacktrace+0x12

14.openj9
- perhaps_unique: assertion failed MemorySubSpace.cpp:185: ((false && (reason <= LOA_EXPAND_LAST_RESIZE_REASON)))
- log.9CGLb: assertion failed MemorySubSpace.cpp:185: ((false && (reason <= LOA_EXPAND_LAST_RESIZE_REASON)))

15.openj9
- perhaps_unique: assertion failed Scavenger.cpp:2731: ((false))
- log.cYvMU: assertion failed Scavenger.cpp:2731: ((false))
- log.TJ8np: assertion failed Scavenger.cpp:2731: ((false))
- log.3FZkI: assertion failed ConcurrentMarkingDelegate.cpp:58: ((MM_StackSlotValidator(0, object, stackLocation, walkState).validate(data->env)))
- log.khhfj: segmentation error romFieldSize+0xc3
- log.tD88r: assertion failed ScavengerDelegate.cpp:469: ((false))

16.openj9
- perhaps_unique: assertion failed MarkingSchemeRootMarker.cpp:48: ((MM_StackSlotValidator(0, object, stackLocation, walkState).validate(_env)))
- log.0xAEW: assertion failed Scavenger.cpp:1669: ((false))
- log.N97lH: segmentation error _ZN22GC_ObjectModelDelegate29calculateObjectDetailsForCopyEP18MM_EnvironmentBaseP18MM_ForwardedHeaderPmS4_S4_+0x34 
- log.vqj3w: segmentation error _ZN12MM_Scavenger16rescanThreadSlotEP22MM_EnvironmentStandardPP8J9Object+0xea 
- log.yTdHX: segmentation error _ZN18MM_ForwardedHeader17copyOrWaitOutlineEP8J9Object+0x30
pshipton commented 1 year ago

@0xdaryl @hzongaro

dmitripivkine commented 1 year ago

Most of these assertions in GC might be triggered by heap corruption. There are a very few of them I need to double check, the connection with possible heap corruption is not obvious.

dmitripivkine commented 1 year ago

I walked this list of assertions/crashes in GC. There is only one of them can not be explained by heap corruption directly:

If somebody needs triaging of failures in GC listed above please let me know, I will help with this. Of course, system core is necessary.

connglli commented 1 year ago

Oops. That happens to be the test I was unable to reproduce (and my procedure deleted the log periodically)..... Still hope that test could be helpful, to some extend. I'll try again to see if I can reproduce it in more runs.

connglli commented 1 year ago

I have reproduced the failure mentioned by @dmitripivkine. See logs: issue17052rest11.14.tar.gz.

I edited and also updated the issue17052rest11.tar.gz in my previous comment.

dmitripivkine commented 1 year ago

The problem starts at attempt to allocate object with 1836453136 bytes size. With -Xmx1G specified it is not possible obviously:

14:52:37.237385000                *0x0       j9mm.133  Event       Allocation failure start: newspace=8080/2097152 oldspace=5649544/6291456 loa=0/0 requestedbytes=1836453136

@connglli Do you know is allocation of this array of such size is intentional for this test load?

dmitripivkine commented 1 year ago

@connglli The rational for question above:

connglli commented 1 year ago

@dmitripivkine The code is generated by our fuzzers. In our fuzzer implementation, we have restricted the size of any array allocation. I have checked the code again, and didn't notice any explicit array allocation of which the size is larger than 524,288(=256x256x8) bytes. But we did generate some library method calls like those of Double, Float, BigDecimal, BigInteger, etc. I suppose they don't have some implicit large array allocations, either (even though I'm not quite sure).

connglli commented 1 year ago

Could you please check the stack trace such that we can get some glue on which array allocation caused this assertion failure? By the way, I'm sorry I'm on my vacation so I might not be that on call.

dmitripivkine commented 1 year ago

Could you please check the stack trace such that we can get some glue on which array allocation caused this assertion failure? By the way, I'm sorry I'm on my vacation so I might not be that on call.

3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/nio/HeapByteBuffer.<init>(HeapByteBuffer.java:61)
4XESTACKTRACE                at java/nio/ByteBuffer.allocate(ByteBuffer.java:348)
4XESTACKTRACE                at Test.mainTest(Test.java:600(Compiled Code))
4XESTACKTRACE                at Test.main(Test.java:726)
connglli commented 1 year ago

Oh sure yes, it is a library call ax$45.write(ByteBuffer.allocate(i22)). So for this test an OOM should be thrown instead of an assertion failure. I forgot to check the size of the argument i22 in my fuzzer implementation. But seems that it is a good design to abandon all size checks so that we can find some corner-cases like this :-).

How about the others, I didn't notice ByteBuffer.allocate() in the test like the main comment.

dmitripivkine commented 1 year ago

An assertion in MemorySubSpace.cpp:185 is an obvious problem in GC code (and as a GC Team member I have a priority). The rest of listed assertions are triggered by external to GC reasons (bad O-slot in java stack or heap corruption). I can triage such assertion/crash problems and assign to right component (JIT most likely) but I need system dump to look at. And I don't see any at the moment. Do you have any system dumps left around?

hzongaro commented 1 year ago

I was able to reproduce this failure with this option:

-Xjit:limit={Test.fMeth*},optLevel=hot,lastOptIndex=84

but not with lastOptIndex=83. In this case, opt index 84 was Loop Versioner. The transformations listed for that method in Loop Versioner were

[  3719]  84.1    O^O LOOP VERSIONER: Versioning natural loop 326
[  3720]  84.2    O^O LOOP VERSIONER:  Creating test outside loop for deciding if async check is required
[  3721]  84.3    O^O LOOP VERSIONER:  Creating test outside loop for checking if n4045n [00007F7EA875E010] exceeds bounds

This bears some resemblance to the situation described in issue #17249, so it could be related. I'll assign it to myself for now, in case this is a duplicate.

hzongaro commented 1 year ago

I haven't been able to make progress on this one for 0.41. I will move it to the 0.43 release.

hzongaro commented 1 year ago

I think this is likely a duplicate of #17249, so I'll assign it to @a7ehuo.

a7ehuo commented 12 months ago

I can still reproduce the GC assert with the fix for #17249. This issue is also reproducible with disableAsyncCheckVersioning.

-Xjit:limit={Test.fMeth*},{*}(disableAsyncCheckVersioning,optLevel=hot,lastOptIndex=84,optDetails,traceFull,traceILGen,traceCG,traceLoopVersioner,log=logs/jitlog)' Test
Bad scan type for object pointer 00000000A3617EA8
14:37:34.934 0x1aa00    j9mm.141    *   ** ASSERTION FAILED ** at /root/home/ahuo/src/openj9-openjdk-jdk17/openj9/runtime/gc_glue_java/ScavengerDelegate.cpp:480: ((false))

The GC assert happens with lastOptSubIndex=2 but not lastOptSubIndex=1. This issue seems related to versioning bound check. Meanwhile neither _conditionalTree nor _duplicatedContionalTree is set, which is related to the root cause of #17249.

[  1974]  84.1    O^O LOOP VERSIONER: Versioning natural loop 14
versionNaturalLoop: DEBUG calling buildBoundCheckComparisonsTree. _conditionalTree 0000000000000000 n-1n _duplicatedContionalTree 0000000000000000 n-1n
[  1975]  84.2    O^O LOOP VERSIONER:  Creating test outside loop for checking if n824n [00007F723047F140] exceeds bounds
hzongaro commented 12 months ago

Thanks for checking into that, Annabelle @a7ehuo. Brad @BradleyWood, following that investigation I'm now wondering whether this might be a duplicate of issue #17045, which you've been investigating.

BradleyWood commented 12 months ago

Yes, I think its the same problem.

I see the same kind of snippet.

try {
    for (int ax$16 = 0; ax$16 <= 3; ax$16++) {
        for (int ax$17 = 0; ax$17 <= 3; ax$17++) {
            ax$15[ax$16 * ax$17] += 1;
        }
    }
} catch (Throwable ax$18) {
} finally {
}

I'm guessing the loop limit transformation is off by one. Does changing ax$17 <= 3 to ax$17 < 4 fix it?

a7ehuo commented 12 months ago

Does changing ax$17 <= 3 to ax$17 < 4 fix it?

Yes. The test passes with the change

                        for (int ax$16 = 0; ax$16 <= 3; ax$16++) {
                            for (int ax$17 = 0; ax$17 < 4; ax$17++) { // Pass
                                ax$15[ax$16 * ax$17] += 1;
                            }
hzongaro commented 12 months ago

I'll reassign this to Brad @BradleyWood, as it looks like this is likely a duplicate of #17045

a7ehuo commented 12 months ago

Just post what I have observed on the failure from this test. Hope it helps. Brad will continue the investigation.

I'm guessing the loop limit transformation is off by one

I noticed the difference between ax$17 <= 3 and ax$17 < 4 is adjustMaxValue, which is calculated from incrementJ based on isIndexChildMultiplied.

ax$17 <= 3 case:

ax$17 < 4 case:

isIndexChildMultiplied is different for these two cases from what I observed. It results in the final maxValue for ax$17 as 2 when ax$17 <= 3. I haven't got to trace why isIndexChildMultiplied is different.

a7ehuo commented 12 months ago

isIndexChildMultiplied is different for these two cases from what I observed. It results in the final maxValue for ax$17 as 2 when ax$17 <= 3.

Sorry I realized my previous logging is incorrect. It is the incrementJ that's different for these two cases.

ax$17 <= 3 case:

ax$17 < 4 case:

BradleyWood commented 12 months ago

Those values are determined by the type of compare operator. I guess the logic doesn't hold when the indices are product of two induction variables?

hzongaro commented 9 months ago

@BradleyWood, may I ask you to check whether this problem was fixed by either or both the OMR pull requests eclipse/omr#7237 and eclipse/omr#7238?

BradleyWood commented 9 months ago

@hzongaro Its fixed