eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

jdk22 WrapperHiddenClassTest::testModule segfault in libj9gc #18805

Closed babsingh closed 7 months ago

babsingh commented 9 months ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/1 jdk_lang java/lang/invoke/MethodHandleProxies/WrapperHiddenClassTest.java

16:33:41  STARTED    WrapperHiddenClassTest::testModule '[2] interface java.lang.Runnable'
16:33:41  Unhandled exception
16:33:41  Type=Segmentation error vmState=0x00000000
16:33:41  J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
16:33:41  Handler1=0000FFFF8F8C3314 Handler2=0000FFFF8F827B60 InaccessibleAddress=0000000000000008
16:33:41  R0=0000000000000001 R1=0000000000000000 R2=0000FFFF88033FB0 R3=0000FFFF8DA2FD40
16:33:41  R4=0000FFFF8857C590 R5=0000000000000041 R6=0000FFFF15470860 R7=0000000000000001
16:33:41  R8=FFFFFFFFFFFFFFFF R9=0000000FFFFFFFFF R10=0000008800000000 R11=0000000000000001
16:33:41  R12=0000000000000000 R13=000000000000270F R14=0000000000000002 R15=00000000000000F0
16:33:41  R16=0000FFFF8FAC02E0 R17=0000FFFF8FF2D3C0 R18=0000000000000001 R19=0000FFFF88033FB0
16:33:41  R20=0000FFFF87DBE658 R21=0000FFFF4C8CDEF8 R22=0000FFFF880349B0 R23=0000000000000001
16:33:41  R24=0000FFFF8FE22EF8 R25=0000FFFF880D6848 R26=0000FFFF87E57898 R27=0000FFFF4C8CDEA8
16:33:41  R28=0000FFFF885AE300 R29=0000FFFF4C8CDE20 R30=0000FFFF8F8EC228 R31=0000FFFF4C8CDE20
16:33:41  PC=0000FFFF8DA2F194 SP=0000FFFF4C8CDE20 PSTATE=0000000060001000
16:33:41  V0 2f2f2f2f2f2f2f2f (f: 791621440.000000, d: 2.054695e-81)
16:33:41  V1 7461746f6e6e412f (f: 1852719360.000000, d: 3.999097e+252)
16:33:41  V2 7372617000100072 (f: 1048690.000000, d: 1.285158e+248)
16:33:41  V3 0000008800000000 (f: 0.000000, d: 2.885914e-312)
16:33:41  V4 0000000080000800 (f: 2147485696.000000, d: 1.060999e-314)
16:33:41  V5 0000000000000001 (f: 1.000000, d: 4.940656e-324)
16:33:41  V6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V7 8020080280200802 (f: 2149582848.000000, d: -4.458850e-308)
16:33:41  V8 000003c10000745f (f: 29791.000000, d: 2.039238e-311)
16:33:41  V9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V16 4010040140100401 (f: 1074791424.000000, d: 4.003911e+00)
16:33:41  V17 000000fe00000000 (f: 0.000000, d: 5.389869e-312)
16:33:41  V18 8020080280200800 (f: 2149582848.000000, d: -4.458850e-308)
16:33:41  V19 3f9eb851eb851eb8 (f: 3951369984.000000, d: 3.000000e-02)
16:33:41  V20 3fb1eb851eb851ec (f: 515396064.000000, d: 7.000000e-02)
16:33:41  V21 0000000000000008 (f: 8.000000, d: 3.952525e-323)
16:33:41  V22 3f0000003f800000 (f: 1065353216.000000, d: 3.051759e-05)
16:33:41  V23 3fc999999999999a (f: 2576980480.000000, d: 2.000000e-01)
16:33:41  V24 3fd6666666666666 (f: 1717986944.000000, d: 3.500000e-01)
16:33:41  V25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V26 0000000000000001 (f: 1.000000, d: 4.940656e-324)
16:33:41  V27 000000000000000a (f: 10.000000, d: 4.940656e-323)
16:33:41  V28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
16:33:41  V30 4030000000000000 (f: 0.000000, d: 1.600000e+01)
16:33:41  V31 000000003f400000 (f: 1061158912.000000, d: 5.242822e-315)
16:33:41  Module=/home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/jdkbinary/j2sdk-image/lib/default/libj9gc_full29.so
16:33:41  Module_base_address=0000FFFF8D9F0000
16:33:41  Target=2_90_20240122_5 (Linux 4.18.0-348.7.1.el8_5.aarch64)
16:33:41  CPU=aarch64 (8 logical CPUs) (0x1dc1b0000 RAM)
16:33:41  ----------- Stack Backtrace -----------
16:33:41  stringHashEqualFn+0x170 (0x0000FFFF8DA2F194 [libj9gc_full29.so+0x3f194])
16:33:41  moduleNameHashEqualFn+0xa4 (0x0000FFFF8F8EC228 [libj9vm29.so+0x5c228])
16:33:41  hashTableFind+0x130 (0x0000FFFF8FDEDD00 [libjvm.so+0x2dd00])
16:33:41  isModuleDefined+0x34 (0x0000FFFF8FDDEFC4 [libjvm.so+0x1efc4])
16:33:41  JVM_DefineModule+0xf4c (0x0000FFFF8FDD242C [libjvm.so+0x1242c])
16:33:41   (0x0000FFFF4EB5D98C [<unknown>+0x0])
16:33:41  ---------------------------------------
16:33:41  JVMDUMP039I Processing dump event "gpf", detail "" at 2024/01/22 16:33:36 - please wait.
16:33:41  JVMDUMP032I JVM requested System dump using '/home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/aqa-tests/TKG/output_17059590457223/jdk_lang_1/work/scratch/0/core.20240122.163336.3187330.0001.dmp' in response to an event
16:33:41  JVMDUMP010I System dump written to /home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/aqa-tests/TKG/output_17059590457223/jdk_lang_1/work/scratch/0/core.20240122.163336.3187330.0001.dmp
16:33:41  JVMDUMP032I JVM requested Java dump using '/home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/aqa-tests/TKG/output_17059590457223/jdk_lang_1/work/scratch/0/javacore.20240122.163336.3187330.0002.txt' in response to an event
16:33:41  JVMDUMP010I Java dump written to /home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/aqa-tests/TKG/output_17059590457223/jdk_lang_1/work/scratch/0/javacore.20240122.163336.3187330.0002.txt
16:33:41  JVMDUMP032I JVM requested Snap dump using '/home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/aqa-tests/TKG/output_17059590457223/jdk_lang_1/work/scratch/0/Snap.20240122.163336.3187330.0003.trc' in response to an event
16:33:41  JVMDUMP010I Snap dump written to /home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/aqa-tests/TKG/output_17059590457223/jdk_lang_1/work/scratch/0/Snap.20240122.163336.3187330.0003.trc
16:33:41  JVMDUMP032I JVM requested JIT dump using '/home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/aqa-tests/TKG/output_17059590457223/jdk_lang_1/work/scratch/0/jitdump.20240122.163336.3187330.0004.dmp' in response to an event
16:33:41  JVMDUMP051I JIT dump occurred in 'MainThread' thread 0x0000FFFF885AE300
16:33:41  JVMDUMP053I JIT dump is recompiling java/lang/Module.<init>(Ljava/lang/ModuleLayer;Ljava/lang/ClassLoader;Ljava/lang/module/ModuleDescriptor;Ljava/net/URI;)V
16:33:41  JVMDUMP010I JIT dump written to /home/jenkins/workspace/Test_openjdk22_j9_sanity.openjdk_aarch64_linux_Personal_testList_1/aqa-tests/TKG/output_17059590457223/jdk_lang_1/work/scratch/0/jitdump.20240122.163336.3187330.0004.dmp
16:33:41  JVMDUMP013I Processed dump event "gpf", detail "".
dmitripivkine commented 9 months ago

We observed this kind of problem before. Looks like NULL is sent instead of the string. Is there missed NULL check? Or something is NULL unexpectedly?

babsingh commented 9 months ago

Looks like NULL is sent instead of the string. Is there missed NULL check? Or something is NULL unexpectedly?

At the point of the crash, the object does not seem to be NULL.

// Native stack at the point of crash
#13 0x00007f08322c99ba in stringHashEqualFn (leftKey=<optimized out>, rightKey=0x7f07d837b7a8, userData=0x7f082c018420) at /root/openj9-openjdk-jdk22/openj9/runtime/gc_base/StringTable.cpp:363
#14 0x00007f08336af796 in moduleNameHashEqualFn (tableNode=<optimized out>, queryNode=<optimized out>, userData=0x7f082c018420) at /root/openj9-openjdk-jdk22/openj9/runtime/vm/ModularityHashTables.c:67
#15 0x00007f08338f2c48 in hashTableFindNodeSpaceOpt (head=<optimized out>, entry=<optimized out>, table=<optimized out>) at /root/openj9-openjdk-jdk22/omr/util/hashtable/hashtable.c:468
#16 hashTableFind (table=0x7f082c0b5b60, entry=0x7f07d837b7e8) at /root/openj9-openjdk-jdk22/omr/util/hashtable/hashtable.c:452
#17 0x00007f08338e35ca in isModuleDefined (currentThread=<optimized out>, fromModule=<optimized out>) at /root/openj9-openjdk-jdk22/openj9/runtime/util/modularityHelper.c:50
#18 0x00007f08338d692d in addModuleDefinition (version=0x0, numPackages=1, packages=0x7f07ac04dee0, fromModule=<optimized out>, currentThread=<optimized out>) at /root/openj9-openjdk-jdk22/openj9/runtime/j9vm/java11vmi.c:423
#19 JVM_DefineModule (env=<optimized out>, module=0x2932d0, isOpen=<optimized out>, version=0x0, location=<optimized out>, packageArray=<optimized out>) at /root/openj9-openjdk-jdk22/openj9/runtime/j9vm/java11vmi.c:828

(gdb) f 13
#13 0x00007f08322c99ba in stringHashEqualFn (leftKey=<optimized out>, rightKey=0x7f07d837b7a8, userData=0x7f082c018420) at /root/openj9-openjdk-jdk22/openj9/runtime/gc_base/StringTable.cpp:363
363     leftLength = J9VMJAVALANGSTRING_LENGTH_VM(javaVM, left_s);

(gdb) p left_s
$1 = (j9object_t) 0xd00ddf18

(gdb) l
358
359     stu8Ptr = *((UDATA*) rightKey); /* rightKey may be a pointer to a low-tagged pointer to a struct */
360
361     /* Get at the String information  */
362     left_s = *(j9object_t *)leftKey;
363     leftLength = J9VMJAVALANGSTRING_LENGTH_VM(javaVM, left_s);
364     left_p = J9VMJAVALANGSTRING_VALUE_VM(javaVM, left_s);
365     leftCompressed = IS_STRING_COMPRESSED_VM(javaVM, left_s);
366
367     if (stu8Ptr & TYPE_UTF8) {

> !j9object 0xd00ddf18
!J9IndexableObject 0x00000000D00DDF18 {
    struct J9Class* clazz = !j9arrayclass 0x76200   // [L
    Object flags = 0x00000000;
    U_32 size = 0x00000005;
    [0] = !fj9object 0xd00010d8 = !j9object 0xd00010d8
    [1] = !fj9object 0xd00010d8 = !j9object 0xd00010d8
    [2] = !fj9object 0xd00010d8 = !j9object 0xd00010d8
    [3] = !fj9object 0xd0001768 = !j9object 0xd0001768
    [4] = !fj9object 0xd00010d8 = !j9object 0xd00010d8

// #19 JVM_DefineModule (env=<optimized out>, module=0x2932d0, isOpen=<optimized out>, version=0x0, location=<optimized out>, packageArray=<optimized out>) at /root/openj9-openjdk-jdk22/openj9/runtime/j9vm/java11vmi.c:828
// Tracking module=0x2932d0

> !j9x 0x2932d0
0x002932D0 :  00000000d0282920 00000000000b7c78 [  )(.....x|...... ]

> !j9object 0x00000000d0282920
!J9Object 0x00000000D0282920 {
    struct J9Class* clazz = !j9class 0x73700 // java/lang/Module
    Object flags = 0x00000000;
    I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
    Ljava/lang/ModuleLayer; layer = !fj9object 0xd014f4a0 (offset = 12) (java/lang/Module)
    Ljava/lang/String; name = !fj9object 0xd0281df8 (offset = 16) (java/lang/Module)
    Ljava/lang/ClassLoader; loader = !fj9object 0x0 (offset = 20) (java/lang/Module)
    Ljava/lang/module/ModuleDescriptor; descriptor = !fj9object 0xd0282470 (offset = 24) (java/lang/Module)
    Z enableNativeAccess = 0x00000000 (offset = 44) (java/lang/Module)
    Ljava/util/Set; reads = !fj9object 0x0 (offset = 28) (java/lang/Module)
    Ljava/util/Map; openPackages = !fj9object 0x0 (offset = 32) (java/lang/Module)
    Ljava/util/Map; exportedPackages = !fj9object 0x0 (offset = 36) (java/lang/Module)
    Ljava/lang/Class; moduleInfoClass = !fj9object 0x0 (offset = 40) (java/lang/Module)
    J modulePointer = 0x00007F082C74AAC0 (offset = 4) (java/lang/Module) <hidden>
}

> !fj9object 0xd0281df8
J9VMJavaLangString at 0x00000000D0281DF8 {
struct J9Class* clazz = !j9class 0x50800 // java/lang/String
Object flags = 0x00000000;
[B value = !fj9object 0xd0281e10 (offset = 0) (java/lang/String)
B coder = 0x00000000 (offset = 4) (java/lang/String)
I hash = 0xC65C1FE0 (offset = 8) (java/lang/String)
Z hashIsZero = 0x00000000 (offset = 12) (java/lang/String)
"jdk.MHProxy6"
}

> !j9module 0x00007F082C74AAC0
J9Module at 0x7f082c74aac0 {
  Fields for J9Module:
    0x0: struct J9Object* moduleName = !j9object 0x00000000D0281DF8 // java/lang/String
    0x8: struct J9Object* moduleObject = !j9object 0x00000000D0282920 // java/lang/Module
    0x10: struct J9Object* version = !j9object 0x0000000000000000
    0x18: struct J9ClassLoader* classLoader = !j9classloader 0x00007F082C091358
    0x20: struct J9HashTable* readAccessHashTable = !j9hashtable 0x00007F07AC198E00
    0x28: struct J9HashTable* removeAccessHashTable = !j9hashtable 0x0000000000000000
    0x30: struct J9HashTable* removeExportsHashTable = !j9hashtable 0x0000000000000000
    0x38: U32 isLoose = 0x00000000 (0)
    0x3c: U32 isOpen = 0x00000000 (0)
}
Module name: jdk.MHProxy6
dmitripivkine commented 9 months ago

I added comp:vm label to keep this issue on VM radar. Even before looking to the core for !j9module 0x00007F082C74AAC0 I noticed "version" object is NULL. Is it acceptable?

dmitripivkine commented 9 months ago

Looking to original core: Crash occur at NULL passed obviously: InaccessibleAddress=0000000000000008

> !j9object 0x0000FFFF87E57898
J9VMJavaLangString at 0x0000FFFF87E57898 {
struct J9Class* clazz = !j9class 0xFFFF8828C200 // java/lang/String
Object flags = 0x00000000;
[B value = !fj9object 0xffff87e578b8 (offset = 0) (java/lang/String)
B coder = 0x00000000 (offset = 8) (java/lang/String)
I hash = 0xC65C1FE0 (offset = 12) (java/lang/String)
Z hashIsZero = 0x00000000 (offset = 16) (java/lang/String)
"jdk.MHProxy6"
}
> !j9module 0xffff8857c740
J9Module at 0xffff8857c740 {
  Fields for J9Module:
    0x0: struct J9Object* moduleName = !j9object 0x0000FFFF87E57898 // java/lang/String
    0x8: struct J9Object* moduleObject = !j9object 0x0000FFFF87E58BA8 // java/lang/Module
    0x10: struct J9Object* version = !j9object 0x0000000000000000
    0x18: struct J9ClassLoader* classLoader = !j9classloader 0x0000FFFF880D6848
    0x20: struct J9HashTable* readAccessHashTable = !j9hashtable 0x0000FFFEE42B8090
    0x28: struct J9HashTable* removeAccessHashTable = !j9hashtable 0x0000000000000000
    0x30: struct J9HashTable* removeExportsHashTable = !j9hashtable 0x0000000000000000
    0x38: U32 isLoose = 0x00000000 (0)
    0x3c: U32 isOpen = 0x00000000 (0)
}
Module name: jdk.MHProxy6
dmitripivkine commented 9 months ago

BTW many DDR modules related commands don't work with this core, for example !findmodules all, !findallmodules, !findmodulebyname jdk.MHProxy6. I don't know is there the problem with modules structure in this core or DDR code itself does not work.

babsingh commented 9 months ago

There are two WrapperHiddenClassTest failures: this issue and https://github.com/eclipse-openj9/openj9/issues/18804. The failures only happen with gencon and optthruput GC policies. WrapperHiddenClassTest consistently passes with metronome and balanced GC policies.

babsingh commented 8 months ago

@dmitripivkine As requested, the zip contains a system core, java core and JDK22 created on Linux x64: https://ibm.box.com/shared/static/f8e1083e2laz5rwp9r33nua52fx0atv3.zip.

dmitripivkine commented 8 months ago

@babsingh Thank you for core/jdk, it was very useful!

I have traced failure back to !j9module 0x00007FD62425E7F8 (modularityPool = !j9pool 0x00007FD624090D90 element [34] = 0x00007FD62425E7F8):

> !j9module 0x00007FD62425E7F8
J9Module at 0x7fd62425e7f8 {
  Fields for J9Module:
    0x0: struct J9Object* moduleName = !j9object 0x00000007FFE387B0 // java/lang/ref/WeakReference <--- bad
    0x8: struct J9Object* moduleObject = !j9object 0x00000007FFE3B058 // <FAULT> <--- bad
    0x10: struct J9Object* version = !j9object 0x0000000000000000
    0x18: struct J9ClassLoader* classLoader = !j9classloader 0x00007FD62408FD88 <--- good
    0x20: struct J9HashTable* readAccessHashTable = !j9hashtable 0x00007FD5C8012D70 <--- good
    0x28: struct J9HashTable* removeAccessHashTable = !j9hashtable 0x00007FD5C81B6D80 <--- good
    0x30: struct J9HashTable* removeExportsHashTable = !j9hashtable 0x0000000000000000
    0x38: U32 isLoose = 0x00000000 (0)
    0x3c: U32 isOpen = 0x00000000 (0)
}

moduleName and moduleObject of this module have bad object pointers in the Nursery. The actual crash occur an attempt to get module name from !j9object 0x00000007FFE387B0 an attempt to treat this object as String. But this happens to be WeakReference object, so attempt to read byte array address returns NULL. If you are interesting in low debugging details please ask me. So, the next question is how we got bad (stale?) object pointers in the module. Module fixup is a part of the Scavenge, so GC can not discover this module somehow.

babsingh commented 8 months ago

@dmitripivkine I discussed the issue with @JasonFengJ9 and @hangshao0. Here is the summary:

With the re-implementation of MethodHandleProxies.asInterfaceInstance in https://github.com/ibmruntimes/openj9-openjdk-jdk22/commit/3193b60dbd9b70d5b79503776006fcaa28031bbf (JDK22), Module objects are introduced, which are only associated to hidden classes. Hidden/anon classes can be unloaded independently, and they don't require the GC to clear the associated ClassLoader in order to be unloaded. As the hidden class is unloaded, there are no more references that point to the Module object, and the GC clears it without freeing the native J9Module structure. Currently, we assume that the Module objects will live until the ClassLoader is cleared; so, the native J9Module structures aren't cleared until the ClassLoader is cleared. With the introduction of short-lived Module objects, the GC will need to free the native J9Module structure whenever a Module object is cleared. The below code can be used to free the native J9Module structure:

IF A MODULE OBJECT IS BEING CLEARED:
    J9Module *j9mod = J9OBJECT_ADDRESS_LOAD(currentThread, moduleObj, vm->modulePointerOffset);
    if (NULL != j9mod) {
        vmFuncs->freeJ9Module(vm, j9mod);
    }
babsingh commented 8 months ago

Freeing the native J9Module structure should remove the bad entries from the various hash tables.

But, freeJ9Module(...) doesn't remove the J9Module entry from J9Module->classLoader->moduleHashTable.

So, we will need a more custom callback for the GC, where the J9Module entry is removed from all the hash tables.

dmitripivkine commented 8 months ago

I don't think this is a good idea to add such code at GC side. Instead why not to add this code at VM side to the function subscribed to J9HOOK_VM_ANON_CLASSES_UNLOAD? VM code has subscription for flushForClassesUnload() already.

dmitripivkine commented 8 months ago

But, freeJ9Module(...) doesn't remove the J9Module entry from J9Module->classLoader->moduleHashTable

Is classLoader here is special Anonymous Classloader (can not be unloaded)? Anonymous classes have special implementation. They are always listed in this special class loader but pointed to parent class loader (which is different one). Any way I don't see why VM can not do it itself. Is there a reason to switch to GC side? Class unloading process is equipped by various hooks VM is subscribed already.

dmitripivkine commented 8 months ago

FYI: Currently during class unloading we are triggering hooks:

babsingh commented 8 months ago

Is classLoader here is special Anonymous Classloader (can not be unloaded)?

No. Hidden/anon classes can be independently unloaded; it doesn't depend on the associated class loader to be cleared/unloaded.

I don't see why VM can not do it itself.

We can't link this step with class unloading because multiple hidden classes can be associated to the module. VM won't be able to determine the liveness of the module. For the VM to handle it, we will need a new hook which will be triggered once the module is cleared by the GC.

dmitripivkine commented 8 months ago

Is classLoader here is special Anonymous Classloader (can not be unloaded)?

No. Hidden/anon classes can be independently unloaded; it doesn't depend on the associated class loader to be cleared/unloaded.

The question was which class loader is referenced from module for Anonymous class: special Anonymous Classloader or Anonymous class logical parent class loader ?

dmitripivkine commented 8 months ago

VM won't be able to determine the liveness of the module.

What is a liveness condition for the module?

babsingh commented 8 months ago

The question was which class loader is referenced from module for Anonymous class: special Anonymous Classloader or Anonymous class logical parent class loader ?

In JDK22, only hidden classes are used; anon classes are not used. for a hidden class, the class loader is defined by the Java API; it can either be the lookup class's classloader or it can be the bootstrap classloader. MethodHandles.ClassDefiner has more details.

What is a liveness condition for the module?

When the GC clears the java.lang.Module object.

dmitripivkine commented 8 months ago

When the GC clears the java.lang.Module object.

As we see GC explicitly keeps java.lang.Module objects alive by searching them in classLoader->moduleHashTable. So, what is happening exactly?

tajila commented 8 months ago

I dont think there is anything special about this particular use case. A new module is created for the hidden class. The class will disappear when it is no longer referenced, but the module should remain alive until the classloader dies.

Ive verified that the j9module in question is in the classLoader->moduleHashTable. Ive also verified that it is always scanned during GC.

At the start of the first GC the module->moduleName is valid (Note: the name is tenured). However, on every susbsequent GC, module->moduleName changes. At the time of the crash it is pointing to a weak reference. The original String that module->moduleName pointed to at the time of the first GC is still valid, even at the time of the crash. AFAIK moduleName is never written to after module creation. So this would impl some kind of corruption.

babsingh commented 8 months ago

My observations are identical to @tajila's observations in https://github.com/eclipse-openj9/openj9/issues/18805#issuecomment-1926100759.

J9Module's life cycle:

The failure happens only when the below code is run, WrapperHiddenClassTest.java#L208-L210:

        if (!ForceGC.wait(() -> implClassRef.refersTo(null))) {
            // In ForceGC.wait,
            //   - System.gc() is invoked in a loop.
            //   - Module objects, which are out of scope, end up having invalid object references in J9Module.
            fail("MHP impl class cannot be cleared by GC");
        }
babsingh commented 8 months ago

[UPDATE] Tried the GC options provided by @dmitripivkine. The test passes with -Xnocompactgc -Xgcpolicy:[gencon|optthruput]. The segfault also stops with -Xnoclassgc, which prevents the hidden class from unloading. Compaction removes unmarked objects. It seems like the hidden class and the associated Module object are being collected before the GC gets to scan and mark modularity objects through J9ClassLoader->moduleHashTable.

Core files with -Xmx1g to ease reading the heap: https://ibm.box.com/shared/static/uztriv101q5wi070wlnu2mplzg4hx26b.zip

dmitripivkine commented 8 months ago

Looking to the latest core (with optthruput) am not sure it is related, probably not. However: From tracepoints - previous GC !j9class 0x499800 was unloaded:

17:05:45.835791061 *0x26CD00 j9mm.60              Event       Class unloading start
17:05:45.835791178 *0x0 j9mm.60              Event       Class unloading start
17:05:45.835791365 *0x26CD00 j9mm.281             Entry      >MM_ParallelGlobalGC_unloadDeadClassLoaders
17:05:45.835795028  0x26CD00 j9jvmti.352          Entry      >HookClassUnload class=0x499800
17:05:45.835795661  0x26CD00 j9jvmti.353          Exit       <HookClassUnload
17:05:45.835813900  0x26CD00 j9bcverify(j9vm).44  Entry      >unlinkClassLoadingConstraints
17:05:45.835815685  0x26CD00 j9bcverify(j9vm).45  Exit       <unlinkClassLoadingConstraints
17:05:45.835816187  0x26CD00 j9vm.109             Entry      >flushForClassesUnload
17:05:45.835816829  0x26CD00 j9vm.110             Exit       <flushForClassesUnload
17:05:45.836070188  0x26CD00 j9vm.109             Entry      >flushForClassesUnload
17:05:45.836071053  0x26CD00 j9vm.110             Exit       <flushForClassesUnload
17:05:45.836080328  0x26CD00 j9mm.282             Exit       <MM_ParallelGlobalGC_unloadDeadClassLoaders
17:05:45.836080473  0x26CD00 j9mm.94              Event       Class unloading end: classloadersunloaded=0 classesunloaded=1
17:05:45.836080845 *0x0 j9mm.94              Event       Class unloading end: classloadersunloaded=0 classesunloaded=1

this class was Anonymous (or Hidden) obviously. However in the core I can see class !j9class 0x499800, and it is not unloaded. The only possibility to get this situation is new class was loaded to the same place. It opens possibility to have mismatch between unloaded and newly loaded classes potentially.

And BTW this time bad pointer from corrupted !j9module 0x7f487c5ed798 is java/lang/invoke/MemberName

> !j9module 0x7f487c5ed798
J9Module at 0x7f487c5ed798 {
  Fields for J9Module:
    0x0: struct J9Object* moduleName = !j9object 0x00000000C00E9128 // java/lang/invoke/MemberName
    0x8: struct J9Object* moduleObject = !j9object 0x00000000C00E9410 // [Ljava/lang/Class;
    0x10: struct J9Object* version = !j9object 0x0000000000000000
    0x18: struct J9ClassLoader* classLoader = !j9classloader 0x00007F487C092518
    0x20: struct J9HashTable* readAccessHashTable = !j9hashtable 0x00007F47F419D6F0
    0x28: struct J9HashTable* removeAccessHashTable = !j9hashtable 0x00007F47F419F480
    0x30: struct J9HashTable* removeExportsHashTable = !j9hashtable 0x0000000000000000
    0x38: U32 isLoose = 0x00000000 (0)
    0x3c: U32 isOpen = 0x00000000 (0)
}
tajila commented 8 months ago

I monitored the object in question on every GC. The object seems to be scanned in every GC. The only place where I see the field being changed is below.

#0  MM_CompactSchemeFixupRoots::doSlot (this=0x7fffd5fea960, slot=0x7ffff0b23b30) at ./openj9/runtime/gc_glue_java/CompactSchemeFixupRoots.hpp:49
#1  0x00007ffff5a567ae in MM_RootScanner::scanModularityObjects (this=0x7fffd5fea960, classLoader=0x7ffff00b7638) at ./openj9/runtime/gc_base/RootScanner.cpp:99
#2  0x00007ffff5d3c59b in MM_CompactSchemeFixupRoots::doClassLoader (this=0x7fffd5fea960, classLoader=0x7ffff00b7638) at ./openj9/runtime/gc_glue_java/CompactSchemeFixupRoots.hpp:68
#3  0x00007ffff5a57f2d in MM_RootScanner::scanClassLoaders (this=0x7fffd5fea960, env=0x7fff60001d18) at ./openj9/runtime/gc_base/RootScanner.cpp:472
#4  0x00007ffff5a5c7e4 in MM_RootScanner::scanAllSlots (this=0x7fffd5fea960, env=0x7fff60001d18) at ./openj9/runtime/gc_base/RootScanner.cpp:1098
#5  0x00007ffff5d3bf27 in MM_CompactDelegate::fixupRoots (this=0x7ffff0068ff8, env=0x7fff60001d18, compactScheme=0x7ffff0068f80) at ./openj9/runtime/gc_glue_java/CompactDelegate.cpp:125
#6  0x00007ffff5c9cdfe in MM_CompactScheme::compact (this=0x7ffff0068f80, envBase=0x7fff60001d18, rebuildMarkBits=false, aggressive=false) at ./omr/gc/base/standard/CompactScheme.cpp:610
#7  0x00007ffff5cb0a74 in MM_ParallelCompactTask::run (this=0x7fffd5ee82b0, env=0x7fff60001d18) at ./omr/gc/base/standard/ParallelCompactTask.cpp:45
#8  0x00007ffff5c58c49 in MM_ParallelDispatcher::workerEntryPoint (this=0x7ffff0066b60, env=0x7fff60001d18) at ./omr/gc/base/ParallelDispatcher.cpp:187
#9  0x00007ffff5c5873c in dispatcher_thread_proc2 (portLib=0x7ffff7cbb100 <j9portLibrary>, info=0x7ffff7d3b660) at ./omr/gc/base/ParallelDispatcher.cpp:93
#10 0x00007ffff76c8655 in omrsig_protect (portLibrary=0x7ffff7cbb100 <j9portLibrary>, fn=0x7ffff5c5850b <dispatcher_thread_proc2(OMRPortLibrary*, void*)>, fn_arg=0x7ffff7d3b660, handler=0x7ffff78a1b1d <structuredSignalHandlerVM>, handler_arg=0x7ffff0019760, flags=506, result=0x7fffd5fead00)
    at ./omr/port/unix/omrsignal.c:425
#11 0x00007ffff5c58893 in dispatcher_thread_proc (info=0x7ffff7d3b660) at ./omr/gc/base/ParallelDispatcher.cpp:131
#12 0x00007ffff7824c32 in thread_wrapper (arg=0x7ffff0b348b0) at ./omr/thread/common/omrthread.c:1724
#13 0x00007ffff7deeac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007ffff7e80850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

On one of the GCs after MM_CompactSchemeFixupRoots::doSlot is called the new value is the wrong object.

I noticed that _compactScheme->getForwardingPtr(*slot); returns the wrong object.

tajila commented 8 months ago

At this point I think the issue is in the GC. The VM doesnt touch the field at all. And the module is scanned every GC. @dmitripivkine

tajila commented 8 months ago

@dmitripivkine should the old and new value be valid when the following is called:

*slot = _compactScheme->getForwardingPtr(*slot);
dmitripivkine commented 8 months ago

@dmitripivkine should the old and new value be valid when the following is called:

*slot = _compactScheme->getForwardingPtr(*slot);

No. The only new (destination) object is valid after compact. The source location can be destroyed during compaction later on. Forwarding pointer (to new location) is stored in Compacted Mark Map and does not rely on heap content.

dmitripivkine commented 8 months ago

At this point I think the issue is in the GC. The VM doesnt touch the field at all. And the module is scanned every GC. @dmitripivkine

Did you check that every single GC has discovered module object (even without updating of the pointer)? The theory was the object became stale (lost for one GC and re-discovered later) way earlier than pointer is damaged actually by compact. I guess one easy way to check it run test with -Xcheck:gc. It (probably) can not discover actual problem with module pointer but give you print at the beginning and at the end of every GC. If you combine this mode with print at the place module object is discovered and marked you can find out is it discovered by every single GC.

tajila commented 8 months ago

Did you check that every single GC has discovered module object (even without updating of the pointer)?

Yes I verified that every GC after the module is created, it is scanned.

dmitripivkine commented 8 months ago

The problem is: GC scans System and Application classloaders separately as a permanent roots. Scanning of modules for these classloaders has been missed. I will provide fix shortly.

pshipton commented 7 months ago

We can unexclude the test and close this now that https://github.com/eclipse-openj9/openj9/pull/19020 is merged?

babsingh commented 7 months ago

I had locally verified that https://github.com/eclipse-openj9/openj9/pull/19020 fixes this issue. I have opened https://github.com/adoptium/aqa-tests/pull/5116 to re-enable the test.