eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 720 forks source link

Unhandled exception Type=Segmentation error vmState=0x00020011 #10541

Closed pwagland closed 4 years ago

pwagland commented 4 years ago

Java -version output

openjdk version "11.0.7" 2020-04-14 OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10) Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.20.0, JRE 11 Linux amd64-64-Bit Compressed References 20200416_574 (JIT enabled, AOT enabled) OpenJ9 - 05fa2d361 OMR - d4365f371 JCL - 838028fc9d based on jdk-11.0.7+10)

Summary of problem

We are in the process of migrating from using Java 8 to using Java 11 to compile our code. When compiling our code using Java 11, it fails, seemingly reproducibly in the same place.

Diagnostic files

08-Sep-2020 14:01:34 | Unhandled exception
08-Sep-2020 14:01:34 | Type=Segmentation error vmState=0x00020011
08-Sep-2020 14:01:34 | J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000080
08-Sep-2020 14:01:34 | Handler1=00007FBD89F32ED0 Handler2=00007FBD898249F0 InaccessibleAddress=0000000000000000
08-Sep-2020 14:01:34 | RDI=0000000000000001 RSI=007FFFFFFFB00000 RAX=0000000000000000 RBX=0000000001A99900
08-Sep-2020 14:01:34 | RCX=0000000000000000 RDX=00007FBD81031000 R8=00007FBD89831FC0 R9=000000000000000E
08-Sep-2020 14:01:34 | R10=00007FBD632BA120 R11=000000000000005D R12=000000000000004A R13=00007FBD8404A1A0
08-Sep-2020 14:01:34 | R14=00007FBD84023CB0 R15=0000000000000001
08-Sep-2020 14:01:34 | RIP=00007FBD8833D08A GS=0000 FS=0000 RSP=00007FBD632BA210
08-Sep-2020 14:01:34 | EFlags=0000000000010246 CS=0033 RBP=0000000001A99E00 ERR=0000000000000000
08-Sep-2020 14:01:34 | TRAPNO=000000000000000D OLDMASK=0000000000000000 CR2=00007F39A79E6100
08-Sep-2020 14:01:34 | xmm0 0000003000000020 (f: 32.000000, d: 1.018558e-312)
08-Sep-2020 14:01:34 | xmm1 43e0000000000000 (f: 0.000000, d: 9.223372e+18)
08-Sep-2020 14:01:34 | xmm2 408f400000000000 (f: 0.000000, d: 1.000000e+03)
08-Sep-2020 14:01:34 | xmm3 2f6176616a5f6575 (f: 1784636800.000000, d: 1.840923e-80)
08-Sep-2020 14:01:34 | xmm4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
08-Sep-2020 14:01:34 | xmm5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
08-Sep-2020 14:01:34 | xmm6 00007fbd8862c030 (f: 2288173056.000000, d: 6.939251e-310)
08-Sep-2020 14:01:34 | xmm7 000000000000000e (f: 14.000000, d: 6.916919e-323)
08-Sep-2020 14:01:34 | xmm8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
08-Sep-2020 14:01:34 | xmm9 0000000000007cfe (f: 31998.000000, d: 1.580911e-319)
08-Sep-2020 14:01:34 | xmm10 0000000000007cfd (f: 31997.000000, d: 1.580862e-319)
08-Sep-2020 14:01:34 | xmm11 00000000f4b97c38 (f: 4105796608.000000, d: 2.028533e-314)
08-Sep-2020 14:01:34 | xmm12 00007fbd8862c0b0 (f: 2288173312.000000, d: 6.939251e-310)
08-Sep-2020 14:01:34 | xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
08-Sep-2020 14:01:34 | xmm14 00007fbd8862c058 (f: 2288173056.000000, d: 6.939251e-310)
08-Sep-2020 14:01:34 | xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
08-Sep-2020 14:01:34 | Module=/opt/java/openjdk11/lib/compressedrefs/libj9gc29.so
08-Sep-2020 14:01:34 | Module_base_address=00007FBD8815B000
08-Sep-2020 14:01:34 | Target=2_90_20200416_574 (Linux 5.4.0-42-generic)
08-Sep-2020 14:01:34 | CPU=amd64 (8 logical CPUs) (0x7d5a69000 RAM)
08-Sep-2020 14:01:34 | ----------- Stack Backtrace -----------
08-Sep-2020 14:01:34 | (0x00007FBD8833D08A [libj9gc29.so+0x1e208a])
08-Sep-2020 14:01:34 | (0x00007FBD8833D362 [libj9gc29.so+0x1e2362])
08-Sep-2020 14:01:34 | (0x00007FBD882BE535 [libj9gc29.so+0x163535])
08-Sep-2020 14:01:34 | (0x00007FBD882BE6EE [libj9gc29.so+0x1636ee])
08-Sep-2020 14:01:34 | (0x00007FBD8830EAFB [libj9gc29.so+0x1b3afb])
08-Sep-2020 14:01:34 | (0x00007FBD883250D7 [libj9gc29.so+0x1ca0d7])
08-Sep-2020 14:01:34 | (0x00007FBD8828EC17 [libj9gc29.so+0x133c17])
08-Sep-2020 14:01:34 | (0x00007FBD8832B8D6 [libj9gc29.so+0x1d08d6])
08-Sep-2020 14:01:34 | (0x00007FBD8832D927 [libj9gc29.so+0x1d2927])
08-Sep-2020 14:01:34 | (0x00007FBD8832DF38 [libj9gc29.so+0x1d2f38])
08-Sep-2020 14:01:34 | (0x00007FBD882CBA59 [libj9gc29.so+0x170a59])
08-Sep-2020 14:01:34 | (0x00007FBD881A10CF [libj9gc29.so+0x460cf])
08-Sep-2020 14:01:34 | (0x00007FBD83B4296A [libj9jit29.so+0x90396a])
08-Sep-2020 14:01:34 | (0x00007FBD83B542A1 [libj9jit29.so+0x9152a1])
08-Sep-2020 14:01:34 | ---------------------------------------
08-Sep-2020 14:01:34 | JVMDUMP039I Processing dump event "gpf", detail "" at 2020/09/08 14:01:34 - please wait.
08-Sep-2020 14:01:34 | JVMDUMP032I JVM requested System dump using '/var/atlassian/bamboo-agent/xml-data/build-dir/RW-BLD8732-BLD/code/core.20200908.140134.111686.0001.dmp' in response to an event
08-Sep-2020 14:01:39 | JVMDUMP010I System dump written to /var/atlassian/bamboo-agent/xml-data/build-dir/RW-BLD8732-BLD/code/core.20200908.140134.111686.0001.dmp
08-Sep-2020 14:01:39 | JVMDUMP032I JVM requested Java dump using '/var/atlassian/bamboo-agent/xml-data/build-dir/RW-BLD8732-BLD/code/javacore.20200908.140134.111686.0002.txt' in response to an event
08-Sep-2020 14:01:39 | JVMDUMP010I Java dump written to /var/atlassian/bamboo-agent/xml-data/build-dir/RW-BLD8732-BLD/code/javacore.20200908.140134.111686.0002.txt
08-Sep-2020 14:01:39 | JVMDUMP032I JVM requested Snap dump using '/var/atlassian/bamboo-agent/xml-data/build-dir/RW-BLD8732-BLD/code/Snap.20200908.140134.111686.0003.trc' in response to an event
08-Sep-2020 14:01:39 | JVMDUMP010I Snap dump written to /var/atlassian/bamboo-agent/xml-data/build-dir/RW-BLD8732-BLD/code/Snap.20200908.140134.111686.0003.trc
08-Sep-2020 14:01:39 | JVMDUMP007I JVM Requesting JIT dump using '/var/atlassian/bamboo-agent/xml-data/build-dir/RW-BLD8732-BLD/code/jitdump.20200908.140134.111686.0004.dmp'
08-Sep-2020 14:01:39 | JVMDUMP010I JIT dump written to /var/atlassian/bamboo-agent/xml-data/build-dir/RW-BLD8732-BLD/code/jitdump.20200908.140134.111686.0004.dmp
08-Sep-2020 14:01:39 | JVMDUMP013I Processed dump event "gpf", detail "".
08-Sep-2020 14:01:39 |  
08-Sep-2020 14:01:39 | Error: Unexpected error: 255

Our other crash looks almost identical except for a few registrers, and the last two lines of the Stack Backtrace are:

08-Sep-2020 14:01:34 | (0x00007FBD83B4296A [libj9jit29.so+0x90396a])
08-Sep-2020 14:01:34 | (0x00007FBD83B542A1 [libj9jit29.so+0x9152a1])

Sadly, since this is a build server, the generated dump files are being automatically cleaned up. I will try to see if our build plans can be modified to make these available.

OutOfMemoryError: Java Heap Space

This doesn't appear to be related to OutOfMemory, however it might be, how can we tell?

dmitripivkine commented 4 years ago

We need system core dump to investigate and will be happy to do this.

However meanwhile this failure reminds me known problem with Class Loading Verification problem: JVM detects incompatible class and left it partially loaded with init state "failed" and GC has crashed because class object pointer left NULL. (I know this is not adequate reaction of JVM but problem really is in class libraries).

@tajila Would you please point to the related issue? Do you have any suggestions how to find incompatible class if it is a case?

pwagland commented 4 years ago

Hi @dmitripivkine,

I am still looking into this, sadly (of course!) it only fails on our production server that I don't have access to, which means modifying the build to not throw everything away, get that promoted to prod… you know the drill ;-)

Hopefully I should be able to get this sorted out in the next day or so.

pwagland commented 4 years ago

@dmitripivkine, so I have a dump file, however this has embedded passwords (pass in via the environment) such that I am uncomfortable to upload this tmp fie, definitely publicly. Is there any easy way to "sanitise" the tmp file? And, since this file is 1.8G, how can I PM you to organise.a way to transfer this? And/or how can I run the analysis that you need in order to get the information that you want?

dmitripivkine commented 4 years ago

I can see two ways how we can handle this problem.

1) We can speculate this is a problem with partially initialized j9class because of verification error. I discussed this with @tajila and he mentioned if it is a case we should observe exceptions in Snap traces. If you have .trc file and comfortable to share it with us (or convert it to text format using java com.ibm.jvm.format.TraceFormat <file> and inspect for sensitive data and share it fully/partially). Also you can generate Snap traces in text format from system core using jdmpview (this is cross platform utility so you can use not only Linux but Windows or Mac for this just need OpenJ9 to be installed):

<java_home>/bin/jdmpview -core <core>

You should see something like:

$ /jdk-11.0.4+4/bin/jdmpview -core core.20200807.111303.416355.0001.dmp
DTFJView version 4.29.5, using DTFJ version 1.12.29003
Loading image from DTFJ...

Warning : native libraries are not available for core.20200807.111303.416355.0001.dmp
For a list of commands, type "help"; for how to use "help", type "help help"
Available contexts (* = currently selected context) :

Source : file:///core.20200807.111303.416355.0001.dmp
    *0 : PID: 416381 : JRE 16 Linux amd64-64 (build 16-internal+0-adhoc..jdk)

>

and use command !snapformat <file> We can try to spot the problem in trace file first.

2) We can try Do-It-Yourself interactive debugging. From your side it would require to run jdmpview. Native debugger gdb running on Linux machine with problematic version of OpenJ9 installed might be required too if we are going too deep. We can do it on Eclipse OpenJ9 slack (9 to 5 EST) or over email or this item.

You can try speculative short cut in jdmpview first:

Get registers from crash, like:

> !gpinfo
Failing Thread: !j9vmthread 0x19a00
Failing Thread ID: 0x65a64 (416356)
gpInfo:
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007F61A344AFF0 Handler2=00007F61A31F49B0 InaccessibleAddress=00000000FFE05E30
RDI=00000007FFE5D380 RSI=00007F613549DC46 RAX=000000000037B968 RBX=0000000000019A00
RCX=0000000000000000 RDX=00000000FFE05E00 R8=00007F6135459454 R9=00007F619C00F970
R10=0000000000000006 R11=00007F6135813CE8 R12=000000000037B968 R13=000000000037B700
R14=00007F61A3DA0670 R15=0000000000030000
RIP=00007F61A33F9A73 GS=0000 FS=0000 RSP=00007F61A3DA0530
EFlags=0000000000010206 CS=0033 RBP=00007F61A355A0E0 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=00000000FFE05E30
xmm0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm1 ffffffffffffffff (f: 4294967296.000000, d: -nan)
xmm2 ffffffffffffffff (f: 4294967296.000000, d: -nan)
xmm3 00000000000799a0 (f: 498080.000000, d: 2.460842e-318)
xmm4 0000000000079ac0 (f: 498368.000000, d: 2.462265e-318)
xmm5 0000000000079a00 (f: 498176.000000, d: 2.461316e-318)
xmm6 0000000000079960 (f: 498016.000000, d: 2.460526e-318)
xmm7 000000000003ff60 (f: 261984.000000, d: 1.294373e-318)
xmm8 6c2f6176616a4928 (f: 1634355456.000000, d: 1.320535e+213)
xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/home/leob/Projects/openjdk-openj9/openj9-openjdk-jdk/build/linux-x86_64-server-release/images/jdk/lib/compressedrefs/libj9vm29.so
Module_base_address=00007F61A33BD000

The goal of exercise is to find problematic partially initialized j9class. j9class structure is allocated in native memory and aligned to 256 bytes (so an address ends to 00) and located below 4GB bar (an address should be below 0x100000000). I can speculate in a crash registers you published earlier it is RBX=0000000001A99900. So if the crash you have system dump for is exactly the same you should see 00 aligned value in RBX

The first slot of j9class is an eye-catcher 0x99669966, so first you can confirm that an address you have in RBX is a j9class so just dump memory content on address. For example:

> !j9x 0x00049b00
0x00049B00 :  0000000099669966 00007f61a355e040 [ f.f.....@.U.a... ]
0x00049B10 :  0000000000049a88 0000000080070001 [ ................ ]

If you can see eye-catcher 0000000099669966 dump memory as a j9class:

> !j9class 0x00049b00
J9Class at 0x49b00 {
  Fields for J9Class:
    0x0: UDATA eyecatcher = 0x0000000099669966 (2573637990)
    0x8: struct J9ROMClass* romClass = !j9romclass 0x00007F61A355E040
    0x10: void** superclasses = !j9x 0x0000000000049A88
    0x18: UDATA classDepthAndFlags = 0x0000000080070001 (2147942401)
    0x20: U32 classDepthWithFlags = 0x00000000 (0)
    0x24: U32 classFlags = 0x00000000 (0)
    0x28: struct J9ClassLoader* classLoader = !j9classloader 0x00007F619C09A1F8
    0x30: struct J9Object* classObject = !j9object 0x00000006140E0A20 // java/lang/Class
    0x38: volatile UDATA initializeStatus = 0x0000000000000001 (1)
    0x40: struct J9Method* ramMethods = !j9method 0x0000000000049900 // <FAULT>
    0x48: UDATA* ramStatics = !j9x 0x0000000000000001
    0x50: struct J9Class* arrayClass = !j9class 0x000000000015AC00 // [[I
    0x58: UDATA totalInstanceSize = 0x0000000000049900 (301312)
    0x60: struct J9ITable* lastITable = !j9itable 0x0000000000047C78
    0x68: UDATA* instanceDescription = !j9x 0x0000000000000001
    0x70: UDATA* instanceLeafDescription = !j9x 0x0000000000000001
    0x78: UDATA instanceHotFieldDescription = 0x0000000000000000 (0)
    0x80: UDATA selfReferencingField1 = 0x0000000000000000 (0)
    0x88: UDATA selfReferencingField2 = 0x0000000000000000 (0)
    0x90: struct J9Method* initializerCache = !j9method 0x0000000000000000
    0x98: UDATA romableAotITable = 0x00007F61A25AE510 (140057312421136)
    0xa0: UDATA packageID = 0x00007F619C09A1F8 (140057206432248)
    0xa8: struct J9Module* module = !j9module 0x00007F619C09B338
    0xb0: struct J9Class* subclassTraversalLink = !j9class 0x0000000000049700 // [S
    0xb8: struct J9Class* subclassTraversalReverseLink = !j9class 0x0000000000049F00 // [J
    0xc0: void** iTable = !j9x 0x0000000000047C78
    0xc8: UDATA castClassCache = 0x0000000000049701 (300801)
    0xd0: void** jniIDs = !j9x 0x0000000000000000
    0xd8: UDATA lockOffset = 0xFFFFFFFFFFFFFFFF (-1)
    0xe0: U32 paddingForGLRCounters = 0x00000000 (0)
    0xe4: U16 reservedCounter = 0x0000 (0)
    0xe6: U16 cancelCounter = 0x0000 (0)
    0xe8: UDATA newInstanceCount = 0x0000000000000001 (1)
    0xf0: IDATA backfillOffset = 0x0000000000000004 (4)
    0xf8: struct J9Class* replacedClass = !j9class 0x0000000000000000
    0x100: UDATA finalizeLinkOffset = 0x0000000000000000 (0)
    0x108: struct J9Class* nextClassInSegment = !j9class 0x0000000000049900 // int
    0x110: UDATA* ramConstantPool = !j9x 0x0000000000000000
    0x118: struct J9Object** callSites = !j9x 0x0000000000000000
    0x120: struct J9Object** methodTypes = !j9x 0x0000000000000000
    0x128: struct J9Object** varHandleMethodTypes = !j9x 0x0000000000000000
    0x130: struct J9VMCustomSpinOptions* customSpinOption = !j9vmcustomspinoptions 0x0000000000000000
    0x138: void** staticSplitMethodTable = !j9x 0x0000000000000000
    0x140: void** specialSplitMethodTable = !j9x 0x0000000000000000
    0x148: struct J9JITExceptionTable* jitMetaDataList = !j9jitexceptiontable 0x0000000000000000
    0x150: struct J9Class* gcLink = !j9class 0x0000000000000000
    0x158: struct J9Class* hostClass = !j9class 0x0000000000049B00 // [I
    0x160: struct J9Class* nestHost = !j9class 0x0000000000000000
    0x168: struct J9FlattenedClassCache* flattenedClassCache = !j9flattenedclasscache 0x0000000000000004
}
Class name: [I
To view static fields, use !j9statics 0x0000000000049B00
To view instance shape, use !j9classshape 0x0000000000049B00
>

If it is a problematic case you should see 0x30: struct J9Object* classObject = !j9object 0x0000000000000000 slot is set to NULL and 0x38: volatile UDATA initializeStatus = 0x0000000000000003 (3). So if it is a case use Class Name to figure out incompatibilities in your class libraries.

pwagland commented 4 years ago

Let's begin with the easy step first then 👍

Snap.20200914.094455.65828.0003.trc.gz

Following along with your walkthrough:

> !gpinfo
Failing Thread: !j9vmthread 0x1ab5400
Failing Thread ID: 0x12583 (75139)
gpInfo:
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000080
Handler1=00007F7F2517AED0 Handler2=00007F7F24A6C9F0 InaccessibleAddress=0000000000000000
RDI=0000000000000001 RSI=007FFFFFFFB00000 RAX=0000000000000000 RBX=0000000001A8F600
RCX=0000000000000000 RDX=00007F7F1C313000 R8=00007F7F24A79FC0 R9=000000000000000D
R10=00007F7EADAEA120 R11=0000000000000043 R12=000000000000004A R13=00007F7F2004A1A0
R14=00007F7F20023CB0 R15=0000000000000001
RIP=00007F7F1E70308A GS=0000 FS=0000 RSP=00007F7EADAEA210
EFlags=0000000000010246 CS=0033 RBP=0000000001A8FB00 ERR=0000000000000000
TRAPNO=000000000000000D OLDMASK=0000000000000000 CR2=00007F033A367D80
xmm0 0000000001a80e00 (f: 27790848.000000, d: 1.373050e-316)
xmm1 43e0000000000000 (f: 0.000000, d: 9.223372e+18)
xmm2 408f400000000000 (f: 0.000000, d: 1.000000e+03)
xmm3 2f6176616a5f6575 (f: 1784636800.000000, d: 1.840923e-80)
xmm4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm6 00007f7f1e9f2030 (f: 513744960.000000, d: 6.926007e-310)
xmm7 0000000000000008 (f: 8.000000, d: 3.952525e-323)
xmm8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm9 00000000000007de (f: 2014.000000, d: 9.950482e-321)
xmm10 00000000000007dd (f: 2013.000000, d: 9.945541e-321)
xmm11 00000000f0c42678 (f: 4039386624.000000, d: 1.995722e-314)
xmm12 00007f7f1e9f20b0 (f: 513745088.000000, d: 6.926007e-310)
xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm14 00007f7f1e9f2058 (f: 513744992.000000, d: 6.926007e-310)
xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/opt/java/openjdk11/lib/compressedrefs/libj9gc29.so
Module_base_address=00007F7F1E521000

RBX looks roughly the same, so looking into that…

> !j9x 0x1A8F600
0x01A8F600 :  0000000099669966 00007f7e903c8a38 [ f.f.....8.<.~... ]
0x01A8F610 :  0000000001a8e310 00000000000e0001 [ ................ ]

Looks like a j9class… so:

> !j9class 0x1A8F600
J9Class at 0x1a8f600 {
  Fields for J9Class:
    0x0: UDATA eyecatcher = 0x0000000099669966 (2573637990)
    0x8: struct J9ROMClass* romClass = !j9romclass 0x00007F7E903C8A38
    0x10: void** superclasses = !j9x 0x0000000001A8E310
    0x18: UDATA classDepthAndFlags = 0x00000000000E0001 (917505)
    0x20: U32 classDepthWithFlags = 0x00000000 (0)
    0x24: U32 classFlags = 0x00000000 (0)
    0x28: struct J9ClassLoader* classLoader = !j9classloader 0x00007F7F22B0DE78
    0x30: struct J9Object* classObject = !j9object 0x0000000000000000
    0x38: volatile UDATA initializeStatus = 0x0000000000000003 (3)
    0x40: struct J9Method* ramMethods = !j9method 0x0000000001A8F3C8 // com/redwood/scheduler/infrastructure/logging/api/Logger.debug(Ljava/lang/String;)V
    0x48: UDATA* ramStatics = !j9x 0x0000000000000000
    0x50: struct J9Class* arrayClass = !j9class 0x0000000000000000
    0x58: UDATA totalInstanceSize = 0x000000000000000D (13)
    0x60: struct J9ITable* lastITable = !j9itable 0x0000000001A8E340
    0x68: UDATA* instanceDescription = !j9x 0x0000000000000001
    0x70: UDATA* instanceLeafDescription = !j9x 0x0000000000000001
    0x78: UDATA instanceHotFieldDescription = 0x0000000000000000 (0)
    0x80: UDATA selfReferencingField1 = 0x0000000000000000 (0)
    0x88: UDATA selfReferencingField2 = 0x0000000000000000 (0)
    0x90: struct J9Method* initializerCache = !j9method 0x0000000000000000
    0x98: UDATA romableAotITable = 0x0000000000000000 (0)
    0xa0: UDATA packageID = 0x00007F7E903C8A39 (140181562493497)
    0xa8: struct J9Module* module = !j9module 0x0000000000000000
    0xb0: struct J9Class* subclassTraversalLink = !j9class 0x0000000001A8F200 // com/redwood/scheduler/script/java/valet/transform/impl/v2/ImportRuleSetPostPackHandler
    0xb8: struct J9Class* subclassTraversalReverseLink = !j9class 0x0000000001759E00 // org/apache/tools/ant/taskdefs/Javadoc$DocletParam
    0xc0: void** iTable = !j9x 0x0000000001A8E340
    0xc8: UDATA castClassCache = 0x0000000000000000 (0)
    0xd0: void** jniIDs = !j9x 0x0000000000000000
    0xd8: UDATA lockOffset = 0x0000000000000004 (4)
    0xe0: U32 paddingForGLRCounters = 0x00000000 (0)
    0xe4: U16 reservedCounter = 0x0000 (0)
    0xe6: U16 cancelCounter = 0x0000 (0)
    0xe8: UDATA newInstanceCount = 0x0000000000000000 (0)
    0xf0: IDATA backfillOffset = 0x0000000000000008 (8)
    0xf8: struct J9Class* replacedClass = !j9class 0x0000000000000000
    0x100: UDATA finalizeLinkOffset = 0x0000000000000000 (0)
    0x108: struct J9Class* nextClassInSegment = !j9class 0x0000000001A8F200 // com/redwood/scheduler/script/java/valet/transform/impl/v2/ImportRuleSetPostPackHandler
    0x110: UDATA* ramConstantPool = !j9x 0x0000000001A8E270
    0x118: struct J9Object** callSites = !j9x 0x0000000000000000
    0x120: struct J9Object** methodTypes = !j9x 0x0000000000000000
    0x128: struct J9Object** varHandleMethodTypes = !j9x 0x0000000000000000
    0x130: struct J9VMCustomSpinOptions* customSpinOption = !j9vmcustomspinoptions 0x0000000000000000
    0x138: void** staticSplitMethodTable = !j9x 0x0000000000000000
    0x140: void** specialSplitMethodTable = !j9x 0x0000000000000000
    0x148: struct J9JITExceptionTable* jitMetaDataList = !j9jitexceptiontable 0x0000000000000000
    0x150: struct J9Class* gcLink = !j9class 0x0000000000000000
    0x158: struct J9Class* hostClass = !j9class 0x0000000001A8F600 // com/redwood/scheduler/infrastructure/logging/api/Logger
    0x160: struct J9Class* nestHost = !j9class 0x0000000000000000
    0x168: struct J9FlattenedClassCache* flattenedClassCache = !j9flattenedclasscache 0x0000000000000000
}
Class name: com/redwood/scheduler/infrastructure/logging/api/Logger
To view static fields, use !j9statics 0x0000000001A8F600
To view instance shape, use !j9classshape 0x0000000001A8F600

Right, so that means that com/redwood/scheduler/infrastructure/logging/api/Logger is problematic class? Or is it com/redwood/scheduler/script/java/valet/transform/impl/v2/ImportRuleSetPostPackHandler?

The second is in a precompiled JAR, that has been working perfectly under Java (openJ9) 8 for quite a while. However it was modified in the past with proguard, maybe the stricter checking in Java 11 means that this JAR file is no detected as invalid?

tajila commented 4 years ago

It looks like the problem is com/redwood/scheduler/infrastructure/logging/api/Logger. It is uninitialized and has a null classobject.

0x30: struct J9Object* classObject = !j9object 0x0000000000000000
0x38: volatile UDATA initializeStatus = 0x0000000000000003 (3)
tajila commented 4 years ago

I took a look at the snap trace. There is not enough data in there to determine why the class was not able to be initialized. If the problem is easily reproduceable can you re-run with -Xtrace:maximal={j9vm{level10}},buffers=15m and generate a snaptrace using jdmpview with !snapformat once you get the system core. This will let us know if there was any exception while attempting to link the class.

pwagland commented 4 years ago

Thank @tajila, sadly this is all too reproducible for me ;-)

I'll see if I can run this tomorrow with these arguments.

pwagland commented 4 years ago

Hmm. right, well, just kicked off a build, that'll take a few hours to fail, I'll try to update this ticket on my tomorrow.

However, I just noticed that this is failing during the compilation of one of our targets. It has managed to compile this target four times, and fails on the fifth attempt. Each time it is compiling (sans timestamps) the same thing.

pwagland commented 4 years ago

Here is the requested !snapformat > SnapFormat.txt. SnapFormat.txt.gz

I have double checked, as far as I can tell, Java dies while the Java compiler is being called (ejc via ant), and the Logger class is the very first project compiled, and has been included in many hundreds of compilations until this point. It is not recompiled at this point, but is included via a JAR file. It also appears (based on a limited number of test runs) to always fail during compilation, but it does fail while building different targets.

Edit: Note that the offsets are slightly different, however it is the same class that appears to be causing the issue. Offset in latest dump was 0x1A93600.

pwagland commented 4 years ago

@tajila, did this help to determine why this is failing? I am at a loss as to how this could happen, and why we see it at random points in our build cycle. Is there anything that I can do to help move this issue forward?

tajila commented 4 years ago

@pwagland The last time we encountered an issue similar to this it was because the class failed to load properly due to a classloading constraint error. The reason why the crash occurs at random times is because the bad class may be scanned at any time by the GC (assuming that is the case).

We are currently working on a solution to fix the cause of the crash but that may take some time. In the short term I suggest first determining if the scenario I described is indeed the problem in this case, and if so, fixing the root cause (finding the classload error).

I was not able to get much more details from the last snaptrace you sent. The time at which the class fails to load and the time at which the crash occurs may be far enough apart that the bad class is removed from the trace buffers. The following trace option targets the specific cases which I think may be causing the error.

-Xtrace:maximal={j9vm.80,j9vm.302,j9vm.471,j9vm.95,j9vm.91,j9vm.2,j9vm.223,j9vm.94,j9vm.92},buffers=30m

While we wait for those results, in the core files that you already have can you try using jdmpview:

!classforname com/redwood/scheduler/infrastructure/logging/api/Logger

If there are multiple versions of this class then this may indicate a classloading constraint problem that I am suspecting. If you see multiple classes try !j9vtables [classAddr]. If it is a classloading constraints issue the vtables should be empty.

pwagland commented 4 years ago

Hi @tajila, right, I will kick off a new build with the Xtrace that you asked for, you should see results from that tomorrow.

As to the !classforname:

> !classforname com/redwood/scheduler/infrastructure/logging/api/Logger
Searching for classes named 'com/redwood/scheduler/infrastructure/logging/api/Logger' in VM=7f715400f290
!j9class 0x0000000001D1E800 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001CF8600 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001CC8700 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001C95E00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001AF2800 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001A93600 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001A6D400 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001A3D900 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001A0B100 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x000000000186BC00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x000000000182F200 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001821600 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x00000000017F0C00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x00000000017C2F00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001798000 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001767400 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001212F00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x00000000011ED600 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x00000000011ADE00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001172000 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001157200 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001148B00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001132F00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x000000000111EE00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001111200 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x00000000010FD100 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x00000000010EF500 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000001090D00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x000000000105A300 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x000000000102D200 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000000FEE700 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000000FC2100 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000000F8BB00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x000000000095BB00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000000935900 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x00000000008B3B00 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000000817800 named com/redwood/scheduler/infrastructure/logging/api/Logger
!j9class 0x0000000000536400 named com/redwood/scheduler/infrastructure/logging/api/Logger
Found 38 class(es) named com/redwood/scheduler/infrastructure/logging/api/Logger

Some of these have a classObject, and some don't:

> J9Class at 0x935900 {
    0x30: struct J9Object* classObject = !j9object 0x00000000A8C050F0 // java/lang/Class
    0x38: volatile UDATA initializeStatus = 0x0000000000000001 (1)
> J9Class at 0x8b3b00 {
    0x30: struct J9Object* classObject = !j9object 0x0000000000000000
    0x38: volatile UDATA initializeStatus = 0x0000000000000003 (3)
> J9Class at 0x817800 {
    0x30: struct J9Object* classObject = !j9object 0x00000000A8B58D90 // java/lang/Class
    0x38: volatile UDATA initializeStatus = 0x0000000000000001 (1)
> J9Class at 0x536400 {
    0x30: struct J9Object* classObject = !j9object 0x00000000A1213BA0 // java/lang/Class
    0x38: volatile UDATA initializeStatus = 0x0000000000000001 (1)

These are from the second trace, where we died processing 0x1a93600, which indeed doesn't isn't initialised:

> J9Class at 0x1a93600 {
    0x30: struct J9Object* classObject = !j9object 0x0000000000000000
    0x38: volatile UDATA initializeStatus = 0x0000000000000003 (3)
pwagland commented 4 years ago

Thinking about this some more, and then figuring that too much information is probably better than not enough, I also add, that there are 38 classes loaded, and also 38 different class loaders, at lsat based on the 0x28 struct J9ClassLoader record. I kind of expect them all being different classloaders to be honest, since we are doing a lot of independent compiles, so this class/JAR file gets pulled into a log of different ant java compilation tasks, which are all done "in JVM".

27 of the classes were initializeStatus = 3, and 11 were initializeStatus = 1. Based on the sort | uniq -c:

  27    0x30: struct J9Object* classObject = !j9object 0x0000000000000000
   1    0x30: struct J9Object* classObject = !j9object 0x00000000A1213BA0 // java/lang/Class
   1    0x30: struct J9Object* classObject = !j9object 0x00000000A8B58D90 // java/lang/Class
   1    0x30: struct J9Object* classObject = !j9object 0x00000000A8C050F0 // java/lang/Class
   1    0x30: struct J9Object* classObject = !j9object 0x00000000AEE85E20 // java/lang/Class
   1    0x30: struct J9Object* classObject = !j9object 0x00000000B4427CB0 // java/lang/Class
   1    0x30: struct J9Object* classObject = !j9object 0x00000000B444DD98 // java/lang/Class
   1    0x30: struct J9Object* classObject = !j9object 0x00000000B7E44868 // java/lang/Class
   1    0x30: struct J9Object* classObject = !j9object 0x00000000B7E54858 // java/lang/Class
   3    0x30: struct J9Object* classObject = !j9object 0xFFFFFFFFFFFFFFFF<FAULT>
  11    0x38: volatile UDATA initializeStatus = 0x0000000000000001 (1)
  27    0x38: volatile UDATA initializeStatus = 0x0000000000000003 (3)
tajila commented 4 years ago

Thanks for the info:

In the core file above what is the output of:

!j9vtables 0x8b3b00 ?

Also if you compare !j9class 0x935900 (a good class) and j9class 0x8b3b00 (a bad class) what is the value for the romClass?

pwagland commented 4 years ago
> !j9vtables 0x8b3b00
VTable for j9class 0x00000000008B3B00  (size=0 - 1 for skipped resolve method)
    Interpreted     Jitted
> J9Class at 0x935900 {
    0x8: struct J9ROMClass* romClass = !j9romclass 0x00007F71553DD910
    0x30: struct J9Object* classObject = !j9object 0x00000000A8C050F0 // java/lang/Class
    0x38: volatile UDATA initializeStatus = 0x0000000000000001 (1)
> J9Class at 0x8b3b00 {
    0x8: struct J9ROMClass* romClass = !j9romclass 0x00007F71551E94E8
    0x30: struct J9Object* classObject = !j9object 0x0000000000000000
    0x38: volatile UDATA initializeStatus = 0x0000000000000003 (3)

All of the romClass pointers are different. Expanding out the above to romClass:

> !j9romclass 0x00007F71553DD910
J9ROMClass at 0x7f71553dd910 {
  Fields for J9ROMClass:
    0x0: U32 romSize = 0x00000490 (1168)
    0x4: U32 singleScalarStaticCount = 0x00000000 (0)
    0x8: J9SRP(J9UTF8) className = !j9utf8 0x00007F71553DDCEA
    0xc: J9SRP(J9UTF8) superclassName = !j9utf8 0x00007F71380285DE
    0x10: U32 modifiers = 0x00000601 (1537)
    0x14: U32 extraModifiers = 0x08800000 (142606336)
    0x18: U32 interfaceCount = 0x00000000 (0)
    0x1c: J9SRP(J9SRP(J9UTF8)) interfaces = !j9utf8 0x00007F71553DDD2C
    0x20: U32 romMethodCount = 0x0000000D (13)
    0x24: J9SRP(J9ROMMethod) romMethods = !j9rommethod 0x00007F71553DDA30
    0x28: U32 romFieldCount = 0x00000000 (0)
    0x2c: J9SRP(J9ROMFieldShape) romFields = !j9romfieldshape 0x00007F71553DDA10
    0x30: U32 objectStaticCount = 0x00000000 (0)
    0x34: U32 doubleScalarStaticCount = 0x00000000 (0)
    0x38: U32 ramConstantPoolCount = 0x0000000A (10)
    0x3c: U32 romConstantPoolCount = 0x0000000A (10)
    0x40: J9WSRP(U8) intermediateClassData = 0x90 (144)
    0x48: U32 intermediateClassDataLength = 0x00000490 (1168)
    0x4c: U32 instanceShape = 0x0000000E (14)
    0x50: J9SRP(U32) cpShapeDescription = 0x01010200 (16843264)
    0x54: J9SRP(J9UTF8) outerClassName = !j9utf8 0x0000000000000000
    0x58: U32 memberAccessFlags = 0x00000000 (0)
    0x5c: U32 innerClassCount = 0x00000000 (0)
    0x60: J9SRP(J9SRP(J9UTF8)) innerClasses = !j9utf8 0x00007F71553DDD2C
    0x64: J9SRP nestHost = !j9x 0x0000000000000000
    0x68: U16 nestMemberCount = 0x0000 (0)
    0x6a: U16 unused = 0x0000 (0)
    0x6c: J9SRP nestMembers = !j9x 0x00007F71553DDA10
    0x70: U16 majorVersion = 0x0034 (52)
    0x72: U16 minorVersion = 0x0000 (0)
    0x74: U32 optionalFlags = 0x00000001 (1)
    0x78: J9SRP(U32) optionalInfo = 0x0000010E (270)
    0x7c: U32 maxBranchCount = 0x00000003 (3)
    0x80: U32 methodTypeCount = 0x00000000 (0)
    0x84: U32 varHandleMethodTypeCount = 0x00000000 (0)
    0x88: U32 bsmCount = 0x00000000 (0)
    0x8c: U32 callSiteCount = 0x00000000 (0)
    0x90: J9SRP callSiteData = !j9x 0x0000000000000000
    0x94: U32 classFileSize = 0x00000545 (1349)
    0x98: U32 classFileCPCount = 0x00000037 (55)
    0x9c: U16 staticSplitMethodRefCount = 0x0000 (0)
    0x9e: U16 specialSplitMethodRefCount = 0x0000 (0)
    0xa0: J9SRP(U16) staticSplitMethodRefIndexes = <FAULT>
    0xa4: J9SRP(U16) specialSplitMethodRefIndexes = <FAULT>
    0xa8: J9SRP varHandleMethodTypeLookupTable = !j9x 0x0000000000000000
    0xac: U32 padding = 0x00000000 (0)
}
> !j9romclass 0x00007F71551E94E8
J9ROMClass at 0x7f71551e94e8 {
  Fields for J9ROMClass:
    0x0: U32 romSize = 0x000004B0 (1200)
    0x4: U32 singleScalarStaticCount = 0x00000000 (0)
    0x8: J9SRP(J9UTF8) className = !j9utf8 0x00007F71551E98EE
    0xc: J9SRP(J9UTF8) superclassName = !j9utf8 0x00007F71380285DE
    0x10: U32 modifiers = 0x00000601 (1537)
    0x14: U32 extraModifiers = 0x08800000 (142606336)
    0x18: U32 interfaceCount = 0x00000000 (0)
    0x1c: J9SRP(J9SRP(J9UTF8)) interfaces = !j9utf8 0x00007F71553627FE
    0x20: U32 romMethodCount = 0x0000000D (13)
    0x24: J9SRP(J9ROMMethod) romMethods = !j9rommethod 0x00007F71551E9608
    0x28: U32 romFieldCount = 0x00000000 (0)
    0x2c: J9SRP(J9ROMFieldShape) romFields = !j9romfieldshape 0x00007F71551E95E8
    0x30: U32 objectStaticCount = 0x00000000 (0)
    0x34: U32 doubleScalarStaticCount = 0x00000000 (0)
    0x38: U32 ramConstantPoolCount = 0x0000000A (10)
    0x3c: U32 romConstantPoolCount = 0x0000000A (10)
    0x40: J9WSRP(U8) intermediateClassData = 0xB0 (176)
    0x48: U32 intermediateClassDataLength = 0x000004B0 (1200)
    0x4c: U32 instanceShape = 0x0000000E (14)
    0x50: J9SRP(U32) cpShapeDescription = 0x01010200 (16843264)
    0x54: J9SRP(J9UTF8) outerClassName = !j9utf8 0x0000000000000000
    0x58: U32 memberAccessFlags = 0x00000000 (0)
    0x5c: U32 innerClassCount = 0x00000000 (0)
    0x60: J9SRP(J9SRP(J9UTF8)) innerClasses = !j9utf8 0x00007F71553627FE
    0x64: J9SRP nestHost = !j9x 0x0000000000000000
    0x68: U16 nestMemberCount = 0x0000 (0)
    0x6a: U16 unused = 0x0000 (0)
    0x6c: J9SRP nestMembers = !j9x 0x00007F71551E95E8
    0x70: U16 majorVersion = 0x0034 (52)
    0x72: U16 minorVersion = 0x0000 (0)
    0x74: U32 optionalFlags = 0x00000001 (1)
    0x78: J9SRP(U32) optionalInfo = 0x0000013A (314)
    0x7c: U32 maxBranchCount = 0x00000003 (3)
    0x80: U32 methodTypeCount = 0x00000000 (0)
    0x84: U32 varHandleMethodTypeCount = 0x00000000 (0)
    0x88: U32 bsmCount = 0x00000000 (0)
    0x8c: U32 callSiteCount = 0x00000000 (0)
    0x90: J9SRP callSiteData = !j9x 0x0000000000000000
    0x94: U32 classFileSize = 0x00000545 (1349)
    0x98: U32 classFileCPCount = 0x00000037 (55)
    0x9c: U16 staticSplitMethodRefCount = 0x0000 (0)
    0x9e: U16 specialSplitMethodRefCount = 0x0000 (0)
    0xa0: J9SRP(U16) staticSplitMethodRefIndexes = <FAULT>
    0xa4: J9SRP(U16) specialSplitMethodRefIndexes = <FAULT>
    0xa8: J9SRP varHandleMethodTypeLookupTable = !j9x 0x0000000000000000
    0xac: U32 padding = 0x00000000 (0)
}
pwagland commented 4 years ago

Right, digging around a little further I see that (probably?) all of the initializeStatus = 3 classes have been loaded by one classLoader, and all of the initializeStatus = 1 have been loaded by a different (the normal ant) classLoader. This feels fishy to me ;-)

This ClassLoader works perfectly when compiling under Java 8. but somehow appears to be causing some form of failure when run OpenJ9 on our server, again I cannot reproduce this locally at all, neither under docker, nor on my Mac. :-(

tajila commented 4 years ago

Okay the fact that the vtable is empty for the bad class suggests a classloading constraints issue. To confirm try:

!dumpromclass 0x00007F71553DD910 (good class) !dumpromclass 0x00007F71551E94E8 (bad class)

The romClasses need to be identical or else class loading constraints are not met.

pwagland commented 4 years ago

They are not identical:

23:46 $ diff bad.romclass good.romclass
1c1
< ROM Size: 0x4b0 (1200)
---
> ROM Size: 0x490 (1168)
9c9
< Intermediate Class Data (1200 bytes): 7f71551e94e8
---
> Intermediate Class Data (1168 bytes): 7f71553dd910
144c144
<    37 genericReturn
---
>    37 return0

Is there any way to see from the dump from which file a class was loaded? I think that these two classes probably are being loaded from different files, so they might be different.

I think (but this is why I would to figure out from which JAR files the respective class files were loaded) that files loaded by Ant are actually executed, while the files loaded by the other classLoader are only used by compilation, and are not actually run.

pwagland commented 4 years ago

SnapFormat.txt.gz

In the log file, it does indicate that there is a constraint violation:

22:24:22.964659000  0x14E00 j9vm.302             Entry      >setCurrentExceptionUTF exceptionNumber=43 detailUTF=loading constraint violation: loader "com/redwood/scheduler/api/valet/FallbackClassLoader@262fa9d8" previously initiated loading for a different type with name "com/redwood/scheduler/infrastructure/logging/api/Logger" defined by loader "org/apache/tools/ant/AntClassLoader@ea10924"
22:24:22.964668000  0x14E00 j9vm.294             Entry      >setCurrentException index=43 constructorIndex=0 detailMessage=0xfc7568b8

I'm not sure that I 100% understand this message, this is saying that the FallbackClassLoader has been asked for Logger twice, and the first time it gave back a copy from AntClassLoader, and this time it is trying to give back a different Object?

This might be possible, since FallbackClassLoader falls back to the underlying loader (assumedly Ant) for the LoggerFactory, and that class probably gets its Logger that it returns from the Ant ClassLoader. Later when the code we want to execute tries to get a Logger, it probably asks the ClassLoader for a Logger, and then gets confused since the LoggerFactory already instantiated one without telling the fallback ClassLoader? And now they don't match?

pwagland commented 4 years ago

Based on the responses in this issue, I have managed to create a workaround to the JVM crashing.

However, I am not sure whether I am just working around the issue, or whether I am actually fixing an issue with our code. If this didn't crash the JVM, what exception would/should I expect to see?

For reference, the source to FallbackClassLoader looks (roughly) like:

public class FallbackClassLoader
  extends URLClassLoader
{
  private final ClassLoader fallback;

  public FallbackClassLoader(URL[] urls, ClassLoader newFallBack)
  {
    super(urls, null);
    fallback = newFallBack;
  }

  @Override
  protected Class<?> findClass(String name)
    throws ClassNotFoundException
  {
    try
    {
      if (!"<…>.LoggerFactory".equals(name))
      {
        return super.findClass(name);
      }
    }
    catch (LinkageError e)
    {
      // Could not load class, so falling back to the one from the alternative loader.
      if (!LinkageError.class.getName().equals(e.getClass().getName()))
      {
        throw e;
      }
    }
    return fallback.loadClass(name);
  }
}

Adding Logger to that if statement causes the JVM to not crash anymore, which probably also means that the constraint violation doesn't occur anymore.

However, what constraint is being violated? The running code never sees a ClassCastException or some such, so what is actually going wrong here?

tajila commented 4 years ago

Is there any way to see from the dump from which file a class was loaded? I think that these two classes probably are being loaded from different files, so they might be different.

Yes, and I believe this is the next step. So we will be able to find the original file for the good class, but Im not sure we can do the same for the bad class.

Note: By "good" I simply mean the first first class that was loaded, and "bad" is the second version.

Lets start with the good class: dump the classObject !j9object 0x00000000A8C050F0 this is from !j9class 0x935900

Then dump protectionDomain: !fj9object [addr] from Ljava/security/ProtectionDomain; protectionDomain = !fj9object

Then dump codesource: !fj9object [addr] from Ljava/security/CodeSource; codesource = !fj9object

Then dump location: !fj9object [addr] from Ljava/net/URL; location = !fj9object

Then dump path: !fj9object [addr] from Ljava/lang/String; path = !fj9object

Dump file to confirm: !fj9object [addr] from Ljava/lang/String; file = !fj9object

So thats where the good class came from. Perhaps this info may help you find where the bad class is coming from.

Since we don't have the classObject for the bad class, it will be more difficult since that data will be thrown away after the loading constraint violation. The only way I can think of finding it right now is by creating a patched build for you to try. If you like I can create that for you.

tajila commented 4 years ago

I'm not sure that I 100% understand this message, this is saying that the FallbackClassLoader has been asked for Logger twice, and the first time it gave back a copy from AntClassLoader, and this time it is trying to give back a different Object?

What it is saying is that is that the loading of the good class was initiated by FallbackClassLoader and defined by AntClassLoader.

Now, the bad class classload is being initiated by a different loader, it may have the same name but it will be a different instance of the loader.

you can check by doing:

!j9class 0x8b3b00 (bad class)

then !j9classloader [addr] from 0x28: struct J9ClassLoader* classLoader = !j9classloader

then dump the classloader object with !j9object [addr] from 0x18: struct J9Object* classLoaderObject = !j9object

You run !classloaderssummary in jdmpview it will give you a summary of all the classloaders.

tajila commented 4 years ago

If the classloader of the good and bad class are of different types that may help us identify where the problem is coming from.

pwagland commented 4 years ago

@tajila, thanks for your help in tracking this down!

The classloader of the good and bad types is different, this is why I posted the (equivalent) source to the our classloader.

Going through the requested answers from your 3rd last comment:

dump the classObject !j9object 0x00000000A8C050F0 this is from !j9class 0x935900

> !j9object 0x00000000A8C050f0
!J9Object 0x00000000A8C050F0 {
    struct J9Class* clazz = !j9class 0x41F00 // java/lang/Class
    Object flags = 0x0000000A;
    I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
    J vmRef = 0x0000000000935900 (offset = 4) (java/lang/Class)
    Ljava/lang/ClassLoader; classLoader = !fj9object 0xa0521698 (offset = 12) (java/lang/Class)
    Ljava/lang/Module; module = !fj9object 0xa0521720 (offset = 16) (java/lang/Class)
    Ljava/security/ProtectionDomain; protectionDomain = !fj9object 0xa7e74630 (offset = 20) (java/lang/Class)
    Ljava/lang/String; classNameString = !fj9object 0xa1213bf8 (offset = 24) (java/lang/Class)
    Ljava/lang/String; fileNameString = !fj9object 0x0 (offset = 28) (java/lang/Class)
    Ljava/lang/Class$AnnotationVars; annotationVars = !fj9object 0x0 (offset = 32) (java/lang/Class)
    Ljava/lang/ClassValue$ClassValueMap; classValueMap = !fj9object 0x0 (offset = 36) (java/lang/Class)
    Ljava/lang/Class$EnumVars; enumVars = !fj9object 0x0 (offset = 40) (java/lang/Class)
    Ljava/lang/J9VMInternals$ClassInitializationLock; initializationLock = !fj9object 0x0 (offset = 44) (java/lang/Class)
    Ljava/lang/Object; methodHandleCache = !fj9object 0x0 (offset = 48) (java/lang/Class)
    Ljava/lang/Class$ClassRepositoryHolder; classRepoHolder = !fj9object 0x0 (offset = 52) (java/lang/Class)
    Ljava/lang/Class$AnnotationCache; annotationCache = !fj9object 0x0 (offset = 56) (java/lang/Class)
    [Ljava/lang/Class; cachedInterfaces = !fj9object 0x0 (offset = 60) (java/lang/Class)
    Ljava/lang/Class; cachedDeclaringClass = !fj9object 0x0 (offset = 64) (java/lang/Class)
    Ljava/lang/Class; cachedEnclosingClass = !fj9object 0x0 (offset = 68) (java/lang/Class)
    Ljava/lang/Class; nestHost = !fj9object 0x0 (offset = 72) (java/lang/Class)
    Ljava/lang/Class$ReflectCache; reflectCache = !fj9object 0x0 (offset = 76) (java/lang/Class)
}

Then dump protectionDomain: !fj9object [addr] from Ljava/security/ProtectionDomain; protectionDomain = !fj9object

> !fj9object 0xa7e74630
!J9Object 0x00000000A7E74630 {
    struct J9Class* clazz = !j9class 0xD3200 // java/security/ProtectionDomain
    Object flags = 0x00000000;
    I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
    Ljava/security/CodeSource; codesource = !fj9object 0xa7f69800 (offset = 4) (java/security/ProtectionDomain)
    Ljava/lang/ClassLoader; classloader = !fj9object 0xa0521698 (offset = 8) (java/security/ProtectionDomain)
    [Ljava/security/Principal; principals = !fj9object 0xa7f69a68 (offset = 12) (java/security/ProtectionDomain)
    Ljava/security/PermissionCollection; permissions = !fj9object 0xa00be788 (offset = 16) (java/security/ProtectionDomain)
    Z hasAllPerm = 0x00000000 (offset = 24) (java/security/ProtectionDomain)
    Z staticPermissions = 0x00000000 (offset = 28) (java/security/ProtectionDomain)
    Ljava/security/ProtectionDomain$Key; key = !fj9object 0xa7f69a78 (offset = 20) (java/security/ProtectionDomain)
}

Then dump codesource: !fj9object [addr] from Ljava/security/CodeSource; codesource = !fj9object

> !fj9object 0xa7f69800
!J9Object 0x00000000A7F69800 {
    struct J9Class* clazz = !j9class 0x5FA00 // java/security/CodeSource
    Object flags = 0x00000000;
    I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
    Ljava/net/URL; location = !fj9object 0xa7f69820 (offset = 4) (java/security/CodeSource)
    [Ljava/security/CodeSigner; signers = !fj9object 0x0 (offset = 8) (java/security/CodeSource)
    [Ljava/security/cert/Certificate; certs = !fj9object 0x0 (offset = 12) (java/security/CodeSource)
    Ljava/net/SocketPermission; sp = !fj9object 0x0 (offset = 16) (java/security/CodeSource)
    Ljava/security/cert/CertificateFactory; factory = !fj9object 0x0 (offset = 20) (java/security/CodeSource)
    Ljava/lang/String; locationNoFragString = !fj9object 0xa7f69860 (offset = 24) (java/security/CodeSource)
}

Then dump location: !fj9object [addr] from Ljava/net/URL; location = !fj9object

> !fj9object 0xa7f69820
!J9Object 0x00000000A7F69820 {
    struct J9Class* clazz = !j9class 0xDC600 // java/net/URL
    Object flags = 0x00000000;
    I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
    Ljava/lang/String; protocol = !fj9object 0xa7f69870 (offset = 4) (java/net/URL)
    Ljava/lang/String; host = !fj9object 0xa0005100 (offset = 8) (java/net/URL)
    I port = 0xFFFFFFFF (offset = 48) (java/net/URL)
    Ljava/lang/String; file = !fj9object 0xa7f69880 (offset = 12) (java/net/URL)
    Ljava/lang/String; query = !fj9object 0x0 (offset = 16) (java/net/URL)
    Ljava/lang/String; authority = !fj9object 0x0 (offset = 20) (java/net/URL)
    Ljava/lang/String; path = !fj9object 0xa7f69880 (offset = 24) (java/net/URL)
    Ljava/lang/String; userInfo = !fj9object 0x0 (offset = 28) (java/net/URL)
    Ljava/lang/String; ref = !fj9object 0x0 (offset = 32) (java/net/URL)
    Ljava/net/InetAddress; hostAddress = !fj9object 0x0 (offset = 36) (java/net/URL)
    Ljava/net/URLStreamHandler; handler = !fj9object 0xa00491c0 (offset = 40) (java/net/URL)
    I hashCode = 0xFFFFFFFF (offset = 52) (java/net/URL)
    Ljava/net/UrlDeserializedState; tempState = !fj9object 0x0 (offset = 44) (java/net/URL)
}

Then dump path: !fj9object [addr] from Ljava/lang/String; path = !fj9object

> !fj9object 0xa7f69880
J9VMJavaLangString at 0x00000000A7F69880 {
struct J9Class* clazz = !j9class 0x47A00 // java/lang/String
Object flags = 0x00000000;
[B value = !fj9object 0xa7f69980 (offset = 0) (java/lang/String)
B coder = 0x00000001 (offset = 4) (java/lang/String)
I hashCode = 0x00000000 (offset = 8) (java/lang/String)
"/XXXXXX/releases/jbs-tools/valet.jar"
}

Dump file to confirm: !fj9object [addr] from Ljava/lang/String; file = !fj9object

file and path are the same object.

So thats where the good class came from. Perhaps this info may help you find where the bad class is coming from.

Based on this, I am now about 90% sure that the "Bad" file comes from out compiler plugin.

A quick explanation, we have a "commons" lib that includes both Logger and LoggerFactory. We also have two projects valet and "compiler", where valet loads compiler via FallbackClassLoader. This Fallback delegates off to the valet class loader for two classes, one for an interface that is shared by both, which makes sense, and for LoggerFactory, the exact reason for this latter is unknown to me, but is possibly related to the configuration.

Anyway, LoggerFactory has a getLogger method, which returns (surprise!) a Logger. And now we get to the bit that I'm surprised actually works, but the LoggerFactory, (from valet) will return a Logger from valet. However, when compiler asks for the Logger class, it will get one from the "compiler" class path, which should be a different class. However, I would expect that this should give us a ClassCastException somewhere, instead we seem to be getting these "constraint violations", which we ignore, and which then later cause the JVM to crash. This then gives us an easy workaround, also pass passthrough the other classes that are needed, so that we don't do this only when the linkage error occurs.

Since we don't have the classObject for the bad class, it will be more difficult since that data will be thrown away after the loading constraint violation. The only way I can think of finding it right now is by creating a patched build for you to try. If you like I can create that for you.

I don't think that this is required for the moment, at least not assuming that my above story makes some sense to you :-)

In any case, based on this analysis, I have recreated what I am pretty sure is a reproduction of this case, and attached it here. To run this, simply unpack the zip and run runtest.sh. The code is pretty short, so you can easily enough verify that this is safe, it's an infinite loop, but on my laptop and docker instance this crashes within a few seconds, not getting past about 460 odd iterations on either my Mac or docker Java 11.0.8. Java 8 on the Mac gets to about 750 iterations, and java 8 on Linux doesn't seem to crash at all… at least not within my patience limits :-D

Github-openj9-10541-reproduction.zip

pwagland commented 4 years ago

Hi @tajila, I see that you have a fix for the crashing issue. Thanks a heap for that! Do you know what release this will make it into?

tajila commented 4 years ago

@pwagland Its the fix is currently being reviewed then it has to be tested. Ill try to get it in for the 0.23 Milestone (deadline Oct 1) which should be released sometime late October. Otherwise, you can get a binary from adoptopenjdk nightlies as soon as the PR is merged.