Closed ForceRs closed 2 months ago
I would recommend to open a Sales Force case here : https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues
The following link explains how to collect MustGather data:
https://www.ibm.com/support/pages/mustgather-read-first-runtimes-java-technology
@manqingl FYI
If you by some reason can't / don't want open case with Service you can collect system cores, javacores, GC verbose logs from Java 11 and Java 8 runs both and provide them to us directly (myself or @pshipton). Files for working case Java 8 can be collected at vmstop
event.
From my experience Java 11 takes more object heap space for initialization, we have increased -Xmx for a few very tiny tests.
BTW is it working with slightly larger -Xmx (stabilized there) or there is out-of-control memory leak?
I'll look into the Sales Force case. BTW, https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues fails with message: "Hmm. We’re having trouble finding that site." The MustGather link is okay.
It is an out-of-control memory leak. It runs about twice as long if I double the heap.
@ForceRs I just opened it (https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues) and it was fine. Are you an IBM internal?
No. Not IBM internal. We're a vendor and IBM Business Partner.
I tried to go to https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues from all my browsers, both from work and home PCs, but all failed (even just https://w3.ibm.com fails). I also tried to access it from my phone with no Wi-Fi connection (thinking some kind of filtering was preventing the connection), but no go. II'll send stuff to dmitripivkine, as suggested Aug 22, 2024, 12:41 PM.
Based on the call site information, this appears to be related to JNI local references (particularly JNI reference frames). I can see no reason why this would differ between JDK levels.
I am actively gathering the requested information. We hit a speed bump with Transaction dump (TDump) generation. Working with our Systems Programmer to resolve it. I hope to provide files 26Aug2024.
dmitripivkine, I have the requested diagnostic files. I've zipped them up into a 397MB file. How do I privately share the file with you? I'd attach it here, but it's large and there's some SDSF output we'd rather not make public.
You can use any file sharing service and send me the link over email Dmitri_Pivkine@ca.ibm.com. Alternatively if you use Slack you can try to find me in Eclipse OpenJ9 workspace and download file there (I am not sure about limits).
Email with download instructions sent.
@joransiu fyi
This is a heap space out of memory. Have you tried looking at the heapdump files with MAT (https://eclipse.dev/mat/) to see what's consuming the heap and growing?
I used the IBM Heap Analyzer. It didn't report any leaks. It did warn about many root objects. If I double the heap size in Java 11, it takes twice as long to run out of memory. It never runs out of memory using Java 8.
jnicsup.cpp:1964 is jniLocalReferences.
From EOJ-OutOfMemory/javacore.20240826.094436.16908590.0015.txt
1TISIGINFO Dump Event "systhrow" (00040000) Detail "java/lang/OutOfMemoryError" "Java heap space" received
z/OS-Release-11.0.23.0
1STHEAPTOTAL Total memory: 8388608 (0x0000000000800000)
1STHEAPINUSE Total memory in use: 7674288 (0x00000000007519B0)
1STHEAPFREE Total memory free: 714320 (0x00000000000AE650)
1STGCHTYPE GC History
3STHSTTYPE 14:44:36:885085101 GMT j9mm.100 - J9AllocateObject() returning NULL! 16 bytes requested for object of class 00000000326FA900 from memory space 'Generational' id=000000504CC58208
3STHSTTYPE 14:44:36:884238429 GMT j9mm.134 - Allocation failure end: newspace=718416/1835008 oldspace=0/6553600 loa=0/65536
3STHSTTYPE 14:44:36:883775257 GMT j9mm.470 - Allocation failure cycle end: newspace=718416/1835008 oldspace=0/6553600 loa=0/65536
3STHSTTYPE 14:44:36:883123632 GMT j9mm.475 - GlobalGC end: workstackoverflow=0 overflowcount=0 memory=718416/8388608
3STHSTTYPE 14:44:36:881287429 GMT j9mm.90 - GlobalGC collect complete
3STHSTTYPE 14:44:36:880396273 GMT j9mm.137 - Compact end: bytesmoved=0
3STHSTTYPE 14:44:36:738973992 GMT j9mm.136 - Compact start: reason=low free space (less than 4%)
3STHSTTYPE 14:44:36:738437304 GMT j9mm.57 - Sweep end
3STHSTTYPE 14:44:36:732883539 GMT j9mm.56 - Sweep start
3STHSTTYPE 14:44:36:732435960 GMT j9mm.94 - Class unloading end: classloadersunloaded=0 classesunloaded=0
3STHSTTYPE 14:44:36:732308773 GMT j9mm.60 - Class unloading start
3STHSTTYPE 14:44:36:730236132 GMT j9mm.55 - Mark end
3STHSTTYPE 14:44:36:616182070 GMT j9mm.54 - Mark start
3STHSTTYPE 14:44:36:613724164 GMT j9mm.474 - GlobalGC start: globalcount=35
3STHSTTYPE 14:44:36:611503351 GMT j9mm.475 - GlobalGC end: workstackoverflow=0 overflowcount=0 memory=718416/8388608
3STHSTTYPE 14:44:36:609556632 GMT j9mm.90 - GlobalGC collect complete
3STHSTTYPE 14:44:36:608519429 GMT j9mm.137 - Compact end: bytesmoved=16976
3STHSTTYPE 14:44:36:488491039 GMT j9mm.136 - Compact start: reason=low free space (less than 4%)
3STHSTTYPE 14:44:36:487928757 GMT j9mm.57 - Sweep end
3STHSTTYPE 14:44:36:480018789 GMT j9mm.56 - Sweep start
3STHSTTYPE 14:44:36:479222132 GMT j9mm.55 - Mark end
3STHSTTYPE 14:44:36:368230273 GMT j9mm.54 - Mark start
3STHSTTYPE 14:44:36:365507882 GMT j9mm.474 - GlobalGC start: globalcount=34
3STHSTTYPE 14:44:36:364826320 GMT j9mm.135 - Exclusive access: exclusiveaccessms=0.059 meanexclusiveaccessms=0.059 threads=0 lastthreadtid=0x000000003296A380 beatenbyotherthread=0
3STHSTTYPE 14:44:36:364811726 GMT j9mm.469 - Allocation failure cycle start: newspace=693760/1835008 oldspace=0/6553600 loa=0/65536 requestedbytes=16
3STHSTTYPE 14:44:36:364308523 GMT j9mm.133 - Allocation failure start: newspace=693760/1835008 oldspace=0/6553600 loa=0/65536 requestedbytes=16
3STHSTTYPE 14:44:36:164687945 GMT j9mm.51 - SystemGC end: newspace=706048/1835008 oldspace=944/6553600 loa=0/655363STHSTTYPE 14:44:36:164014460 GMT j9mm.475 - GlobalGC end: workstackoverflow=0 overflowcount=0 memory=706992/8388608
3STHSTTYPE 14:44:36:162284867 GMT j9mm.90 - GlobalGC collect complete
3STHSTTYPE 14:44:36:161296664 GMT j9mm.137 - Compact end: bytesmoved=0
3STHSTTYPE 14:44:36:046987320 GMT j9mm.136 - Compact start: reason=low free space (less than 4%)
3STHSTTYPE 14:44:36:046458726 GMT j9mm.57 - Sweep end
3STHSTTYPE 14:44:36:040465664 GMT j9mm.56 - Sweep start
... there is more
That's the question... Why does this happen with Java 11 and not Java 8. The Java 8 javacore files show no such issue (Java8\Snap-after 20 minutes\javacore.20240826.124744.16908357.0015.txt): 1STHEAPTOTAL Total memory: 8388608 (0x0000000000800000) 1STHEAPINUSE Total memory in use: 3418984 (0x0000000000342B68) 1STHEAPFREE Total memory free: 4969624 (0x00000000004BD498)
The test performed for both Java 8 and Java 11 is identical.
Probably not it, but to rule it out you can try running Java 11 with -Xshareclasses:none
. It's enabled by default in Java 11 but not 8.
Another difference I noticed looking at https://www.ibm.com/support/pages/semeru-runtimes-migration-guide "The count field is removed from String, which means that substring() with a beginIndex value of zero can no longer be shared. If performance is significantly degraded because char[] data is now being copied, try re-implementing the code to avoid copying the String data."
I'll add -Xshareclasses:none and test. I'll then investigate my use of substring(). I'll post my findings tomorrow. Thank you.
See https://eclipse.dev/openj9/docs/djavalangstringsubstringnocopy/
You can run Java 8 with -Djava.lang.string.substring.nocopy=true
to see if that introduces OOM.
Just curious, is the Java 8 that you are using to test also 64-bit as well?
Yes, Java 8 is also 64-bit . I'll post the exact version tomorrow morning.
It's 8.0.7.20, it's in the diagnostic files we have. You could try a later version of IBM Java 8, such as 8.0.8.25, which ~I think~ lines up with the 11.0.23 you are using. If that one has the OOM problem then we introduced a bug between the two releases.
Will do.
pshipton: I ran with -Xshareclasses:none, but as you suspected, it failed with OutOfMemoryError. pshipton: I am using substring(0, nnn) in the application.
joransiu: My current Java 8 version is: java version "1.8.0_351" Java(TM) SE Runtime Environment (build 8.0.7.20 - pmz6480sr7fp20-20221020_01(SR7 FP20)) IBM J9 VM (build 2.9, JRE 1.8.0 z/OS s390x-64-Bit Compressed References 20220929_37824 (JIT enabled, AOT enabled) OpenJ9 - 02180fe OMR - 48fc32a IBM - bf759bf) JCL - 20220922_01 based on Oracle jdk8u351-b10
Items to-do and report back on:
Running with my current (SR7 FP20) Java 8 version with -Djava.lang.string.substring.nocopy=true did not cause an OutOfMemoryError.
We installed the latest Java 8. Here's the full version: java version "1.8.0_421" Java(TM) SE Runtime Environment (build 8.0.8.30 - pmz6480sr8fp30-20240801_01(SR8 FP30)) IBM J9 VM (build 2.9, JRE 1.8.0 z/OS s390x-64-Bit Compressed References 20240703_73934 (JIT enabled, AOT enabled) OpenJ9 - 177ad46 OMR - e74814c IBM - 3c87141) JCL - 20240731_02 based on Oracle jdk8u421-b09
I ran my standard test using it and encountered no issues. I added a counter to see how often substring(0, nnn) is being called and it's minimal. Only 299 times in 30 minutes; in fact, 297 of those were done at startup and only 2 after that with the application under heavy load. I assume that proves substring(0, nnn) isn't the cause; I will not pursue changing to another parsing technique unless you disagree.
I think I've tried all the recommendations. What do you recommend next?
Are you able to open a case with IBM support?
I can open one, but my experience with IBM and Java bugs is not good. I opened one on November 16, 2018, and we still haven't gotten a fix.
Is the jnicsup.cpp:1964 a clue? I ran with -Xcheck:jni and it didn't show any errors. But why is jnicsup.cpp:1964 so high? Or am I reading that wrong?
It indicates there are a lot of JNI local references, but I don't know why that is. I assume you have some JNI code, if you want to share it I can take a look to see if there is anything obvious. Except for (1) you tried mismatched versions of jdk8/jdk11, and (2) differences to support the jdk11 spec, the Virtual Machine/JNI implementation/Enable3164Interoperability (OpenJ9) is mostly the same between jdk8 and jdk11. Maybe the difference is something to do with modularity, such as new modules being continuously created each time you enter Java?
I made an attempt to open your system core file with MAT, but it didn't open and I'm not sure how to fix it. MAT on a system core enables looking at the content of objects/char[]/byte[]. If we can figure out what the JNI local references are, that would help figure out where they come from. I think your best bet is to open an IBM Support case, they will know how to get MAT working, and have more experience looking at this type of problem.
@ForceRs : jnicsup.cpp:1964 is a clue. It means the application has lots of JNI local references. The next question would be who was requesting all those JNI local references.
Please open a SalesForce case and let me know the case number. I will make sure someone takes care of it.
@manqingl: Will do. Going to re-inspect all JNI-related code first.
@pshipton : You can open the provided heap dumps (*.phd) files in eclipse using MAT after installing the IBM DTFJ feature for IBM dumps. Unfortunately, my phd files don't seem to provide memory contents. That is, the Inspector's "Value" tab is empty when I look at the char[] arrays. I'm going to look into producing hprof files on z/OS. I see jmap in Java 11's bin folder.
I installed the IBM feature, but still got an error opening the core. It could be the dtfj component is out of date. I expect service will figure it out.
jmap/OpenJ9 doesn't support hprof.
The Customer has created Case TS017177781 for this issue. I have reviewed the documentation and created Skills Case TS017199872 on the Java Technology Mission.
For the record, I was trying to use the Eclipse MAT, but we need to use MAT from https://www.ibm.com/support/pages/eclipse-memory-analyzer-tool-dtfj-and-ibm-extensions
FYI : we have identified a bug in the implementation of JNI API IBMZOS_NewStringPlatform where the references are not getting cleared before returning back to the caller in Semeru 11 but are getting cleared in 8. The JCL team is preparing a test patch to verify the fix. @yathamravali FYI.
@pshipton , There is an issue with MAT (on Citrix). Please see my (internal) update in the Case describing how to use the fix for this. Also I've updated the Case (11th September) with the most common char[] contents, which match the string that the Customer later told me they used to create the OOM with their testcase. @manqingl , thanks, I've uploaded the testfix to the Case for the Customer to test.
java version "11.0.23" 2024-04-16 IBM Semeru Runtime Certified Edition for z/OS 11.0.23.0 (build 11.0.23+9) IBM J9 VM 11.0.23.0 (build z/OS-Release-11.0.23.0-b02, JRE 11 z/OS s390x-64-Bit Compressed References 20240528_334 (JIT enabled, AOT enabled) OpenJ9 - aa39565b36f OMR - e4ae704bb5e IBM - 3c87141 JCL - b8bbe79f173 based on jdk-11.0.23+9)
Background We have a small Java application that interacts with a Derby DB. Basically, we have some HLASM that interacts with COBOL; the COBOL interacts with Java via JNI. The COBOL JNI is native; that is, it calls CallStaticBooleanMethod, not INVOKE. One thing that may be considered unusual is that we use 64-bit Java to interact with 31-bit COBOL via -XX:+Enable3164Interoperability.
Summary of problem When running our small Java application, it runs out of memory after about 15 minutes. This happens in Java 11, not Java 8. I'm unable to test Java 17, as our z/OS is too old to support it. I say small Java application because it requires a tiny heap of -Xms4m -Xmx8m. I am running with REGION=0M; thus, MEMLIMIT should not be a factor. Using the IBM HeapAnalyzer, I see normal stuff and then about 70,000 root objects, mostly of type byte[] and char[].
Diagnostics I saw a somewhat similar issue posted here and followed its advice to provide some initial diagnostics. I can reproduce this issue at will, so further diagnostics can be provided.
I suspected a JNI issue, so I first ran with -Xcheck:jni. This produced no output other than the confirmation line proving it was accepted.
I next ran using -Xcheck:memory:quick,noscan,callsite=500. The file for this output is sdsf.output.callsite-J11.txt. If I read the file correctly, and I may not be, then it seems to indicate a leak in jnicsup.cpp at line 1964. Each line below is a different point in time. The first is the first one, the last is the last one, and the middle one is an arbitrarily time.
File sdsf.output.callsite-J8.txt contains output for the exact same test run using Java 8 with -Xcheck:memory:quick,noscan,callsite=500. That run appears to be fine. sdsf.output.callsite-J8.txt's line wrapping is kind of messed up -- sorry.
One other thing to note is that two JVMs are spawned in our use case. One JVM is relatively idle. The other is busy. Since we only provide one set of exports (i.e., COBJVMINITOPTIONS=-Xms4m -Xmx8m -Xquickstart -XX:+Enable3164Interoperability -Xcheck:memory:quick,noscan,callsite=500), both the idle and the busy JVM produce diagnostics. If you look at just the output for jnicsup.cpp:1964, you'll see that it grows, but then appears to drop suddenly; that is not the case. What you're seeing is the idle JVM interspersed with the busy JVM.