Memory leak/OutOfMemoryError when using 11.0.23+9 IBM Semeru Runtime Certified Edition for z/OS

ForceRs commented 3 months ago

java version "11.0.23" 2024-04-16 IBM Semeru Runtime Certified Edition for z/OS 11.0.23.0 (build 11.0.23+9) IBM J9 VM 11.0.23.0 (build z/OS-Release-11.0.23.0-b02, JRE 11 z/OS s390x-64-Bit Compressed References 20240528_334 (JIT enabled, AOT enabled) OpenJ9 - aa39565b36f OMR - e4ae704bb5e IBM - 3c87141 JCL - b8bbe79f173 based on jdk-11.0.23+9)

Background We have a small Java application that interacts with a Derby DB. Basically, we have some HLASM that interacts with COBOL; the COBOL interacts with Java via JNI. The COBOL JNI is native; that is, it calls CallStaticBooleanMethod, not INVOKE. One thing that may be considered unusual is that we use 64-bit Java to interact with 31-bit COBOL via -XX:+Enable3164Interoperability.

Summary of problem When running our small Java application, it runs out of memory after about 15 minutes. This happens in Java 11, not Java 8. I'm unable to test Java 17, as our z/OS is too old to support it. I say small Java application because it requires a tiny heap of -Xms4m -Xmx8m. I am running with REGION=0M; thus, MEMLIMIT should not be a factor. Using the IBM HeapAnalyzer, I see normal stuff and then about 70,000 root objects, mostly of type byte[] and char[].

Diagnostics I saw a somewhat similar issue posted here and followed its advice to provide some initial diagnostics. I can reproduce this issue at will, so further diagnostics can be provided.

I suspected a JNI issue, so I first ran with -Xcheck:jni. This produced no output other than the confirmation line proving it was accepted.

I next ran using -Xcheck:memory:quick,noscan,callsite=500. The file for this output is sdsf.output.callsite-J11.txt. If I read the file correctly, and I may not be, then it seems to indicate a leak in jnicsup.cpp at line 1964. Each line below is a different point in time. The first is the first one, the last is the last one, and the middle one is an arbitrarily time.

  total alloc   | total freed   | delta alloc   | delta freed   | high water    | largest
  blocks| bytes | blocks| bytes | blocks| bytes | blocks| bytes | blocks| bytes | bytes | num   | callsite
 -------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+------------
       6    1888       6    1888       6    1888       6    1888       6    1888     448       3 /u/jenkins/workspace/Build_JDK11_s390x_zos_Release/openj9/runtime/vm/jnicsup.cpp:1964
     154   61560      76   28216       1     448       0       0      78   33344    2008       9 /u/jenkins/workspace/Build_JDK11_s390x_zos_Release/openj9/runtime/vm/jnicsup.cpp:1964
    7657 3422104    2579 1149560       0       0       0       0    5078 2272544    2008       9 /u/jenkins/workspace/Build_JDK11_s390x_zos_Release/openj9/runtime/vm/jnicsup.cpp:1964

File sdsf.output.callsite-J8.txt contains output for the exact same test run using Java 8 with -Xcheck:memory:quick,noscan,callsite=500. That run appears to be fine. sdsf.output.callsite-J8.txt's line wrapping is kind of messed up -- sorry.

One other thing to note is that two JVMs are spawned in our use case. One JVM is relatively idle. The other is busy. Since we only provide one set of exports (i.e., COBJVMINITOPTIONS=-Xms4m -Xmx8m -Xquickstart -XX:+Enable3164Interoperability -Xcheck:memory:quick,noscan,callsite=500), both the idle and the busy JVM produce diagnostics. If you look at just the output for jnicsup.cpp:1964, you'll see that it grows, but then appears to drop suddenly; that is not the case. What you're seeing is the idle JVM interspersed with the busy JVM.

dmitripivkine commented 3 months ago

I would recommend to open a Sales Force case here : https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues

The following link explains how to collect MustGather data:

https://www.ibm.com/support/pages/mustgather-read-first-runtimes-java-technology

@manqingl FYI

dmitripivkine commented 3 months ago

If you by some reason can't / don't want open case with Service you can collect system cores, javacores, GC verbose logs from Java 11 and Java 8 runs both and provide them to us directly (myself or @pshipton). Files for working case Java 8 can be collected at vmstop event. From my experience Java 11 takes more object heap space for initialization, we have increased -Xmx for a few very tiny tests.

dmitripivkine commented 3 months ago

BTW is it working with slightly larger -Xmx (stabilized there) or there is out-of-control memory leak?

ForceRs commented 3 months ago

I'll look into the Sales Force case. BTW, https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues fails with message: "Hmm. We’re having trouble finding that site." The MustGather link is okay.

It is an out-of-control memory leak. It runs about twice as long if I double the heap.

manqingl commented 3 months ago

@ForceRs I just opened it (https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues) and it was fine. Are you an IBM internal?

ForceRs commented 3 months ago

No. Not IBM internal. We're a vendor and IBM Business Partner.

ForceRs commented 3 months ago

I tried to go to https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues from all my browsers, both from work and home PCs, but all failed (even just https://w3.ibm.com fails). I also tried to access it from my phone with no Wi-Fi connection (thinking some kind of filtering was preventing the connection), but no go. II'll send stuff to dmitripivkine, as suggested Aug 22, 2024, 12:41 PM.

gacholio commented 3 months ago

Based on the call site information, this appears to be related to JNI local references (particularly JNI reference frames). I can see no reason why this would differ between JDK levels.

ForceRs commented 3 months ago

I am actively gathering the requested information. We hit a speed bump with Transaction dump (TDump) generation. Working with our Systems Programmer to resolve it. I hope to provide files 26Aug2024.

ForceRs commented 2 months ago

dmitripivkine, I have the requested diagnostic files. I've zipped them up into a 397MB file. How do I privately share the file with you? I'd attach it here, but it's large and there's some SDSF output we'd rather not make public.

dmitripivkine commented 2 months ago

You can use any file sharing service and send me the link over email Dmitri_Pivkine@ca.ibm.com. Alternatively if you use Slack you can try to find me in Eclipse OpenJ9 workspace and download file there (I am not sure about limits).

ForceRs commented 2 months ago

Email with download instructions sent.

pshipton commented 2 months ago

@joransiu fyi

pshipton commented 2 months ago

This is a heap space out of memory. Have you tried looking at the heapdump files with MAT (https://eclipse.dev/mat/) to see what's consuming the heap and growing?

ForceRs commented 2 months ago

I used the IBM Heap Analyzer. It didn't report any leaks. It did warn about many root objects. If I double the heap size in Java 11, it takes twice as long to run out of memory. It never runs out of memory using Java 8.

pshipton commented 2 months ago

jnicsup.cpp:1964 is jniLocalReferences.

From EOJ-OutOfMemory/javacore.20240826.094436.16908590.0015.txt

1TISIGINFO Dump Event "systhrow" (00040000) Detail "java/lang/OutOfMemoryError" "Java heap space" received z/OS-Release-11.0.23.0

1STHEAPTOTAL   Total memory:                     8388608 (0x0000000000800000)
1STHEAPINUSE   Total memory in use:              7674288 (0x00000000007519B0)
1STHEAPFREE    Total memory free:                 714320 (0x00000000000AE650)

1STGCHTYPE     GC History
3STHSTTYPE     14:44:36:885085101 GMT j9mm.100 -   J9AllocateObject() returning NULL! 16 bytes requested for object of class 00000000326FA900 from memory space 'Generational' id=000000504CC58208
3STHSTTYPE     14:44:36:884238429 GMT j9mm.134 -   Allocation failure end: newspace=718416/1835008 oldspace=0/6553600 loa=0/65536
3STHSTTYPE     14:44:36:883775257 GMT j9mm.470 -   Allocation failure cycle end: newspace=718416/1835008 oldspace=0/6553600 loa=0/65536
3STHSTTYPE     14:44:36:883123632 GMT j9mm.475 -   GlobalGC end: workstackoverflow=0 overflowcount=0 memory=718416/8388608
3STHSTTYPE     14:44:36:881287429 GMT j9mm.90 -   GlobalGC collect complete
3STHSTTYPE     14:44:36:880396273 GMT j9mm.137 -   Compact end: bytesmoved=0
3STHSTTYPE     14:44:36:738973992 GMT j9mm.136 -   Compact start: reason=low free space (less than 4%)
3STHSTTYPE     14:44:36:738437304 GMT j9mm.57 -   Sweep end
3STHSTTYPE     14:44:36:732883539 GMT j9mm.56 -   Sweep start
3STHSTTYPE     14:44:36:732435960 GMT j9mm.94 -   Class unloading end: classloadersunloaded=0 classesunloaded=0
3STHSTTYPE     14:44:36:732308773 GMT j9mm.60 -   Class unloading start
3STHSTTYPE     14:44:36:730236132 GMT j9mm.55 -   Mark end
3STHSTTYPE     14:44:36:616182070 GMT j9mm.54 -   Mark start
3STHSTTYPE     14:44:36:613724164 GMT j9mm.474 -   GlobalGC start: globalcount=35
3STHSTTYPE     14:44:36:611503351 GMT j9mm.475 -   GlobalGC end: workstackoverflow=0 overflowcount=0 memory=718416/8388608
3STHSTTYPE     14:44:36:609556632 GMT j9mm.90 -   GlobalGC collect complete
3STHSTTYPE     14:44:36:608519429 GMT j9mm.137 -   Compact end: bytesmoved=16976
3STHSTTYPE     14:44:36:488491039 GMT j9mm.136 -   Compact start: reason=low free space (less than 4%)
3STHSTTYPE     14:44:36:487928757 GMT j9mm.57 -   Sweep end
3STHSTTYPE     14:44:36:480018789 GMT j9mm.56 -   Sweep start
3STHSTTYPE     14:44:36:479222132 GMT j9mm.55 -   Mark end
3STHSTTYPE     14:44:36:368230273 GMT j9mm.54 -   Mark start
3STHSTTYPE     14:44:36:365507882 GMT j9mm.474 -   GlobalGC start: globalcount=34
3STHSTTYPE     14:44:36:364826320 GMT j9mm.135 -   Exclusive access: exclusiveaccessms=0.059 meanexclusiveaccessms=0.059 threads=0 lastthreadtid=0x000000003296A380 beatenbyotherthread=0
3STHSTTYPE     14:44:36:364811726 GMT j9mm.469 -   Allocation failure cycle start: newspace=693760/1835008 oldspace=0/6553600 loa=0/65536 requestedbytes=16
3STHSTTYPE     14:44:36:364308523 GMT j9mm.133 -   Allocation failure start: newspace=693760/1835008 oldspace=0/6553600 loa=0/65536 requestedbytes=16
3STHSTTYPE     14:44:36:164687945 GMT j9mm.51 -   SystemGC end: newspace=706048/1835008 oldspace=944/6553600 loa=0/655363STHSTTYPE     14:44:36:164014460 GMT j9mm.475 -   GlobalGC end: workstackoverflow=0 overflowcount=0 memory=706992/8388608
3STHSTTYPE     14:44:36:162284867 GMT j9mm.90 -   GlobalGC collect complete
3STHSTTYPE     14:44:36:161296664 GMT j9mm.137 -   Compact end: bytesmoved=0
3STHSTTYPE     14:44:36:046987320 GMT j9mm.136 -   Compact start: reason=low free space (less than 4%)
3STHSTTYPE     14:44:36:046458726 GMT j9mm.57 -   Sweep end
3STHSTTYPE     14:44:36:040465664 GMT j9mm.56 -   Sweep start
... there is more

ForceRs commented 2 months ago

That's the question... Why does this happen with Java 11 and not Java 8. The Java 8 javacore files show no such issue (Java8\Snap-after 20 minutes\javacore.20240826.124744.16908357.0015.txt): 1STHEAPTOTAL Total memory: 8388608 (0x0000000000800000) 1STHEAPINUSE Total memory in use: 3418984 (0x0000000000342B68) 1STHEAPFREE Total memory free: 4969624 (0x00000000004BD498)

The test performed for both Java 8 and Java 11 is identical.

pshipton commented 2 months ago

Probably not it, but to rule it out you can try running Java 11 with -Xshareclasses:none. It's enabled by default in Java 11 but not 8.

Another difference I noticed looking at https://www.ibm.com/support/pages/semeru-runtimes-migration-guide "The count field is removed from String, which means that substring() with a beginIndex value of zero can no longer be shared. If performance is significantly degraded because char[] data is now being copied, try re-implementing the code to avoid copying the String data."

ForceRs commented 2 months ago

I'll add -Xshareclasses:none and test. I'll then investigate my use of substring(). I'll post my findings tomorrow. Thank you.

pshipton commented 2 months ago

See https://eclipse.dev/openj9/docs/djavalangstringsubstringnocopy/ You can run Java 8 with -Djava.lang.string.substring.nocopy=true to see if that introduces OOM.

joransiu commented 2 months ago

Just curious, is the Java 8 that you are using to test also 64-bit as well?

ForceRs commented 2 months ago

Yes, Java 8 is also 64-bit . I'll post the exact version tomorrow morning.

pshipton commented 2 months ago

It's 8.0.7.20, it's in the diagnostic files we have. You could try a later version of IBM Java 8, such as 8.0.8.25, which ~I think~ lines up with the 11.0.23 you are using. If that one has the OOM problem then we introduced a bug between the two releases.

ForceRs commented 2 months ago

Will do.

ForceRs commented 2 months ago

pshipton: I ran with -Xshareclasses:none, but as you suspected, it failed with OutOfMemoryError. pshipton: I am using substring(0, nnn) in the application.

joransiu: My current Java 8 version is: java version "1.8.0_351" Java(TM) SE Runtime Environment (build 8.0.7.20 - pmz6480sr7fp20-20221020_01(SR7 FP20)) IBM J9 VM (build 2.9, JRE 1.8.0 z/OS s390x-64-Bit Compressed References 20220929_37824 (JIT enabled, AOT enabled) OpenJ9 - 02180fe OMR - 48fc32a IBM - bf759bf) JCL - 20220922_01 based on Oracle jdk8u351-b10

Items to-do and report back on:

Run with current (SR7 FP20) Java 8 version with -Djava.lang.string.substring.nocopy=true to try to reproduce the OutOfMemoryError.
Download and install the latest Java 8 (SR8 FP30) and try to reproduce OutOfMemoryError.
Implement a quick-and-dirty code change (via chicken switch) to not use substring(0, nnn).

ForceRs commented 2 months ago

Running with my current (SR7 FP20) Java 8 version with -Djava.lang.string.substring.nocopy=true did not cause an OutOfMemoryError.

ForceRs commented 2 months ago

We installed the latest Java 8. Here's the full version: java version "1.8.0_421" Java(TM) SE Runtime Environment (build 8.0.8.30 - pmz6480sr8fp30-20240801_01(SR8 FP30)) IBM J9 VM (build 2.9, JRE 1.8.0 z/OS s390x-64-Bit Compressed References 20240703_73934 (JIT enabled, AOT enabled) OpenJ9 - 177ad46 OMR - e74814c IBM - 3c87141) JCL - 20240731_02 based on Oracle jdk8u421-b09

I ran my standard test using it and encountered no issues. I added a counter to see how often substring(0, nnn) is being called and it's minimal. Only 299 times in 30 minutes; in fact, 297 of those were done at startup and only 2 after that with the application under heavy load. I assume that proves substring(0, nnn) isn't the cause; I will not pursue changing to another parsing technique unless you disagree.

I think I've tried all the recommendations. What do you recommend next?

pshipton commented 2 months ago

Are you able to open a case with IBM support?

ForceRs commented 2 months ago

I can open one, but my experience with IBM and Java bugs is not good. I opened one on November 16, 2018, and we still haven't gotten a fix.

Is the jnicsup.cpp:1964 a clue? I ran with -Xcheck:jni and it didn't show any errors. But why is jnicsup.cpp:1964 so high? Or am I reading that wrong?

pshipton commented 2 months ago

It indicates there are a lot of JNI local references, but I don't know why that is. I assume you have some JNI code, if you want to share it I can take a look to see if there is anything obvious. Except for (1) you tried mismatched versions of jdk8/jdk11, and (2) differences to support the jdk11 spec, the Virtual Machine/JNI implementation/Enable3164Interoperability (OpenJ9) is mostly the same between jdk8 and jdk11. Maybe the difference is something to do with modularity, such as new modules being continuously created each time you enter Java?

pshipton commented 2 months ago

I made an attempt to open your system core file with MAT, but it didn't open and I'm not sure how to fix it. MAT on a system core enables looking at the content of objects/char[]/byte[]. If we can figure out what the JNI local references are, that would help figure out where they come from. I think your best bet is to open an IBM Support case, they will know how to get MAT working, and have more experience looking at this type of problem.

manqingl commented 2 months ago

@ForceRs : jnicsup.cpp:1964 is a clue. It means the application has lots of JNI local references. The next question would be who was requesting all those JNI local references.

Please open a SalesForce case and let me know the case number. I will make sure someone takes care of it.

ForceRs commented 2 months ago

@manqingl: Will do. Going to re-inspect all JNI-related code first.

ForceRs commented 2 months ago

@pshipton : You can open the provided heap dumps (*.phd) files in eclipse using MAT after installing the IBM DTFJ feature for IBM dumps. Unfortunately, my phd files don't seem to provide memory contents. That is, the Inspector's "Value" tab is empty when I look at the char[] arrays. I'm going to look into producing hprof files on z/OS. I see jmap in Java 11's bin folder.

pshipton commented 2 months ago

I installed the IBM feature, but still got an error opening the core. It could be the dtfj component is out of date. I expect service will figure it out.

jmap/OpenJ9 doesn't support hprof.

flavell commented 2 months ago

The Customer has created Case TS017177781 for this issue. I have reviewed the documentation and created Skills Case TS017199872 on the Java Technology Mission.

pshipton commented 2 months ago

For the record, I was trying to use the Eclipse MAT, but we need to use MAT from https://www.ibm.com/support/pages/eclipse-memory-analyzer-tool-dtfj-and-ibm-extensions

manqingl commented 2 months ago

FYI : we have identified a bug in the implementation of JNI API IBMZOS_NewStringPlatform where the references are not getting cleared before returning back to the caller in Semeru 11 but are getting cleared in 8. The JCL team is preparing a test patch to verify the fix. @yathamravali FYI.

flavell commented 2 months ago

@pshipton , There is an issue with MAT (on Citrix). Please see my (internal) update in the Case describing how to use the fix for this. Also I've updated the Case (11th September) with the most common char[] contents, which match the string that the Customer later told me they used to create the OOM with their testcase. @manqingl , thanks, I've uploaded the testfix to the Case for the Customer to test.

ForceRs commented 2 months ago

IBM has resolved this issue. It will be in the November 2024 release. See APAR. Thank you so much for all your help, and especially for the swift resolution!

eclipse-openj9 / openj9

Memory leak/OutOfMemoryError when using 11.0.23+9 IBM Semeru Runtime Certified Edition for z/OS #20043