eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

Regression since 0.46. Unhandled exception - Type=Bus error vmState=0x000501ff #20239

Closed krismarc closed 1 day ago

krismarc commented 2 weeks ago

We are getting following error while using OpenJ9 from IBM's Semeru runtime. We have tested the newest minor release (0.46.1) and this yet happens. Basically the problem started to happen after 0.45 OpenJ9 release. 0.45 was yet fine.

$ ./java --version
openjdk 17.0.12 2024-07-16
IBM Semeru Runtime Open Edition 17.0.12.1 (build 17.0.12+7)
Eclipse OpenJ9 VM 17.0.12.1 (build openj9-0.46.1, JRE 17 Linux amd64-64-Bit Compressed References 20240716_840 (JIT enabled, AOT enabled)
OpenJ9   - 4760d5d320
OMR      - 840a9adba
JCL      - e70fd1fd64a based on jdk-17.0.12+7)
ERR #0: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x92fea5) [0x7f2b01445ea5]
ERR #1: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x93c250) [0x7f2b01452250]
ERR #2: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x1633b9) [0x7f2b00c793b9]
ERR #3: /home/vcap/app/.java/lib/default/libj9prt29.so(+0x29978) [0x7f2b019f7978]
ERR #4: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f2b02133520]
ERR #5: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x4a43b9) [0x7f2b00fba3b9]
ERR #6: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x4a4447) [0x7f2b00fba447]
ERR #7: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x4a06bc) [0x7f2b00fb66bc]
ERR #8: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x3bc751) [0x7f2b00ed2751]
ERR #9: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x3bdf61) [0x7f2b00ed3f61]
ERR #10: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x32abfc) [0x7f2b00e40bfc]
ERR #11: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x3a7aaa) [0x7f2b00ebdaaa]
ERR #12: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x3affa7) [0x7f2b00ec5fa7]
ERR #13: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x697c1e) [0x7f2b011adc1e]
ERR #14: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x3a7002) [0x7f2b00ebd002]
ERR #15: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x7adfd5) [0x7f2b012c3fd5]
ERR #16: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x7afcf3) [0x7f2b012c5cf3]
ERR #17: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x59fc45) [0x7f2b010b5c45]
ERR #18: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x1795ff) [0x7f2b00c8f5ff]
ERR #19: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x17a631) [0x7f2b00c90631]
ERR #20: /home/vcap/app/.java/lib/default/libj9prt29.so(+0x2a3c9) [0x7f2b019f83c9]
ERR #21: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x178172) [0x7f2b00c8e172]
ERR #22: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x1784b8) [0x7f2b00c8e4b8]
ERR #23: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x1773db) [0x7f2b00c8d3db]
ERR #24: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x177742) [0x7f2b00c8d742]
ERR #25: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x1777f2) [0x7f2b00c8d7f2]
ERR #26: /home/vcap/app/.java/lib/default/libj9prt29.so(+0x2a3c9) [0x7f2b019f83c9]
ERR #27: /home/vcap/app/.java/lib/default/libj9jit29.so(+0x177bbb) [0x7f2b00c8dbbb]
ERR #28: /home/vcap/app/.java/lib/default/libj9thr29.so(+0xb3a3) [0x7f2b019c03a3]
ERR #29: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2b02185ac3]
ERR Unhandled exception
ERR Type=Bus error vmState=0x000501ff
ERR J9Generic_Signal_Number=00000028 Signal_Number=00000007 Error_Value=00000000 Signal_Code=00000002
ERR Handler1=00007F2B01AA3620 Handler2=00007F2B019F7740 InaccessibleAddress=00007F2A95F15010
ERR RDI=00007F2AFC1A4C58 RSI=0000000000000000 RAX=00007F2A95F15000 RBX=00007F29569B5A74
ERR RCX=0000000000000000 RDX=0000000000000100 R8=0000000000000001 R9=0000000000000000
ERR R10=00007F29569B5A74 R11=00007F28C480F740 R12=0000000000013E8D R13=00000000000000B8
ERR R14=0000000000000000 R15=0000000000000001
ERR RIP=00007F2B00FBA3B9 GS=0000 FS=0000 RSP=00007F2AEE6C1250
ERR EFlags=0000000000010206 CS=0033 RBP=00007F2AEDEC7010 ERR=0000000000000006
ERR TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=00007F2A95F15010
ERR xmm0=0000000000000000 (f: 0.000000, d: 0.000000e+00)
ERR xmm1=0000000040e00000 (f: 1088421888.000000, d: 5.377519e-315)
ERR xmm2=00000000000000db (f: 219.000000, d: 1.082004e-321)
ERR xmm3=00000000000000d3 (f: 211.000000, d: 1.042479e-321)
ERR xmm4=0000000000000000 (f: 0.000000, d: 0.000000e+00)
ERR xmm5=00007f28c4810970 (f: 3296790784.000000, d: 6.907684e-310)
ERR xmm6=0000000002edf718 (f: 49149720.000000, d: 2.428319e-316)
ERR xmm7=00007f28c47f6af0 (f: 3296684800.000000, d: 6.907684e-310)
ERR xmm8=ffffff00ffffff00 (f: 4294967040.000000, d: -nan)
ERR xmm9=7575757575757575 (f: 1970632064.000000, d: 6.444099e+257)
ERR xmm10=757575767575757d (f: 1970632064.000000, d: 6.444104e+257)
ERR xmm11=0000015200000151 (f: 337.000000, d: 7.172346e-312)
ERR xmm12=0000013d00000140 (f: 320.000000, d: 6.726727e-312)
ERR xmm13=000001380000013f (f: 319.000000, d: 6.620627e-312)
ERR xmm14=0000000008001800 (f: 134223872.000000, d: 6.631540e-316)
ERR xmm15=000001420000013b (f: 315.000000, d: 6.832826e-312)
ERR Module=/home/vcap/app/.java/lib/default/libj9jit29.so
ERR Module_base_address=00007F2B00B16000
ERR Method_being_compiled=com/mongodb/internal/async/function/RetryState.doAdvanceOrThrow(Ljava/lang/Throwable;Ljava/util/function/BiFunction;Ljava/util/function/BiPredicate;Z)V
ERR Target=2_90_20240716_818 (Linux 5.15.0-113-generic)
ERR CPU=amd64 (16 logical CPUs) (0xbc29df000 RAM)
ERR ----------- Stack Backtrace -----------
ERR _ZN12TR_IProfiler17findOrCreateEntryEimb+0x179 (0x00007F2B00FBA3B9 [libj9jit29.so+0x4a43b9])
ERR _ZN12TR_IProfiler15profilingSampleEmmbbj.localalias+0x47 (0x00007F2B00FBA447 [libj9jit29.so+0x4a4447])
ERR _ZN12TR_IProfiler12setCallCountEP20TR_OpaqueMethodBlockiiPN2TR11CompilationE.localalias+0x1c (0x00007F2B00FB66BC [libj9jit29.so+0x4a06bc])
ERR _ZN21TR_J9EstimateCodeSize20realEstimateCodeSizeEP13TR_CallTargetP12TR_CallStackbRN2TR6RegionE+0x1861 (0x00007F2B00ED2751 [libj9jit29.so+0x3bc751])
ERR _ZN21TR_J9EstimateCodeSize16estimateCodeSizeEP13TR_CallTargetP12TR_CallStackb.localalias+0x41 (0x00007F2B00ED3F61 [libj9jit29.so+0x3bdf61])
ERR _ZN19TR_EstimateCodeSize17calculateCodeSizeEP13TR_CallTargetP12TR_CallStackb+0xac (0x00007F2B00E40BFC [libj9jit29.so+0x32abfc])
ERR _ZN28TR_MultipleCallTargetInliner13weighCallSiteEP12TR_CallStackP11TR_CallSitebb+0x1fa (0x00007F2B00EBDAAA [libj9jit29.so+0x3a7aaa])
ERR _ZN28TR_MultipleCallTargetInliner17inlineCallTargetsEPN2TR20ResolvedMethodSymbolEP12TR_CallStackP24TR_InnerPreexistenceInfo+0x1757 (0x00007F2B00EC5FA7 [libj9jit29.so+0x3affa7])
ERR _ZN14TR_InlinerBase15performInliningEPN2TR20ResolvedMethodSymbolE+0xae (0x00007F2B011ADC1E [libj9jit29.so+0x697c1e])
ERR _ZN10TR_Inliner7performEv+0x142 (0x00007F2B00EBD002 [libj9jit29.so+0x3a7002])
ERR _ZN3OMR9Optimizer19performOptimizationEPK20OptimizationStrategyiii.localalias+0x855 (0x00007F2B012C3FD5 [libj9jit29.so+0x7adfd5])
ERR _ZN3OMR9Optimizer8optimizeEv+0x1b3 (0x00007F2B012C5CF3 [libj9jit29.so+0x7afcf3])
ERR _ZN3OMR11Compilation7compileEv+0xa25 (0x00007F2B010B5C45 [libj9jit29.so+0x59fc45])
ERR _ZN2TR28CompilationInfoPerThreadBase7compileEP10J9VMThreadPNS_11CompilationEP17TR_ResolvedMethodR11TR_J9VMBaseP19TR_OptimizationPlanRKNS_16SegmentAllocatorE+0x4bf (0x00007F2B00C8F5FF [libj9jit29.so+0x1795ff])
ERR _ZN2TR28CompilationInfoPerThreadBase14wrappedCompileEP13J9PortLibraryPv+0x381 (0x00007F2B00C90631 [libj9jit29.so+0x17a631])
ERR omrsig_protect+0x239 (0x00007F2B019F83C9 [libj9prt29.so+0x2a3c9])
ERR _ZN2TR28CompilationInfoPerThreadBase7compileEP10J9VMThreadP21TR_MethodToBeCompiledRN2J917J9SegmentProviderE+0x372 (0x00007F2B00C8E172 [libj9jit29.so+0x178172])
ERR _ZN2TR24CompilationInfoPerThread12processEntryER21TR_MethodToBeCompiledRN2J917J9SegmentProviderE+0x128 (0x00007F2B00C8E4B8 [libj9jit29.so+0x1784b8])
ERR _ZN2TR24CompilationInfoPerThread14processEntriesEv+0x35b (0x00007F2B00C8D3DB [libj9jit29.so+0x1773db])
ERR _ZN2TR24CompilationInfoPerThread3runEv+0x42 (0x00007F2B00C8D742 [libj9jit29.so+0x177742])
ERR _Z30protectedCompilationThreadProcP13J9PortLibraryPN2TR24CompilationInfoPerThreadE+0x82 (0x00007F2B00C8D7F2 [libj9jit29.so+0x1777f2])
ERR omrsig_protect+0x239 (0x00007F2B019F83C9 [libj9prt29.so+0x2a3c9])
ERR _Z21compilationThreadProcPv+0x17b (0x00007F2B00C8DBBB [libj9jit29.so+0x177bbb])
ERR thread_wrapper+0x163 (0x00007F2B019C03A3 [libj9thr29.so+0xb3a3])
ERR (0x00007F2B02185AC3 [libc.so.6+0x94ac3])
ERR clone+0x44 (0x00007F2B02216A04 [libc.so.6+0x125a04])
ERR ---------------------------------------
krismarc commented 2 weeks ago

I've opened it in both OpenJ9 and Semeru projects as it's not clear to me which one is the proper one. Once clarified we can clone one. https://github.com/eclipse-openj9/openj9/issues/20239 https://github.com/ibmruntimes/Semeru-Runtimes/issues/94

pshipton commented 2 weeks ago

@hzongaro pls take a look

hzongaro commented 2 weeks ago

@krismarc, do you have a core file and jitdump file you can share?

krismarc commented 2 weeks ago

@hzongaro, thx for quick response, how could I share it with you? I would prefer to not make it completely public. eg. your email or so.

Which one do you mean as core ? Javacore aka thread dump? https://www.ibm.com/docs/en/sdk-java-technology/8?topic=component-diagnostic-tools-data

We use this JVM to run Open Liberty. I get it by doing: server javadump defaultServer --include=thread or kill -3

I don't think I ever had to create a jitdump. Any quick guidance or article how to generate one? Meanwhile before you provide me any option for sharing files I will try to figure out this myself.

hzongaro commented 2 weeks ago

Which one do you mean as core ? Javacore aka thread dump?

I meant the System dump file - something named something like core.yyyymmdd.hhmmss.xxxx.0001.dmp

I would prefer to not make it completely public. eg. your email or so.

If you are able to upload it to Box, you can share a link with me privately via e-mail (see my profile) or in Slack https://openj9.slack.com/archives/DB23TAR50

I don't think I ever had to create a jitdump.

A jitdump file will usually be created if the JVM crashes during a JIT compilation, as was the case here, unless the default setting of the -Xdump option was overridden. Following the stack trace from the error log, do you see lines similar to the following?

JVMDUMP032I JVM requested System dump using '/somepath/core.20240925.171123.5897.0001.dmp' in response to an event
JVMDUMP010I System dump written to /somepath/core.20240925.171123.5897.0001.dmp
...
JVMDUMP051I JIT dump occurred in 'MainThread' thread 0x000000000E390E00
JVMDUMP010I JIT dump written to /somepath/jitdump.20240925.171123.5897.0004.dmp
JVMDUMP013I Processed dump event "gpf", detail "".
krismarc commented 2 weeks ago

If you are able to upload it to Box, you can share a link with me privately via e-mail (see my profile) or in Slack https://openj9.slack.com/archives/DB23TAR50

Slack is unfortunately blocked at our company (I will check what we can do here). However, we have internal solution for sharing bigger files. I could upload using it and you would receive the url via email. In separate email I would share a password for downloading from there. I sent to you a test one, so you can check if it's not blocked at IBM.

I meant the System dump file - something named something like core.yyyymmdd.hhmmss.xxxx.0001.dmp

I'm able to generate core dump on demand. Is this one also needed to be generated at the time of crash? or this one could be from anytime?

$ ~/app/.liberty/bin/server javadump defaultServer --include=system,heap,thread

Dumping server defaultServer.
Server defaultServer dump complete in /home/vcap/app/wlp/usr/servers/defaultServer/../../../../../dumps/javacore.20240926.144335.15.0001.txt.
Server defaultServer dump complete in /home/vcap/app/wlp/usr/servers/defaultServer/core.20240926.144335.15.0002.dmp.
Server defaultServer dump complete in /home/vcap/app/wlp/usr/servers/defaultServer/../../../../../dumps/heapdump.20240926.144337.15.0003.phd.

A jitdump file will usually be created if the JVM crashes during a JIT compilation, as was the case here, unless the default setting of the -Xdump option was overridden. Following the stack trace from the error log, do you see lines similar to the following?

It's containerized environment. Files do not survive the crash. I will try to trick it a bit, we have a script automatically uploading files to external storage in case of OutOfMemoryError events, I believe I could adjust it to gather required files (please just confirm if core needs to be from the crash event as well). We are using following options at the moment:

-Xdump:heap:defaults:file=../../../../../dumps/heapdump.%Y%m%d.%H%M%S.%pid.%seq.phd
-Xdump:java:defaults:file=../../../../../dumps/javacore.%Y%m%d.%H%M%S.%pid.%seq.txt
-Xdump:snap:defaults:file=../../../../../dumps/Snap.%Y%m%d.%H%M%S.%pid.%seq.trc
-Xdump:heap+java+snap:events=user
-Xdump:tool:events=systhrow,filter=java/lang/OutOfMemoryError,request=serial+exclusive,exec=../../../../.buildpack-diagnostics/killjava.sh

..any recommendations here? I think we are missing core one and jit dumps here. As it's mainly for OOMs so developer can troubleshoot memory leaks etc.

krismarc commented 2 weeks ago

Ok, I believe I will manage to gather all of them. It's all perfectly described here: https://www.ibm.com/docs/sl/sdk-java-technology/8?topic=options-xdump

..just let me know if propsoed way of sharing files would work for you.

pshipton commented 2 weeks ago

I'm able to generate core dump on demand. Is this one also needed to be generated at the time of crash? or this one could be from anytime?

It needs to be the one generated from the crash.

pshipton commented 2 weeks ago

Like you are doing -Xdump:heap:defaults:file=../../../../../dumps/heapdump.%Y%m%d.%H%M%S.%pid.%seq.phd you can also do -Xdump:system:defaults:file=../../../../../dumps/core.%Y%m%d.%H%M%S.%pid.%seq.dmp -Xdump:jit:defaults:file=../../../../../dumps/jitdump.%Y%m%d.%H%M%S.%pid.%seq.dmp

pshipton commented 2 weeks ago

For the file transfer, I'm sure we can make it work if the server is accessible outside your company. It's unlikely IBM would block it, or we could access from outside of IBM.

hzongaro commented 2 weeks ago

..just let me know if propsoed way of sharing files would work for you.

Yes, I was able to access the experimental file sharing link you sent earlier.

krismarc commented 1 week ago

Hi @pshipton @hzongaro

sorry for getting back to you so late. It took me a moment to cheat container orchestration to keep it alive when java process dies. Once, I've achieved that, I saw there's no enough disk space to save dumps.

Here the interesting part starts. Once I've increased the disk size, this became not reproducible anymore. After longer analyzes and attempts to reproduce this. The disk size was the only correlating reason for JVM to fail. Whenever, I've reduced the disk size, it was just a matter of time to get a failure again.

Here's a proof: image

With no load, it seems that it goes slowly 4 kilobytes down every several seconds.

I can easily reproduce this by allocating the filesystem with dummy files. image

Question if this is expected to simply crash? I am going to try the same with older openj9, if it's going to fail the same way. Maybe we were just lucky there was always some disk space left while using it. As we never experienced this before while the disk size was always the same.

Best regards, K.M.

pshipton commented 1 week ago

Bus error is often an indication of a disk space issue. I usually see it related to the shared cache, when the memory mapped shared memory can't be reflected to disk, but there is no indication to me this one is related to the shared cache. I hope Henry has some insights.

krismarc commented 1 week ago

I just tested with previous version. It handles the problem (it didn't crash), and there's a proper message.

ERR [err] !SESSION 2024-10-03 17:00:02.217 -----------------------------------------------
ERR eclipse.buildId=unknown
ERR java.fullversion=17.0.10+7
ERR JRE 17 Linux amd64-64-Bit Compressed References 20240116_670 (JIT enabled, AOT enabled)
ERR OpenJ9   - 2c3d78b48
ERR OMR      - ea8124dbc
ERR JCL      - 2aad089841f based on jdk-17.0.10+7
ERR BootLoader constants: OS=linux, ARCH=x86_64, WS=gtk, NL=en_US
ERR !ENTRY org.eclipse.osgi 4 0 2024-10-03 17:00:02.217
ERR !MESSAGE Error reading configuration: No space left on device
ERR !STACK 0
ERR java.io.IOException: No space left on device
ERR at java.base/java.io.File.createNewFile(Unknown Source)
ERR at org.eclipse.osgi.framework.internal.reliablefile.ReliableFile.createTempFile(ReliableFile.java:721)
ERR at org.eclipse.osgi.storagemanager.StorageManager.initializeInstanceFile(StorageManager.java:201)
ERR at org.eclipse.osgi.storagemanager.StorageManager.open(StorageManager.java:722)
ERR at org.eclipse.osgi.storage.Storage.getChildStorageManager(Storage.java:2194)
ERR at org.eclipse.osgi.storage.Storage.save0(Storage.java:1308)
ERR at org.eclipse.osgi.storage.Storage.save(Storage.java:1283)
ERR at org.eclipse.osgi.internal.framework.StorageSaver$StorageSaverTask.run(StorageSaver.java:31)
ERR at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
ERR at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
ERR at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
ERR at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
ERR at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
ERR at java.base/java.lang.Thread.run(Unknown Source)
ERR [ERROR   ] CWWKE0701E: Error reading configuration: No space left on device java.io.IOException: No space left on device
ERR at java.base/java.io.File.createNewFile(Unknown Source)
ERR at org.eclipse.osgi.framework.internal.reliablefile.ReliableFile.createTempFile(ReliableFile.java:721)
ERR at [internal classes]
ERR
ERR [ERROR   ] CWWKE0701E: Error saving on update java.io.IOException: No space left on device
ERR at java.base/java.io.File.createNewFile(Unknown Source)
ERR at org.eclipse.osgi.framework.internal.reliablefile.ReliableFile.createTempFile(ReliableFile.java:721)
ERR at [internal classes]
ERR

image

krismarc commented 1 week ago

unfortunately, due to the nature of this problem, I can't really create dumps. The option for me would be to reproduce this locally and save dumps to another file system. However, I assume, due to lack of disk space in fs where jvm is executed this probably won't be possible anyway?

hzongaro commented 1 week ago

I haven't played with this at all, but you might look at setting /proc/sys/kernel/core_pattern to specify a different file system where the core file can be written, where %t specifies a time-stamp and %p the process ID:

echo '/myfs/core.%t.%p' | sudo tee /proc/sys/kernel/core_pattern
krismarc commented 1 week ago

I'll play around tomorrow. However, I assume it's worth trying to run anything using latest openj9 and mimic running out of disk space. In our case it's open liberty with spring boot application stared using it. I'd be not surprised that what we do in spring boot doesn't really matter. So there's a chance you could reproduce it yourself by running Open Liberty Spring's sample app.

krismarc commented 1 week ago

It's a blind guess. However, I can see some commits refering to memory management made by @mpirvu -> https://github.com/eclipse-openj9/openj9/commit/5b130020f29aae2a20194e6958d7850d2ac70719 -> https://github.com/eclipse-openj9/openj9/commit/a0467305e900ec6b12aff695167776bfe0aeb4d4 ..maybe it's somehow worth having this checked with him.

krismarc commented 1 week ago

I've just spinned docker container with OL's demo. Accordingly to this: https://openliberty.io/guides/spring-boot.html#building-and-running-the-application-in-a-docker-container

after allocating all the disk available within container and calling this app single time I got this:

2024-10-03 21:42:41 [AUDIT   ] CWWKE0085I: The server defaultServer is stopping because the JVM is exiting.
2024-10-03 21:42:41 [AUDIT   ] CWWKT0017I: Web application removed (default_host): http://73ba6b4373b5:9080/
2024-10-03 21:42:41 [AUDIT   ] CWWKE1100I: Waiting for up to 30 seconds for the server to quiesce.
2024-10-03 21:42:42 [AUDIT   ] CWWKZ0009I: The application thin-guide-spring-boot-0.1.0 has stopped successfully.
2024-10-03 21:42:42 [AUDIT   ] CWWKE0036I: The server defaultServer stopped after 5 minutes, 12.132 seconds.
2024-10-03 21:58:19 
2024-10-03 21:58:19 Unhandled exception
2024-10-03 21:58:19 Type=Bus error vmState=0x00000000
2024-10-03 21:58:19 J9Generic_Signal_Number=00000028 Signal_Number=00000007 Error_Value=00000000 Signal_Code=00000002
2024-10-03 21:58:19 Handler1=00007F9D5B26E660 Handler2=00007F9D5C8A1740 InaccessibleAddress=00007F9D46117000
2024-10-03 21:58:19 RDI=00007F9D461170F0 RSI=00007F9D54123E58 RAX=00007F9D46117000 RBX=00000000000000F0
2024-10-03 21:58:19 RCX=00007F9D541EB1D0 RDX=00007F9D54123E58 R8=0000000001210698 R9=0000000000000240
2024-10-03 21:58:19 R10=0000000000000000 R11=0000000000000000 R12=00007F9D54123E58 R13=00007F9D541DAC60
2024-10-03 21:58:19 R14=0000000000100000 R15=00007F9D5A99B704
2024-10-03 21:58:19 RIP=00007F9D59F07B77 GS=0000 FS=0000 RSP=00007F9D5C9DA210
2024-10-03 21:58:19 EFlags=0000000000010287 CS=0033 RBP=00007F9D541DAC40 ERR=0000000000000006
2024-10-03 21:58:19 TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=00007F9D46117000
2024-10-03 21:58:19 xmm0=5c957680b180db69 (f: 2978011904.000000, d: 9.984043e+137)
2024-10-03 21:58:19 xmm1=5c95768000000000 (f: 0.000000, d: 9.984038e+137)
2024-10-03 21:58:19 xmm2=00000000b7654321 (f: 3076866816.000000, d: 1.520174e-314)
2024-10-03 21:58:19 xmm3=0000000000100000 (f: 1048576.000000, d: 5.180654e-318)
2024-10-03 21:58:19 xmm4=00007f9d46117000 (f: 1175547904.000000, d: 6.932406e-310)
2024-10-03 21:58:19 xmm5=0000003000000020 (f: 32.000000, d: 1.018558e-312)
2024-10-03 21:58:19 xmm6=0000004400000044 (f: 68.000000, d: 1.442957e-312)
2024-10-03 21:58:19 xmm7=0000002000000000 (f: 0.000000, d: 6.790387e-313)
2024-10-03 21:58:19 xmm8=00007f9d5abdd9e0 (f: 1522391552.000000, d: 6.932423e-310)
2024-10-03 21:58:19 xmm9=00007f9d5abe6c60 (f: 1522429056.000000, d: 6.932423e-310)
2024-10-03 21:58:19 xmm10=3f3f3f3f3f3f3f3f (f: 1061109568.000000, d: 4.767923e-04)
2024-10-03 21:58:19 xmm11=9999999999999999 (f: 2576980480.000000, d: -2.353437e-185)
2024-10-03 21:58:19 xmm12=2020202020202020 (f: 538976256.000000, d: 6.013470e-154)
2024-10-03 21:58:19 xmm13=000001380000013f (f: 319.000000, d: 6.620627e-312)
2024-10-03 21:58:19 xmm14=0000000008001800 (f: 134223872.000000, d: 6.631540e-316)
2024-10-03 21:58:19 xmm15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
2024-10-03 21:58:19 Module=/opt/java/openjdk/lib/default/libj9jit29.so
2024-10-03 21:58:19 Module_base_address=00007F9D59CC9000
2024-10-03 21:58:19 Target=2_90_20240716_840 (Linux 5.15.153.1-microsoft-standard-WSL2)
2024-10-03 21:58:19 CPU=amd64 (32 logical CPUs) (0x3e4dd5000 RAM)
2024-10-03 21:58:19 ----------- Stack Backtrace -----------
2024-10-03 21:58:19 _ZN2J919PersistentAllocator25allocateFromSegmentLockedEm+0x27 (0x00007F9D59F07B77 [libj9jit29.so+0x23eb77])
2024-10-03 21:58:19 _ZN2J919PersistentAllocator16allocateInternalEm+0x12e (0x00007F9D59F07ECE [libj9jit29.so+0x23eece])
2024-10-03 21:58:19 _ZN2J919PersistentAllocator8allocateEmSt9nothrow_tPv+0x9 (0x00007F9D59F08009 [libj9jit29.so+0x23f009])
2024-10-03 21:58:19 _ZN12TR_IProfiler8allocateEP11J9JITConfig+0x23 (0x00007F9D5A16A9D3 [libj9jit29.so+0x4a19d3])
2024-10-03 21:58:19 onLoadInternal+0xad1 (0x00007F9D59E687D1 [libj9jit29.so+0x19f7d1])
2024-10-03 21:58:19 J9VMDllMain+0x7e2 (0x00007F9D59E4A922 [libj9jit29.so+0x181922])
2024-10-03 21:58:19 runJ9VMDllMain+0xc9 (0x00007F9D5B282849 [libj9vm29.so+0x58849])
2024-10-03 21:58:19 pool_do+0x75 (0x00007F9D5B3DB195 [libj9vm29.so+0x1b1195])
2024-10-03 21:58:19 runInitializationStage+0x8e (0x00007F9D5B28505E [libj9vm29.so+0x5b05e])
2024-10-03 21:58:19 protectedInitializeJavaVM+0x2831 (0x00007F9D5B28EEE1 [libj9vm29.so+0x64ee1])
2024-10-03 21:58:19 omrsig_protect+0x239 (0x00007F9D5C8A23C9 [libj9prt29.so+0x2a3c9])
2024-10-03 21:58:19 initializeJavaVM+0x429 (0x00007F9D5B286A59 [libj9vm29.so+0x5ca59])
2024-10-03 21:58:19 J9_CreateJavaVM+0x72 (0x00007F9D5B279C02 [libj9vm29.so+0x4fc02])
2024-10-03 21:58:19 JNI_CreateJavaVM_impl+0x8b8 (0x00007F9D5C92B418 [libjvm.so+0x19418])
2024-10-03 21:58:19 JNI_CreateJavaVM+0x67f (0x00007F9D5C9EA13F [libjvm.so+0xd13f])
2024-10-03 21:58:19 JavaMain+0x8b (0x00007F9D5C9F8E1B [libjli.so+0x4e1b])
2024-10-03 21:58:19 ThreadJavaMain+0x9 (0x00007F9D5C9FD229 [libjli.so+0x9229])
2024-10-03 21:58:19 start_thread+0xea (0x00007F9D5C3AD1CA [libpthread.so.0+0x81ca])
2024-10-03 21:58:19 clone+0x43 (0x00007F9D5BE048D3 [libc.so.6+0x398d3])
2024-10-03 21:58:19 ---------------------------------------
2024-10-03 21:58:19 JVMDUMP039I Processing dump event "gpf", detail "" at 2024/10/03 19:58:19 - please wait.
2024-10-03 21:58:19 JVMDUMP032I JVM requested System dump using '/opt/ol/wlp/output/defaultServer/core.20241003.195819.1.0001.dmp' in response to an event
2024-10-03 21:58:25 JVMDUMP012E Error in System dump: The core file created by child process with pid = 30 was not found.  Expected to find core file with name "/mnt/wslg/dumps/core.java.30"
2024-10-03 21:58:25 JVMDUMP032I JVM requested Java dump using '/opt/ol/wlp/output/defaultServer/javacore.20241003.195819.1.0002.txt' in response to an event
2024-10-03 21:58:25 JVMDUMP012E Error in Java dump: /opt/ol/wlp/output/defaultServer/javacore.20241003.195819.1.0002.txt
2024-10-03 21:58:25 JVMDUMP032I JVM requested Snap dump using '/opt/ol/wlp/output/defaultServer/Snap.20241003.195819.1.0003.trc' in response to an event
2024-10-03 21:58:25 JVMDUMP010I Snap dump written to /opt/ol/wlp/output/defaultServer/Snap.20241003.195819.1.0003.trc
2024-10-03 21:58:25 JVMDUMP032I JVM requested JIT dump using '/opt/ol/wlp/output/defaultServer/jitdump.20241003.195819.1.0004.dmp' in response to an event
2024-10-03 21:58:25 JVMDUMP013I Processed dump event "gpf", detail "".
pshipton commented 1 week ago

Moved to the 0.48 milestone.

hzongaro commented 1 week ago

@krismarc, thanks for tracking down those commits. May I ask you to see whether running with the option -Xjit:disableIprofilerDataDisclaiming avoids the problem?

hzongaro commented 1 week ago

@cjjdespres, may I ask you to take a look at this problem?

krismarc commented 1 week ago

I'll try it and let you know. Meanwhile, I've got another observations.

I've tried with 2 different types of Java applications.

1) While testing with new openj9, I can see that any REST call done against application causes some disk consumption and finally results in crash. That's not observed with 0.45. I can keep calling the app even if disk remains 0, no crashes, no errors.

2) The one I've been testing since beginning. Calls cause some disk consumption, when it comes to 0 the version 0.45 handles it with no disk space left errors. 0.46+ results in errors we are taking up here in this issue.

krismarc commented 1 week ago

@krismarc, thanks for tracking down those commits. May I ask you to see whether running with the option -Xjit:disableIprofilerDataDisclaiming avoids the problem?

no changes, I could reproduce it in the same as I've just described

cjjdespres commented 1 week ago

I'll take a look. @mpirvu might be available to take a look at some point after this week as well. (EDIT: I misremembered, and he's back on Oct 10, not that it should matter).

Just from the backtrace in https://github.com/eclipse-openj9/openj9/issues/20239#issuecomment-2392243873, it looks like it's crashing right when the IProfiler is being started. I believe that after https://github.com/eclipse-openj9/openj9/pull/19355 the IProfiler has had a dedicated memory allocator that it will use for memory allocation (instead of using jitPersistentAlloc()) regardless of whether or not -Xjit:disableIprofilerDataDisclaiming is set. If it's some problem with the creation of the memory allocator or some behavioural difference between using PersistentAllocator::allocate(size_t size, const std::nothrow_t tag, void * hint) with the dedicated allocator and just using jitPersistentAlloc, then that option wouldn't help. That option does influence some of the properties of the allocator that's created, though.

cjjdespres commented 1 week ago

I can reproduce this if I follow that spring boot guide, though I happened to do it by sticking guide-spring-boot in a small-ish docker volume that I could then fill up after I'd built the server using these instructions.

If I just run java -jar target/guide-spring-boot-0.1.0.jar I get the same crash as in https://github.com/eclipse-openj9/openj9/issues/20239#issuecomment-2392243873. If I run java -Xjit:disableIprofilerDataDisclaiming -jar target/guide-spring-boot-0.1.0.jar I appear to get a crash in a different place:

Unhandled exception
Type=Bus error vmState=0x0005ffff
J9Generic_Signal_Number=00000028 Signal_Number=00000007 Error_Value=00000000 Signal_Code=00000002
Handler1=00007FDA659B83E0 Handler2=00007FDA65827B40 InaccessibleAddress=00007FDA32D00000
RDI=00007FDA6013C090 RSI=0000000000200000 RAX=00007FDA32D00008 RBX=0000000000200000
RCX=00007FDA65E0542B RDX=00007FDA6012C538 R8=00007FDA32D00000 R9=0000000000000000
R10=00007FFED498A080 R11=0000000000000246 R12=00007FDA32D00000 R13=00007FDA4C008150
R14=0000000000000001 R15=0000000000000140
RIP=00007FDA64B27AB4 GS=0000 FS=0000 RSP=00007FDA51A8AEF0
EFlags=0000000000010246 CS=0033 RBP=00007FDA6013C090 ERR=0000000000000006
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=00007FDA32D00000
xmm0=00007fda32d00008 (f: 852492288.000000, d: 6.945334e-310)
xmm1=5858585858585f37 (f: 1482186496.000000, d: 3.836983e+117)
xmm2=0000585858585858 (f: 1482184832.000000, d: 4.799179e-310)
xmm3=0000000000200000 (f: 2097152.000000, d: 1.036131e-317)
xmm4=00007fda32d00000 (f: 852492288.000000, d: 6.945334e-310)
xmm5=0000003000000020 (f: 32.000000, d: 1.018558e-312)
xmm6=00007fda50992bf0 (f: 1352215552.000000, d: 6.945359e-310)
xmm7=0000000000000002 (f: 2.000000, d: 9.881313e-324)
xmm8=00007fda653a0c90 (f: 1698303104.000000, d: 6.945376e-310)
xmm9=00007fda653aa0a0 (f: 1698340992.000000, d: 6.945376e-310)
xmm10=3f3f3f3f3f3f3f3f (f: 1061109568.000000, d: 4.767923e-04)
xmm11=9999999999999999 (f: 2576980480.000000, d: -2.353437e-185)
xmm12=2020202020202020 (f: 538976256.000000, d: 6.013470e-154)
xmm13=000001380000013f (f: 319.000000, d: 6.620627e-312)
xmm14=0000000008001800 (f: 134223872.000000, d: 6.631540e-316)
xmm15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/opt/java/openjdk/lib/default/libj9jit29.so
Module_base_address=00007FDA6467F000

Method_being_compiled=java/lang/Object.<init>()V
Target=2_90_20240906_000000 (Linux 5.14.0-427.35.1.el9_4.x86_64)
CPU=amd64 (8 logical CPUs) (0x3d781c000 RAM)
----------- Stack Backtrace -----------
_ZN19TR_DataCacheManager28convertDataCacheToAllocationEP12TR_DataCache+0x44 (0x00007FDA64B27AB4 [libj9jit29.so+0x4a8ab4])
_ZN19TR_DataCacheManager23allocateDataCacheRecordEjjPj+0x16b (0x00007FDA64B2802B [libj9jit29.so+0x4a902b])
_ZN31TR_SharedCacheRelocationRuntime24allocateSpaceInDataCacheEmm+0x28 (0x00007FDA64B5EFC8 [libj9jit29.so+0x4dffc8])
_ZN20TR_RelocationRuntime29prepareRelocateAOTCodeAndDataEP10J9VMThreadP11TR_FrontEndPN2TR9CodeCacheEPK20J9JITDataCacheHeaderP8J9MethodbPNS4_7OptionsEPNS4_11CompilationEP17TR_ResolvedMethodPhP16TR_J9SharedCache+0x510 (0x00007FDA64B60000 [libj9jit29.so+0x4e1000])
_ZN2TR28CompilationInfoPerThreadBase22installAotCachedMethodEP10J9VMThreadPKvP8J9MethodP11TR_FrontEndPNS_7OptionsEP17TR_ResolvedMethodP21TR_MethodToBeCompiledPNS_11CompilationE+0x91 (0x00007FDA647F8611 [libj9jit29.so+0x179611])
_ZN2TR28CompilationInfoPerThreadBase14performAOTLoadEP10J9VMThreadPNS_11CompilationEP17TR_ResolvedMethodP11TR_J9VMBaseP8J9Method+0x91 (0x00007FDA647FB6E1 [libj9jit29.so+0x17c6e1])
_ZN2TR28CompilationInfoPerThreadBase7compileEP10J9VMThreadPNS_11CompilationEP17TR_ResolvedMethodR11TR_J9VMBaseP19TR_OptimizationPlanRKNS_16SegmentAllocatorE+0xac5 (0x00007FDA647FC4C5 [libj9jit29.so+0x17d4c5])
_ZN2TR28CompilationInfoPerThreadBase14wrappedCompileEP13J9PortLibraryPv+0x385 (0x00007FDA647FCEE5 [libj9jit29.so+0x17dee5])
omrsig_protect+0x241 (0x00007FDA658287F1 [libj9prt29.so+0x2a7f1])
_ZN2TR28CompilationInfoPerThreadBase7compileEP10J9VMThreadP21TR_MethodToBeCompiledRN2J917J9SegmentProviderE+0x395 (0x00007FDA647FAA05 [libj9jit29.so+0x17ba05])
_ZN2TR24CompilationInfoPerThread12processEntryER21TR_MethodToBeCompiledRN2J917J9SegmentProviderE+0x12c (0x00007FDA647FAD6C [libj9jit29.so+0x17bd6c])
_ZN2TR24CompilationInfoPerThread14processEntriesEv+0x377 (0x00007FDA647F9C37 [libj9jit29.so+0x17ac37])
_ZN2TR24CompilationInfoPerThread3runEv+0x42 (0x00007FDA647F9FA2 [libj9jit29.so+0x17afa2])
_Z30protectedCompilationThreadProcP13J9PortLibraryPN2TR24CompilationInfoPerThreadE+0x86 (0x00007FDA647FA056 [libj9jit29.so+0x17b056])
omrsig_protect+0x241 (0x00007FDA658287F1 [libj9prt29.so+0x2a7f1])
_Z21compilationThreadProcPv+0x17f (0x00007FDA647FA42F [libj9jit29.so+0x17b42f])
thread_wrapper+0x187 (0x00007FDA6595CC07 [libj9thr29.so+0xbc07])
start_thread+0x2d2 (0x00007FDA65D87C02 [libc.so.6+0x89c02])
__clone3+0x30 (0x00007FDA65E0CC40 [libc.so.6+0x10ec40])

and if I run java -Xjit:disableIprofilerDataDisclaiming,disableDataCacheDisclaiming -jar target/guide-spring-boot-0.1.0.jar I get no crash at all.

I'm pretty sure what's happening is that if disclaiming is enabled (in IProfiler, data cache, probably code cache as well) the MEMORY_TYPE_VIRTUAL flag gets set in the memory segment type we're requesting. Since -Xjit:disclaimMemoryOnSwap is disabled by default, the flag MEMORY_TYPE_DISCLAIMABLE_TO_FILE is also set. With these enabled, the segment ultimately comes from reserve_memory_with_mmap, which will end up creating a temporary file that then gets mapped into memory.

The manual entries for mkostemp, ftruncate, and mmap do not appear to exclude the possibility that mkostemp successfully created file descriptor, ftruncate silently failed to set the file to the correct size, and mmap sucessfully mapped that to some region of memory. I haven't traced the exact execution, but that seems plausible to me. Indeed, the mmap manual says:

       Use of a mapped region can result in these signals:

       SIGBUS Attempted access to a page of the buffer that lies beyond
              the end of the mapped file.  For an explanation of the
              treatment of the bytes in the page that corresponds to the
              end of a mapped file that is not a multiple of the page
              size, see NOTES.

A related pull request is https://github.com/eclipse/omr/pull/7262, which modified how reserve_memory_with_mmap works to support disclaiming file-backed memory to temporary files.

The error also goes away if I use -Xjit:disclaimMemoryOnSwap, which will cause this memory to be backed by swap instead and so avoid this particular code path.

cjjdespres commented 1 week ago

If that is the case, then that omr code might need to be changed. Maybe we need to fsat the file after ftruncate just to be sure that it's actually of sufficient size for mmap.

cjjdespres commented 1 week ago

Incidentally, this can be reproduced even by running java -version while in a directory on a file system that is completely full. A docker volume, anyway, in case that becomes relevant.

cjjdespres commented 1 week ago

Stepping through execution does confirm that something like that sequence of events occurs. Unfortunately, adding an fstat check does not help. What happens is:

  1. A file descriptor fd is opened successfully with mkostemp.
  2. The ftruncate call succeeds.
  3. A subsequent fstat that I added also succeeds, and reports that the file has the correct size.
  4. The file is mapped to memory with mmap. The address returned points to inaccessible memory.

The address we get becomes the heapBase of a J9MemorySegment stored in the PersistentAllocator created with (MEMORY_TYPE_VIRTUAL | MEMORY_TYPE_DISCLAIMABLE_TO_FILE). We then crash when trying to allocate a Block using that segment.

I also added an fstatfs(fd, &buf) call just before the ftruncate, and that does correctly report that the file system in which the temporary file is created has a block size of 4096 and 0 free blocks. Maybe that function could be used to make sure there is enough space before attempting to ftruncate and mmap.

cjjdespres commented 1 week ago

There's also fallocate() that seems to do what we want here. I'll test that.

Unrelated, but using open() with the O_TMPFILE flag might be slightly better than doing mkostmp followed by unlink, but that's minor.

hzongaro commented 5 days ago

As we're late in the schedule for 0.48, we're going to have to move this out to 0.49.

hzongaro commented 5 days ago

@cjjdespres, will your proposed change in eclipse/omr#7484 also address the observation made in an earlier comment?

  1. While testing with new openj9, I can see that any REST call done against application causes some disk consumption and finally results in crash. That's not observed with 0.45. I can keep calling the app even if disk remains 0, no crashes, no errors.
mpirvu commented 5 days ago

The default behavior of the data cache and IProfiler segment disclaiming is to use some disk space. A REST call may cause the JIT compiler to compile new methods and allocate new data caches backed by files. If the intent is to be able to run the JVM when there is little or no disk space available, then we have to change the default to rely on swap space (if present).

krismarc commented 2 days ago

The default behavior of the data cache and IProfiler segment disclaiming is to use some disk space. A REST call may cause the JIT compiler to compile new methods and allocate new data caches backed by files. If the intent is to be able to run the JVM when there is little or no disk space available, then we have to change the default to rely on swap space (if present).

I'm a bit confused. Does it mean, it's some sort of new JVM feature? If so, we tried this -Xjit:disableIprofilerDataDisclaiming is supposed to disable the feature, so JVM should behave as before? While it yet crashes, which was not a case in previous versions.

mpirvu commented 1 day ago

Yes it is a new JVM feature that should reduce the memory footprint associated with interpreter profiling and data cache. To fully disable this you need -Xjit:disableDataCacheDisclaiming,disableIprofilerDataDisclaiming

As an alternative you can use -Xjit:disclaimMemoryOnSwap. This will keep the feature enabled, but use swap space instead of temporary files. The catch is that, if swap is not available on your system, the JVM will revert to using temporary files.

@krismarc If we send you a prototype JVM with the proposed fix, would you be able to try it?

cjjdespres commented 1 day ago

@cjjdespres, will your proposed change in eclipse/omr#7484 also address the observation made in an earlier comment?

That I have not tested. If the extra disk consumption is caused by the file-backed memory of the data cache/IProfiler, then yes, it should be fixed. I can test the sample spring boot application to see if I can replicate that behaviour.

krismarc commented 1 day ago

Yes it is a new JVM feature that should reduce the memory footprint associated with interpreter profiling and data cache. To fully disable this you need -Xjit:disableDataCacheDisclaiming,disableIprofilerDataDisclaiming

As an alternative you can use -Xjit:disclaimMemoryOnSwap. This will keep the feature enabled, but use swap space instead of temporary files. The catch is that, if swap is not available on your system, the JVM will revert to using temporary files.

@krismarc If we send you a prototype JVM with the proposed fix, would you be able to try it?

Thx for details. Do we have any details/percentage of much memory we save by this feature? I do believe the answer would be 'it depends' however, question is rather if it's something noticeable.

I assume, @cjjdespres has already an option to reproduce this himself as mentioned in this comment: https://github.com/eclipse-openj9/openj9/issues/20239#issuecomment-2394319288

..anyway, I could give it a try myself too.

cjjdespres commented 1 day ago

I think I can reproduce it, yes. Before https://github.com/eclipse/omr/pull/7484, the filesystem usage would slowly increase, presumably because we were filling in the data cache or IProfiler segments with data, and that was dynamically increasing the actual file size on disk. I didn't check, but I imagine there would be a SIGBUS crash if the filesystem filled up and the underlying temp file couldn't be expanded.

With the change, the filesystem usage now increases in larger jumps, because we make sure to reserve the space we want for our segments in the filesystem right away. There aren't many of these jumps, since the spring boot application is small and there isn't much IProfiler or compilation activity that needs to take place. If there is not enough space in the underlying filesystem then there is no crash - we avoid backing segments with a temp file in these circumstances, and fall back to other means of memory allocation.

github-actions[bot] commented 1 day ago

Issue Number: 20239 Status: Closed Actual Components: comp:jit, userRaised Actual Assignees: No one :( PR Assignees: No one :(