eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

renaissance StackOverflowError #14455

Closed JasonFengJ9 closed 2 years ago

JasonFengJ9 commented 2 years ago

Failure link

From an internal build job/Test_openjdk8_j9_extended.perf_aarch64_linux/44/(cent7-aarch64-6):

openjdk version "1.8.0_322"
IBM Semeru Runtime Open Edition (build 1.8.0_322-b06)
Eclipse OpenJ9 VM (build master-88c97662c, JRE 1.8.0 Linux aarch64-64-Bit Compressed References 20220205_289 (JIT enabled, AOT enabled)
OpenJ9   - 88c97662c
OMR      - 96a484d1b
JCL      - cc0e9e2be based on jdk8u322-b06)

Rerun in Grinder - Change TARGET to run only the failed test targets.

Optional info

Failure output (captured from console output)

[2022-02-05T04:44:06.086Z] variation: NoOptions
[2022-02-05T04:44:06.086Z] JVM_OPTIONS:  

[2022-02-05T04:47:19.659Z] ====== als (apache-spark), iteration 24 started ======
[2022-02-05T04:47:25.646Z] Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
[2022-02-05T04:47:25.647Z]  at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
[2022-02-05T04:47:25.647Z]  at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[2022-02-05T04:47:25.647Z]  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2022-02-05T04:47:25.647Z]  at java.lang.reflect.Method.invoke(Method.java:498)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
[2022-02-05T04:47:25.647Z]  at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
[2022-02-05T04:47:25.647Z]  at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
[2022-02-05T04:47:25.647Z]  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2022-02-05T04:47:25.647Z]  at java.lang.reflect.Method.invoke(Method.java:498)
[2022-02-05T04:47:25.647Z]  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)

[2022-02-05T04:47:25.671Z]  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
[2022-02-05T04:47:25.671Z]  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
[2022-02-05T04:47:25.671Z]  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Cancelling nested steps due to timeout
[2022-02-06T00:40:35.242Z] Sending interrupt signal to process
[2022-02-06T00:40:44.271Z] 
[2022-02-06T00:40:44.271Z] renaissance-als_0_FAILED

This is JDK8 head stream, appears aarch64 specific unlike https://github.com/eclipse-openj9/openj9/issues/12219. fyi @tajila

pshipton commented 2 years ago

@knn-k fyi

knn-k commented 2 years ago

I ran a 10x Grinder job at internal job/Grinder/20821/tapResults/. No failures.

knn-k commented 2 years ago

Reproduced in another Grinder job: job/Grinder/20822/consoleText

tajila commented 2 years ago

@mikezhang1234567890 can you take a look at this. We have a 256K stack size for aarch64, can you see if a 512k default size solves this?

mikezhang1234567890 commented 2 years ago

I'll rerun the grinder with -Xmso512K to see if the error still occurs.

mikezhang1234567890 commented 2 years ago

Had a stack overflow failure with -Xmso512K, in internal Jenkins: /job/Grinder/20844/consoleText

mikezhang1234567890 commented 2 years ago

Looks like the Jenkins job doesn't pass the command line arguments to the perf test. Will retry by creating a new build with default set to 512K.

mikezhang1234567890 commented 2 years ago

Using #define J9_OS_STACK_SIZE (1024 * 1024) a 40x grinder ran internally and had no failures at job/Grinder/20902/

Using 512k did have a stack overflow.

@tajila should I make the change to set J9_OS_STACK_SIZE to 1024K for aarch64?

tajila commented 2 years ago

@knn-k Is there a reason why aarch64 would require larger stack space?

pshipton commented 2 years ago

Only z/OS has such a large default size, as I recall it's because the OS provides that size even if we specify a smaller one. https://www.eclipse.org/openj9/docs/openj9_defaults/

knn-k commented 2 years ago

@tajila I have no idea on the reason for large stack on AArch64. I don't understand why this failure occurs intermittently, either.

tajila commented 2 years ago

@knn-k

Here are the volatile register we save on x86

    rax
    rcx
    rdx
        //RSI not volatile in C but is a JIT helper argument register
    rsi
    r8
    r9
    r10
    r11
    xmm0
    xmm1
    xmm2
    xmm3
    xmm4
    xmm5
//total (8*8) + (6*16) = 160bytes

on aarch64 its

x0 ... x18
d16 ... d31
//total (19*8) + (16*8) = 268bytes

So this would warrant a larger native stack than x86, but 1M seems extreme.

tajila commented 2 years ago

@mikezhang1234567890 do you have a core file from when it SO's? it would be interesting to see the !stacklsots [thread] for the thread that overflows. Know how many jit or native transitions are performed would help us identify the cause

mikezhang1234567890 commented 2 years ago

No core yet, I'll run another grinder and get one.

mikezhang1234567890 commented 2 years ago

stackoverflow_stackslots.txt

@tajila I've attached the output for !stackslots and !j9vmthread on the thread that gets the stack overflow.

tajila commented 2 years ago

Based on that output it looks like a Java stack overflow rather than a OS stack overflow. @mikezhang1234567890 can you try it with a larger -Xss, keeping the original default OS stack size

mikezhang1234567890 commented 2 years ago

Since https://www.eclipse.org/openj9/docs/openj9_defaults/ says the default is 1024K, I tried -Xss2M and didn't get a stack overflow error in 40 runs.

tajila commented 2 years ago

After looking at the diagnostics in https://github.com/eclipse-openj9/openj9/issues/14455#issuecomment-1040474440 it appears that there are a lot of arrays pinned on the java stack. This contributes to a lot of the java stack space usage.

I think someone from the JIT team should look at whether this is happeing from frequently on aarch64 as opposed to other platforms.

@0xdaryl

Akira1Saitoh commented 2 years ago

Looking at the jitdump, I found java/io/ObjectOutputStream.writeObject0(Ljava/lang/Object;Z)V had 178 locals and stack frame size of the method was 1552 bytes. The method was called recursively 489 times according to javacore. I am not sure if this method has many locals as AArch64 on other platforms.

AArch64 does not implement locals compaction[1] and allocate 8-byte space even for 4-byte locals. It seems that x and z do locals compaction and spend 4 bytes for 4-byte locals. I think Power does the same as AArch64 because we referred to the power code when implementing AArch64 codegen.

[1] https://github.com/eclipse-openj9/openj9/issues/5910

localsInfoWriteObject0.txt

knn-k commented 2 years ago

Locals compaction has been implemented and enabled on AArch64 by https://github.com/eclipse/omr/pull/6387 and related PRs. Thanks to @Akira1Saitoh

pshipton commented 2 years ago

Locals compaction has been implemented and enabled on AArch64

0xdaryl commented 2 years ago

I'd like this change to soak in master for a while. Bugs here can be kind of subtle. I think it can wait for 0.33.

Akira1Saitoh commented 2 years ago

Are all the changes merged in the head stream? The OMR change is promoted, not sure what the other PRs are.

All the changes have been merged in the head stream.

Does locals compaction resolve this issue?

50x Grinder run passed with the personal build of the head stream. (/job/Grinder/21693/)

pshipton commented 2 years ago

Closing since the problem is resolved.