eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

JDK 18/19 jdk_vector_0_FAILED Unexpected exit from test [exit code: 137] #15268

Closed JasonFengJ9 closed 1 year ago

JasonFengJ9 commented 2 years ago

Failure link

From an internal build job/Test_openjdknext_j9_extended.openjdk_aarch64_linux_Personal/1/consoleFull(ub18-aarch64-7):

11:36:29  openjdk version "19-internal" 2022-09-20
11:36:29  OpenJDK Runtime Environment (build 19-internal-adhoc.jenkins.BuildJDKnextaarch64linuxPersonal)
11:36:29  Eclipse OpenJ9 VM (build exclude19-52f04efbff5, JRE 19 Linux aarch64-64-Bit Compressed References 20220607_35 (JIT enabled, AOT enabled)
11:36:29  OpenJ9   - 52f04efbff5
11:36:29  OMR      - c60867497c6
11:36:29  JCL      - 5ccf02de16a based on jdk-19+25)

Rerun in Grinder - Change TARGET to run only the failed test targets.

Optional info

Failure output (captured from console output)

14:55:40  ===============================================
14:55:40  Running test jdk_vector_0 ...
14:55:40  ===============================================
14:55:40  jdk_vector_0 Start Time: Tue Jun  7 14:55:40 2022 Epoch Time (ms): 1654628140115
14:55:40  "/home/jenkins/workspace/Test_openjdknext_j9_extended.openjdk_aarch64_linux_Personal/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdknext_j9_extended.openjdk_aarch64_linux_Personal/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
14:55:40  JVMSHRC005I No shared class caches available
14:55:40  JVMSHRC005I No shared class caches available
14:55:40  cache cleanup done
14:55:40  variation: Mode150
14:55:40  JVM_OPTIONS:  -XX:+UseCompressedOops 

15:00:05  TEST: jdk/incubator/vector/Byte128VectorTests.java

15:00:05  TEST: jdk/incubator/vector/Byte256VectorTests.java

15:11:53  TEST: jdk/incubator/vector/Int128VectorTests.java

15:23:13  TEST: jdk/incubator/vector/Short256VectorTests.java

15:23:13  TEST RESULT: Failed. Unexpected exit from test [exit code: 137]

15:23:13  --------------------------------------------------
15:33:12  Test results: passed: 67; failed: 4

15:33:31  jdk_vector_0_FAILED
pshipton commented 2 years ago

@gita-omr

gita-omr commented 2 years ago

Looking.

JasonFengJ9 commented 2 years ago

An internal build(cent7-aarch64-6)

openjdk version "18.0.1.1" 2022-04-22
IBM Semeru Runtime Open Edition 18.0.2.0-m2 (build 18.0.1.1+2)
Eclipse OpenJ9 VM 18.0.2.0-m2 (build master-8ead00aa6, JRE 18 Linux aarch64-64-Bit Compressed References 20220714_19 (JIT enabled, AOT enabled)
OpenJ9   - 8ead00aa6
OMR      - 29756eb80
JCL      - c0b6f04919 based on jdk-18.0.1.1+2)

[2022-07-14T23:45:13.370Z] variation: Mode150
[2022-07-14T23:45:13.370Z] JVM_OPTIONS:  -XX:+UseCompressedOops 

[2022-07-15T00:13:52.837Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2022-07-15T00:13:52.837Z] --------------------------------------------------
[2022-07-15T00:22:18.492Z] Test results: passed: 71; failed: 1
[2022-07-15T00:22:30.546Z] Report written to /home/jenkins/workspace/Test_openjdk18_j9_extended.openjdk_aarch64_linux/jvmtest/openjdk/report/html/report.html
[2022-07-15T00:22:30.546Z] Results written to /home/jenkins/workspace/Test_openjdk18_j9_extended.openjdk_aarch64_linux/aqa-tests/TKG/output_16578313613568/jdk_vector_0/work
[2022-07-15T00:22:30.546Z] Error: Some tests failed or other problems occurred.
[2022-07-15T00:22:30.546Z] 
[2022-07-15T00:22:30.546Z] jdk_vector_0_FAILED
knn-k commented 2 years ago

FYI @Akira1Saitoh

JasonFengJ9 commented 2 years ago

An internal build(ub18-aarch64-8)

[2022-08-04T22:30:39.700Z] variation: Mode150
[2022-08-04T22:30:39.700Z] JVM_OPTIONS:  -XX:+UseCompressedOops 

[2022-08-04T22:50:43.793Z] TEST: jdk/incubator/vector/Long128VectorTests.java
[2022-08-04T22:50:43.806Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2022-08-04T22:50:43.806Z] --------------------------------------------------
[2022-08-04T22:52:40.943Z] TEST: jdk/incubator/vector/LongMaxVectorTests.java

[2022-08-04T22:52:40.951Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2022-08-04T22:52:40.951Z] --------------------------------------------------
[2022-08-04T22:58:11.608Z] TEST: jdk/incubator/vector/ShortMaxVectorTests.java
[2022-08-04T22:58:11.614Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2022-08-04T22:58:11.614Z] --------------------------------------------------
[2022-08-04T23:06:28.986Z] Test results: passed: 69; failed: 3
[2022-08-04T23:06:34.769Z] Report written to /home/jenkins/workspace/Test_openjdk18_j9_extended.openjdk_aarch64_linux/jvmtest/openjdk/report/html/report.html
[2022-08-04T23:06:34.769Z] Results written to /home/jenkins/workspace/Test_openjdk18_j9_extended.openjdk_aarch64_linux/aqa-tests/TKG/output_16596423323148/jdk_vector_0/work
[2022-08-04T23:06:34.769Z] Error: Some tests failed or other problems occurred.
[2022-08-04T23:06:34.769Z] 
[2022-08-04T23:06:34.769Z] jdk_vector_0_FAILED
tajila commented 2 years ago

@knn-k any update on this?

Akira1Saitoh commented 2 years ago

It seems that Linux kernel OOM killer terminates test java process of jdk_vector_0. A certain test spends more than 4GB of memory and multiple tests are running in parallel. Hence, it uses up the system memory (8GB in this case), resulting in OOM killer. Below is the excerpt from dmesg on an internal machine (ub18-aarch64).

[3340113.392290] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[3340113.392296] [  498]     0   498    68302     2111   540672     1553             0 systemd-journal
[3340113.392298] [  507]     0   507    21532        2    53248       66             0 lvmetad
[3340113.392300] [  517]     0   517     4212      116    57344      205         -1000 systemd-udevd
[3340113.392302] [  667]   100   667     2842       29    61440      127             0 systemd-network
[3340113.392303] [  685]     0   685     1631       22    57344      110             0 rpcbind
[3340113.392305] [  794]     0   794    96581      106   110592      347             0 udisksd
[3340113.392306] [  801]   103   801     6205      160    90112      329          -900 dbus-daemon
[3340113.392308] [  845]     0   845      912        8    49152       45             0 atd
[3340113.392309] [  851]     0   851    20096       37    53248       61             0 irqbalance
[3340113.392311] [  858]     0   858   166867        0   126976      249             0 lxcfs
[3340113.392313] [  864]     0   864    26124        2    98304     2042             0 networkd-dispat
[3340113.392314] [  866]     0   866     2675       27    57344       46             0 cron
[3340113.392316] [  875]     0   875    59565       49    98304      230             0 accounts-daemon
[3340113.392318] [  876]   102   876    59380      133   106496      591             0 rsyslogd
[3340113.392319] [  889]     0   889    57868      100    81920      146             0 polkitd
[3340113.392321] [  899]   101   899     2620       21    61440      161             0 systemd-resolve
[3340113.392323] [ 1121]     0  1121   444036     1599   364544     4125          -999 containerd
[3340113.392324] [ 1138]   111  1138    22000       35    86016      329             0 ntpd
[3340113.392326] [ 1143]     0  1143     2605        2    61440      181         -1000 sshd
[3340113.392328] [ 1310]     0  1310   365408     2703   442368     7434          -500 dockerd
[3340113.392329] [ 1518]     0  1518     2335        0    53248       34             0 agetty
[3340113.392331] [ 1532]     0  1532     2391        0    53248       35             0 agetty
[3340113.392333] [ 1002]     0  1002     3090        0    61440      283             0 systemd
[3340113.392334] [ 1003]     0  1003    43283        1   110592      746             0 (sd-pam)
[3340113.392337] [15684]     0 15684     6974        2    94208      475             0 sshd
[3340113.392339] [15751]  1000 15751     8127       51   106496     1581             0 sshd
[3340113.392340] [15806]  1000 15806  1557255     9367   819200    27520             0 java
[3340113.392342] [21187]  1000 21187   321581        0   368640     2606             0 Xvfb
[3340113.392344] [21230]  1000 21230      480        5    40960       20             0 sh
[3340113.392346] [21231]  1000 21231      480       19    40960       10             0 sh
[3340113.392347] [21232]  1000 21232      480        0    36864       26             0 sh
[3340113.392348] [21238]  1000 21238     2311        2    53248       86             0 make
[3340113.392350] [21284]  1000 21284     2340        2    53248      117             0 make
[3340113.392351] [21388]  1000 21388      480        0    36864       25             0 sh
[3340113.392353] [21389]  1000 21389     2339        1    53248      119             0 make
[3340113.392354] [21391]  1000 21391      480        0    45056       25             0 sh
[3340113.392355] [21392]  1000 21392     2342        2    49152      127             0 make
[3340113.392357] [21451]  1000 21451      480        0    45056       27             0 sh
[3340113.392358] [21452]  1000 21452      480        0    45056       30             0 sh
[3340113.392359] [21453]  1000 21453     2211        0    53248       27             0 tee
[3340113.392361] [21455]  1000 21455  1154331     5839   622592    10454             0 java
[3340113.392362] [21714]  1000 21714   777895        0   696320    30295             0 java
[3340113.392364] [21743]  1000 21743   777508        0   679936    29266             0 java
[3340113.392365] [21773]  1000 21773   777766      983   700416    28785             0 java
[3340113.392367] [23404]  1000 23404  1621004   733817  7254016    65847             0 java
[3340113.392369] [23565]  1000 23565  1891469  1000758  9441280    24641             0 java
[3340113.392370] [23705]  1000 23705   964486   158287  2023424     8570             0 java
[3340113.392372] [23769]  1000 23769     2209       17    53248        0             0 sleep
[3340113.392374] [23777]     0 23777     2532      135    57344        0             0 systemd-logind
[3340113.392375] Out of memory: Kill process 23565 (java) score 448 or sacrifice child
[3340113.438997] Killed process 23565 (java) total-vm:7565876kB, anon-rss:4003052kB, file-rss:0kB, shmem-rss:0kB
[3340113.654549] oom_reaper: reaped process 23565 (java), now anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
vij-singh commented 1 year ago

@gita-omr - Any new news?

gita-omr commented 1 year ago

@Akira1Saitoh got interesting findings. The tests are actually very small but looks like a lot of them are running in parallel. I am not very familiar with testng but maybe there are settings to run fewer processes/threads? I think this would be the first thing to try. @JasonFengJ9

JasonFengJ9 commented 1 year ago

The tests are actually very small but looks like a lot of them are running in parallel. I am not very familiar with testng but maybe there are settings to run fewer processes/threads?

@gita-omr are you referring to run individual tests like jdk/incubator/vector/Byte128VectorTests.java (there is a grinder option to do so) instead of jdk_vector_0?

gita-omr commented 1 year ago

@gita-omr are you referring to run individual tests like jdk/incubator/vector/Byte128VectorTests.java (there is a grinder option to do so) instead of jdk_vector_0?

Maybe we can try that?

Akira1Saitoh commented 1 year ago

Is it possible to use jtreg option to set the concurrency to lower value?

Akira1Saitoh commented 1 year ago

I am seeing the failure on the internal machine with 8 cpu cores and 8 GB memory. The test does not fail on another machine with 8 cpu cores and 16 GB memory.

JasonFengJ9 commented 1 year ago

@gita-omr are you referring to run individual tests like jdk/incubator/vector/Byte128VectorTests.java (there is a grinder option to do so) instead of jdk_vector_0?

Maybe we can try that?

An example internal build from https://github.com/eclipse-openj9/openj9/issues/16244

Is it possible to use jtreg option to set the concurrency to lower value?

@llxia any insights?

llxia commented 1 year ago

Is it possible to use jtreg option to set the concurrency to lower value?

In OpenJDK tests, we set concurrency based on the # following: min(NPROCS/2, MEM_IN_GB/2). To change concurrency, change L66 in your branch. https://github.com/adoptium/aqa-tests/blob/d5e241df7fe9a7648520efe4ebe1defc5fa5fd7d/openjdk/openjdk.mk#L46-L66

Akira1Saitoh commented 1 year ago

Hmm. the concurrency on aarch64 machine (8GB memory) is 3 while it is 1 on x86 machine (ub20x64rt3) with 4GB memory. This might be the reason why the test only fails on aarch64.

on aarch64:

$ cat Makefile 

MEMORY_SIZE:=$(shell KMEMMB=`awk '/^MemTotal:/{print int($$2/1024)}' /proc/meminfo`; if [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then CGMEM=`cat /sys/fs/cgroup/memory/memory.limit_in_bytes`; else CGMEM=`expr $${KMEMMB} \* 1024`; fi; CGMEMMB=`expr $${CGMEM} / 1048576`; if [ "$${KMEMMB}" -lt "$${CGMEMMB}" ]; then echo "$${KMEMMB}"; else echo "$${CGMEMMB}"; fi)

All:
        @echo ${MEMORY_SIZE}
$ make
7973
$ cat /proc/meminfo 
MemTotal:        8165204 kB

on x86:

$ make
3931

$ cat /proc/meminfo 
MemTotal:        4025600 kB
Akira1Saitoh commented 1 year ago

@llxia Thanks for the pointer. Is it possible to change the concurrency only for jdk_vector test?

llxia commented 1 year ago

@Akira1Saitoh I think it is possible to do it with new parameters:

In https://github.com/adoptium/aqa-tests/blob/d5e241df7fe9a7648520efe4ebe1defc5fa5fd7d/openjdk/openjdk.mk,

EXTRA_JTREG_OPTIONS += -concurrency:$(JTREG_CONC)
...
# set JTREG_BASIC_OPTIONS value into a new parameter before adding EXTRA_JTREG_OPTIONS
JTREG_BASIC_OPTIONS_WO_EXTRA_OPTS = $(JTREG_BASIC_OPTIONS)
JTREG_BASIC_OPTIONS += $(EXTRA_JTREG_OPTIONS)

# add another new parameter for concurrency
SPECIAL_CONCURRENCY=xxxx

In https://github.com/adoptium/aqa-tests/blob/d5e241df7fe9a7648520efe4ebe1defc5fa5fd7d/openjdk/playlist.xml#L1832, replace $(JTREG_BASIC_OPTIONS) with $(JTREG_BASIC_OPTIONS_WO_EXTRA_OPTS) $(SPECIAL_CONCURRENCY)

Akira1Saitoh commented 1 year ago

Thanks @llxia. I opened https://github.com/adoptium/aqa-tests/pull/4119. Launched the internal Grinder job job/Grinder/29324/ with that change.

llxia commented 1 year ago

Launched the internal Grinder job job/Grinder/29324/ with that change.

Grinder/29324 passed 8 times. It failed because FlowNode was not found at the end, which is unrelated to the change.

Akira1Saitoh commented 1 year ago

Re-launched the job. (job/Grinder/29389/)

Akira1Saitoh commented 1 year ago

Grinder/29389 passed 13 times. It failed due to FlowNode was not found error again.

Akira1Saitoh commented 1 year ago

Re-launched the Grinder. 20 iterations passed. (/job/Grinder/29428/)

tajila commented 1 year ago

@Akira1Saitoh Do you think this will be resolved within 2 weeks?

Akira1Saitoh commented 1 year ago

@tajila https://github.com/adoptium/aqa-tests/pull/4119 was already merged. Since then, I have not seen this failure in several Grinder runs. I believe this issue can be closed now. To address the excessive memory consumption when running jdk_vector_0, I will open another issue. JIT uses up the scratch memory (1.5GB) when compiling some methods in the test.

Akira1Saitoh commented 1 year ago

I opened https://github.com/eclipse-openj9/openj9/issues/16309.

llxia commented 1 year ago

@Akira1Saitoh once https://github.com/eclipse-openj9/openj9/issues/16309 is resolved, will we revert https://github.com/adoptium/aqa-tests/pull/4119?

Akira1Saitoh commented 1 year ago

@llxia yes, but I do not think we can resolve #16309 anytime soon.

vij-singh commented 1 year ago

Can this issue (#15268) be closed now given we have #16309 ?

0xdaryl commented 1 year ago

With the test case change, it sounds like this issue should be closed.

babsingh commented 1 year ago

This issue has been closed. But, the tests are still disabled under this issue.

The tests should be re-enabled if they pass now, and the test excluded label should be removed. If not, then the test excludes should point to the new issue, https://github.com/eclipse-openj9/openj9/issues/16309.

Akira1Saitoh commented 1 year ago

Running jdk_vector_0 with those tests enabled to see if they pass. /job/Grinder/30030/ https://github.com/Akira1Saitoh/openjdk-tests/commit/d1c0a2d64367a3d7a6617ed406ea1e3ce20db624

Akira1Saitoh commented 1 year ago

Several Grinder runs passed. I opened https://github.com/adoptium/aqa-tests/pull/4189 to re-enable those tests. /job/Grinder/30063/ /job/Grinder/30065/ /job/Grinder/30076/

Akira1Saitoh commented 1 year ago

https://github.com/adoptium/aqa-tests/pull/4189 has been merged. The tests are re-enabled now.