sxa opened this issue 3 years ago
NOTE - these are runs on the Fedora docker image, testing after patching and rebooting the server:
Also trying on a couple of X64 docker images (Fedora 33 and Ubuntu 20.04)
NUMA interrogation is failing in Docker
[EDIT: The issue shows up with just `numactl -s` in the container. A resolution is to use `--cap-add=SYS_NICE`, which gives the container access to the CPU scheduling options - see the docker docs for details]
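A minimal reproduction sketch (the image name is illustrative and assumes `numactl` is installed in it):

```sh
# Without SYS_NICE, the NUMA policy syscalls are blocked by docker's default
# capability set, so NUMA interrogation fails inside the container
docker run --rm my-test-image numactl -s

# Granting the capability at container start resolves it
docker run --rm --cap-add=SYS_NICE my-test-image numactl -s
```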
core dump generation is also failing (I've tried starting the container with various options that might help, but to no avail ... so far) - potentially the same as described in https://github.com/AdoptOpenJDK/run-aqa/issues/59
[EDIT: The (host) systems on which core files were not being produced had `|/usr/share/apport/apport %p %s %c %d %P %E` in `/proc/sys/kernel/core_pattern` - changing it to `core` resolves it (but we'll need to make that persistent) - raised https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1817]
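A sketch of the check and one (assumed) sysctl route to making the change persistent:

```sh
# Inspect the current pattern on the host - piping to apport prevents plain core files
cat /proc/sys/kernel/core_pattern

# Switch to plain "core" files; this is lost at the next reboot
echo core | sudo tee /proc/sys/kernel/core_pattern

# One way to make it persistent across reboots (the file name is an assumption)
echo 'kernel.core_pattern=core' | sudo tee /etc/sysctl.d/60-core-pattern.conf
sudo sysctl --system
```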
Also not specific to docker, but we have seen instances of this when `LANG` is not set to `en_US.UTF-8`. It occurs only on OpenJ9 sanity.openjdk on JDK11 and above (not seen on 8 so far):
21:41:41 ACTION: main -- Failed. Execution failed: `main' threw exception: java.util.IllformedLocaleException: Ill-formed language: c.u [at index 0]
21:41:41 REASON: User specified action: run main/othervm -Duser.language.display=ja -Duser.language.format=zh LocaleCategory
21:41:41 TIME: 8.802 seconds
21:41:41 messages:
This will be progressed via https://github.com/AdoptOpenJDK/run-aqa/issues/59
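For reference, a hedged sketch of the locale workaround (the image name is illustrative):

```sh
# Set the locale in the environment before invoking the tests ...
export LANG=en_US.UTF-8

# ... or pass it into the container at start-up and verify it took effect
docker run --rm -e LANG=en_US.UTF-8 fedora:33 locale
```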
I ran a Grinder on testc-packet-fedora33-amd-2 and got:
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress -- https://github.com/AdoptOpenJDK/openjdk-tests.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout:
stderr: fatal: unable to access 'https://github.com/AdoptOpenJDK/openjdk-tests.git/': OpenSSL SSL_connect: Connection reset by peer in connection to github.com:443
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox/203/console
I suppose testc-packet-fedora33-amd-2 is a docker container?
Yes - it's a docker container.
Hmmm, that's a bit odd ... it's also nothing to do with the test if it's failing that early in the process. I've re-run it as 205 and it completed without any fatal failures, so hopefully that won't recur - but if you see any further instances let me know so we can see if it happens regularly.
From https://adoptopenjdk.slack.com/archives/C5219G28G/p1612761729068300, we should check whether the timeout handler added to OpenJ9 openjdk test runs is able to write a System dump in a dockerized environment.
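A quick sanity check for that (a sketch; the image name is illustrative):

```sh
# Both the soft core-file limit and the host's core_pattern gate dump creation;
# --ulimit core=-1 removes the size limit, and core_pattern is shared with the host
docker run --rm --ulimit core=-1 fedora:33 sh -c 'ulimit -c; cat /proc/sys/kernel/core_pattern'
```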
I wonder if https://github.com/eclipse/openj9/issues/12038 is another example of failure in docker environments or not. "AssertionError: Free Physical Memory size cannot be greater than total Physical Memory Size."
Hmmm, interesting thought. Certainly possible, but this is the first I've heard of it. Some of those containers we have are capped in terms of CPU and RAM, which could explain why you wouldn't necessarily be able to replicate locally without doing the same.
sanity.openjdk on JDK 8 (Hotspot) seems to randomly fail for these tests:
java/util/Arrays/TimSortStackSize2.java.TimSortStackSize2
java.lang.OutOfMemoryError: Java heap space
at TimSortStackSize2.createArray(TimSortStackSize2.java:164)
at TimSortStackSize2.doTest(TimSortStackSize2.java:59)
at TimSortStackSize2.main(TimSortStackSize2.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
at java.lang.Thread.run(Thread.java:748)
java/util/ResourceBundle/Bug4168625Test.java.Bug4168625Test
14:10:19 ACTION: main -- Error. Agent communication error: java.io.EOFException; check console log for any additional details
java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java.LFSingleThreadCachingTest
Unexpected exit from test [exit code: 137]
Especially LFSingleThreadCachingTest.java looks like an OOM kill. Would be nice to overlay that failure with the kernel OOM kill logs.
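Exit code 137 is 128+SIGKILL, which is consistent with the kernel OOM killer; a sketch of how the host logs could be overlaid with the test timestamps:

```sh
# Kernel ring buffer with human-readable timestamps, filtered to OOM kills
dmesg -T | grep -iE 'killed process|out of memory'

# Or the journal on systemd hosts, restricted to kernel messages
journalctl -k --since "1 hour ago" | grep -i oom
```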
Above error was on test-docker-fedora33-x64-2 hosted on test-packet-ubuntu2004-amd-1. Those systems were all started with 4 cores and 6GB allocated to them. Re-testing at ~https://ci.adoptopenjdk.net/job/Grinder/7350 (Failed but I'm not sure if it's the same failure)~ Correct test from upstream at https://ci.adoptopenjdk.net/job/Grinder/7351
@smlambert In the log Severin referenced above it gives the Grinder re-run link for the individual test as https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=oracle&BUILD_LIST=openjdk&PLATFORM=x86-64_linux_xl&TARGET=jdk_lang_1, which is clearly wrong as it doesn't reference upstream and the `PLATFORM` has `_xl` in it - is that a bug?
EDIT: https://ci.adoptopenjdk.net/job/Grinder/7353/console passed on a real machine (IBMCLOUD RHEL8) but https://ci.adoptopenjdk.net/job/Grinder/7350/console failed on the machine mentioned above (both with the `jdk_lang_1` target)
Potential resource starvation reported by @lumpfish on build-docker-fedora33-armv8-3 in https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/2002 - I see a "docker day" in my near future ... (Will diagnose using `jdk_time_1`):
06:58:21 TEST RESULT: Error. Program
'/home/jenkins/workspace/Test_openjdk16_hs_extended.openjdk_aarch64_linux/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1006476ms).
At the moment at least some docker images hosted on build-packet-ubuntu1804-armv8-1 (U1804b_2223 in particular, currently running a job) and docker-packet-ubuntu2004-amd-1 (U2004_2224 in particular, currently running a job) are using a lot of CPU, so they potentially need to be properly capped. The failures being seen above may well only be occurring on those systems.
When the systems are quiesced tomorrow (since we're running the weekend pipelines for JDK16 again due to https://github.com/AdoptOpenJDK/ci-jenkins-pipelines/pull/87) I can look at adjusting the capping of the tests.
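A sketch of the kind of capping that could be applied when the executors are restarted (the flags are standard docker options; the container name and image are illustrative, and the values mirror the 4-core/6GB allocation mentioned above):

```sh
# Cap CPU and RAM at start-up so one busy executor cannot starve its neighbours;
# setting memory-swap equal to memory also prevents the container swapping
docker run -d --name test-docker-fedora33-x64-2 \
  --cpus=4 --memory=6g --memory-swap=6g \
  my-test-image
```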
Related to @lumpfish's `jdk_time_1` failure, I have one pass at https://ci.adoptopenjdk.net/job/Grinder/7515/ on build-docker-ubuntu1804-armv8-2, but all other attempts on the machine failed.
OK, I've brought the following offline for now while investigations occur, as some of these have shown problems with `jdk_time_1`:

- the build-docker-*-armv8-* nodes hosted on build-packet-ubuntu1804-armv8-1 and docker-packet-ubuntu2004-intel-1
- build-packet-ubuntu1804-armv8-1

`jdk_time_1` has passed on the alibaba arm node and also on test-docker-fedora-x64-1 (failed at 7531 though), but at least it's not just a recurring problem on all Fedora systems, as it passed at 7506!
The randomly failing sanity.openjdk JDK 8 tests reported above look to be the same issue that's covered in https://github.com/AdoptOpenJDK/openjdk-tests/issues/2310 and are not specific to docker.
With the merging of https://github.com/AdoptOpenJDK/openjdk-tests/pull/2345 I've brought most systems back online - I've left build-docker-fedora33-armv8-5, build-docker-ubuntu1804-5 and build-docker-ubuntu1804-6 offline.
[EDIT: Load on the machine during the nightly testing is sitting at under 16 and there are 64 cores so I have re-enabled these three remaining executors]
@sophia-guo That looks like the tests have a dependency on the `fakeroot` tool, which I wasn't aware we required. Can you supply a Grinder re-run link for that problem, as I'm not sure it'll be specific to docker - we do not have `fakeroot` available on all of our systems at present.
Example run in Grinder: https://ci.adoptopenjdk.net/job/Grinder/1203
@sxa if I log in to the test machine I can run `fakeroot`, which means it is probably installed by default in Linux. aarch64 has the same issue though, for which I will open an issue in infra: https://github.com/adoptium/infrastructure/issues/2291
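A one-liner sketch for verifying the dependency on a given machine:

```sh
# Report whether fakeroot is present and actually works on this host
command -v fakeroot >/dev/null && fakeroot true && echo "fakeroot OK" || echo "fakeroot missing/broken"
```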
On arm JDK11:

- java/beans/PropertyChangeSupport/Test4682386.java.Test4682386
- java/beans/XMLEncoder/Test4631471.java.Test4631471
- java/beans/XMLEncoder/Test4903007.java.Test4903007
- java/beans/XMLEncoder/javax_swing_DefaultCellEditor.java.javax_swing_DefaultCellEditor
- java/beans/XMLEncoder/javax_swing_JTree.java.javax_swing_JTree
- javax/imageio/plugins/shared/ImageWriterCompressionTest.java.ImageWriterCompressionTest

These passed on non-docker machines and failed on docker ones consistently. https://github.com/adoptium/aqa-tests/issues/2989#issuecomment-947114275
https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_arm_linux_testList_2/9/
- java/beans/PropertyEditor/TestFontClassJava.java.TestFontClassJava
- java/beans/PropertyEditor/TestFontClassValue.java.TestFontClassValue
- java/beans/XMLEncoder/Test4631471.java.Test4631471
- java/beans/XMLEncoder/Test4903007.java.Test4903007
- java/beans/XMLEncoder/javax_swing_DefaultCellEditor.java.javax_swing_DefaultCellEditor
- java/beans/XMLEncoder/javax_swing_JTree.java.javax_swing_JTree
- javax/imageio/plugins/shared/ImageWriterCompressionTest.java.ImageWriterCompressionTest
error message:
Stacktrace
Execution failed: `main' threw exception: java.lang.NullPointerException: Cannot load from short array because "sun.awt.FontConfiguration.head" is null
Standard Output
Property class: class java.awt.Font
PropertyEditor class: class com.sun.beans.editors.FontEditor
Standard Error
java.lang.NullPointerException: Cannot load from short array because "sun.awt.FontConfiguration.head" is null
at java.desktop/sun.awt.FontConfiguration.getVersion(FontConfiguration.java:1262)
at java.desktop/sun.awt.FontConfiguration.readFontConfigFile(FontConfiguration.java:224)
https://ci.adoptopenjdk.net/job/Test_openjdk18_hs_extended.openjdk_x86-64_linux_testList_2/26/
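The `sun.awt.FontConfiguration.head` NPE is the classic symptom of a container image with no fonts or fontconfig installed; a hedged sketch of the likely fix for the Fedora-based images (the package names are assumptions):

```sh
# Install fontconfig plus at least one font package inside the image
dnf install -y fontconfig dejavu-sans-fonts

# Verify the JDK will now find a non-empty font set
fc-list | head
```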
This is partially for my own notes, but it needs to be looked at and may also be covered elsewhere. Looks like the DDR stuff (not too surprising) will need some work:

- `testDDR*`, `cmdLineTester` and `jit_hw_2` fail
- `cmdLineTester` tests fail on `-2` and not `-1` - unrelated to docker?
- Others (on initial look - not too deep!) seem ok
Memo to self - how to check for RAM/CPU limits in a container:

- `wc -l /sys/fs/cgroup/cpu,cpuacct/cgroup.procs` (not accurate)
- `cat /sys/fs/cgroup/memory/memory.limit_in_bytes` then divide by 1024 / 1024 / 1024 (or divide by `1073741824`)
- `while true; do clear && uptime && docker stats --no-stream; sleep 60; done`
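A small sketch combining the memory check into one script, with a fallback for cgroup v2 hosts (an assumption; the v1 path above won't exist there):

```sh
#!/bin/sh
# Print the container's memory limit in GiB (cgroup v1), else the v2 value
if [ -f /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
    limit=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
    echo "Memory limit: $((limit / 1073741824)) GiB"
else
    # cgroup v2: "max" means unlimited
    cat /sys/fs/cgroup/memory.max
fi
```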