adoptium / aqa-tests

Home of test infrastructure for Adoptium builds
https://adoptium.net/aqavit
Apache License 2.0

Identify which tests seem unstable in docker containers #2138

Open sxa opened 3 years ago

sxa commented 3 years ago

This is partially for my own notes, but it needs to be looked at, and may also be covered elsewhere. Looks like the DDR stuff (not too surprising) will need some work.

Others (on initial look - not too deep!) seem ok.

Memo to self - how to check for RAM/CPU limits in a container:
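A rough sketch of one way to do this (not the original notes): read the cgroup files that the container sees. The paths below are the usual cgroup v2 and v1 locations and are assumptions about how the host is set up; the helper names are made up for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumed mount point; some hosts lay out cgroups differently.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_memory_limit(raw: str) -> Optional[int]:
    """Return the memory limit in bytes, or None when no limit is set."""
    raw = raw.strip()
    if raw == "max":  # cgroup v2 spelling of "unlimited"
        return None
    value = int(raw)
    # cgroup v1 reports a huge number (close to 2**63) when unlimited.
    return None if value >= 2**60 else value


def parse_cpu_max(raw: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max ("<quota> <period>") into a number of CPUs."""
    quota, period = raw.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_cpu_v1(quota_raw: str, period_raw: str) -> Optional[float]:
    """Parse cgroup v1 cfs quota/period values; a quota of -1 means unlimited."""
    quota = int(quota_raw)
    if quota < 0:
        return None
    return quota / int(period_raw)


def read_text(path: Path) -> Optional[str]:
    """Return the file content, or None if the file is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return None


def container_limits() -> Tuple[Optional[int], Optional[float]]:
    """Return (memory limit in bytes, CPU limit in cores); None means not found or unlimited."""
    mem_raw = read_text(CGROUP_ROOT / "memory.max")
    if mem_raw is None:
        mem_raw = read_text(CGROUP_ROOT / "memory" / "memory.limit_in_bytes")
    mem = parse_memory_limit(mem_raw) if mem_raw else None

    cpu = None
    cpu_raw = read_text(CGROUP_ROOT / "cpu.max")
    if cpu_raw:
        cpu = parse_cpu_max(cpu_raw)
    else:
        quota = read_text(CGROUP_ROOT / "cpu" / "cpu.cfs_quota_us")
        period = read_text(CGROUP_ROOT / "cpu" / "cpu.cfs_period_us")
        if quota and period:
            cpu = parse_cpu_v1(quota, period)
    return mem, cpu


if __name__ == "__main__":
    print(container_limits())
```

A result of None for either value means the file was not found or no cap is set, so it is worth checking which of the two applies before drawing conclusions.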

sxa commented 3 years ago

NOTE - runs on the Fedora docker image testing after patching and rebooting the server:

sxa commented 3 years ago

Also trying on a couple of X64 docker images (Fedora 33 and Ubuntu 20.04)

sxa commented 3 years ago

NUMA interrogation is failing in Docker

[EDIT: Issue shows up with just numactl -s in the container. A resolution is to use --cap-add=sys_nice, which gives the container access to the CPU scheduling options - see docker docs for details]
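To confirm from inside a container whether the capability is actually present, one option is to decode the CapEff mask in /proc/self/status. This is only a sketch: the function name is made up, and the bit number 23 for CAP_SYS_NICE is taken from the Linux capability header.

```python
CAP_SYS_NICE = 23  # bit number of CAP_SYS_NICE in <linux/capability.h>


def has_capability(status_text: str, cap_bit: int = CAP_SYS_NICE) -> bool:
    """Check a capability bit in the CapEff line of /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)  # the mask is printed as hex
            return bool(mask & (1 << cap_bit))
    return False  # no CapEff line found


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as status:
            print("CAP_SYS_NICE present:", has_capability(status.read()))
    except OSError:
        print("/proc/self/status is not readable on this system")
```

If this prints False in a container where numactl -s fails, that would be consistent with the missing capability being the cause, though it does not prove it.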

sxa commented 3 years ago

core dump generation is also failing (I've tried starting the container with various options that might help but to no avail ... so far) ... potentially same as described in https://github.com/AdoptOpenJDK/run-aqa/issues/59

[EDIT: The (host) systems on which core files were not being produced had |/usr/share/apport/apport %p %s %c %d %P %E in /proc/sys/kernel/core_pattern - changing it to core resolves it (but we'll need to make that persistent) - raised https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1817]
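A small check along these lines could flag hosts that still pipe core dumps to a helper program. It is a sketch only; the function name is made up, and it relies on the kernel convention that a core_pattern starting with | sends the dump to a program instead of writing a file.

```python
CORE_PATTERN_FILE = "/proc/sys/kernel/core_pattern"


def core_pattern_is_piped(pattern: str) -> bool:
    """True when the kernel hands core dumps to a program rather than a file."""
    return pattern.lstrip().startswith("|")


if __name__ == "__main__":
    try:
        with open(CORE_PATTERN_FILE) as handle:
            pattern = handle.read().strip()
    except OSError:
        print("core_pattern is not readable on this system")
    else:
        state = "piped to a helper" if core_pattern_is_piped(pattern) else "written to a file"
        print(f"core_pattern={pattern!r}: {state}")
```

Since containers share the host kernel, the value read inside a container should reflect the host setting, which is why the change has to be made (and persisted) on the host.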

sxa commented 3 years ago

Also not specific to docker, but we have seen instances of this when LANG is not set to en_US.UTF-8. It occurs only on OpenJ9 sanity.openjdk on JDK11 and above (not seen on 8 so far):

21:41:41  ACTION: main -- Failed. Execution failed: `main' threw exception: java.util.IllformedLocaleException: Ill-formed language: c.u [at index 0]
21:41:41  REASON: User specified action: run main/othervm -Duser.language.display=ja -Duser.language.format=zh LocaleCategory 
21:41:41  TIME:   8.802 seconds
21:41:41  messages:

This will be progressed via https://github.com/AdoptOpenJDK/run-aqa/issues/59
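A pre-flight check for the locale setting could look something like the sketch below. The function name is made up, and it only reports whether LANG matches the value mentioned above; it makes no claim about why other values trigger the failure.

```python
import os
from typing import Mapping

EXPECTED_LANG = "en_US.UTF-8"  # the value under which the failure has not been seen


def lang_matches(env: Mapping[str, str], expected: str = EXPECTED_LANG) -> bool:
    """True when LANG is set in the given environment and equals the expected locale."""
    return env.get("LANG") == expected


if __name__ == "__main__":
    if lang_matches(os.environ):
        print("LANG is", EXPECTED_LANG)
    else:
        print("LANG is", os.environ.get("LANG"), "- LocaleCategory may fail here")
```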

sophia-guo commented 3 years ago

Grinder on testc-packet-fedora33-amd-2 and got

ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress -- https://github.com/AdoptOpenJDK/openjdk-tests.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/AdoptOpenJDK/openjdk-tests.git/': OpenSSL SSL_connect: Connection reset by peer in connection to github.com:443 

https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox/203/console

Suppose testc-packet-fedora33-amd-2 is one docker container?

sxa commented 3 years ago

Suppose testc-packet-fedora33-amd-2 is one docker container?

Yes - it's a docker container.

Hmmm that's a bit odd ... It's also nothing to do with the test if it's failing that early in the process. I've re-run it as 205 and it completed without any fatal failures so hopefully that won't occur, but if you see any further instances let me know so we can see if it happens regularly.

smlambert commented 3 years ago

From https://adoptopenjdk.slack.com/archives/C5219G28G/p1612761729068300, we should check whether the timeouthandler added to openj9 openjdk test runs is able to write a System dump in a dockerized environment.

knn-k commented 3 years ago

I wonder if https://github.com/eclipse/openj9/issues/12038 is another example of failure in docker environments or not. "AssertionError: Free Physical Memory size cannot be greater than total Physical Memory Size."

sxa commented 3 years ago

I wonder if https://github.com/eclipse/openj9/issues/12038 is another example of failure in docker environments or not. "AssertionError: Free Physical Memory size cannot be greater than total Physical Memory Size."

Hmmm interesting thought. Certainly possible, but this is the first I've heard of it. Some of those containers we have are capped in terms of CPU and RAM, which could explain why you wouldn't necessarily be able to replicate locally without doing the same.

jerboaa commented 3 years ago

sanity.openjdk on JDK 8 (Hotspot) seems to randomly fail for these tests:

java/util/Arrays/TimSortStackSize2.java.TimSortStackSize2
java.lang.OutOfMemoryError: Java heap space
    at TimSortStackSize2.createArray(TimSortStackSize2.java:164)
    at TimSortStackSize2.doTest(TimSortStackSize2.java:59)
    at TimSortStackSize2.main(TimSortStackSize2.java:43)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
    at java.lang.Thread.run(Thread.java:748)

java/util/ResourceBundle/Bug4168625Test.java.Bug4168625Test 
14:10:19  ACTION: main -- Error. Agent communication error: java.io.EOFException; check console log for any additional details

java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java.LFSingleThreadCachingTest 
Unexpected exit from test [exit code: 137]

See: https://ci.adoptopenjdk.net/view/Test_upstream/job/Test_openjdk8_hs_sanity.openjdk_x86-64_linux_upstream/75/

Especially LFSingleThreadCachingTest.java looks like an OOM kill. Would be nice to overlay that failure with the kernel OOM kill logs.
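For reference, exit code 137 follows the common shell convention of 128 plus the signal number, so it decodes to signal 9 (SIGKILL). The sketch below shows the arithmetic; the function name is made up, and SIGKILL on its own does not prove the kernel OOM killer was the sender, which is why the kernel logs are still needed.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a 128+N exit code into the name of the signal that ended the process."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit code {code} (no matching signal)"
    return f"exit code {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```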

sxa commented 3 years ago

Above error was on test-docker-fedora33-x64-2 hosted on test-packet-ubuntu2004-amd-1. Those systems were all started with 4 cores and 6GB allocated to them. Re-testing at ~https://ci.adoptopenjdk.net/job/Grinder/7350 (Failed but I'm not sure if it's the same failure)~ Correct test from upstream at https://ci.adoptopenjdk.net/job/Grinder/7351

@smlambert In the log Severin referenced above it gives the Grinder re-run link for the individual test as https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=oracle&BUILD_LIST=openjdk&PLATFORM=x86-64_linux_xl&TARGET=jdk_lang_1 which is clearly wrong as it doesn't reference upstream and the PLATFORM has _xl in it - is that a bug?

EDIT: https://ci.adoptopenjdk.net/job/Grinder/7353/console passed on a real machine (IBMCLOUD RHEL8) but https://ci.adoptopenjdk.net/job/Grinder/7350/console failed on the machine mentioned above (both jdk_lang_1 target)

sxa commented 3 years ago

Potential resource starvation reported by @lumpfish on build-docker-fedora33-armv8-3 in https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/2002 - I see a "docker day" in my near future ... (Will diagnose using jdk_time_1):

06:58:21 TEST RESULT: Error. Program `/home/jenkins/workspace/Test_openjdk16_hs_extended.openjdk_aarch64_linux/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1006476ms).

sxa commented 3 years ago

At the moment at least some docker images hosted on build-packet-ubuntu1804-armv8-1 (U1804b_2223 in particular, with this job currently running) and docker-packet-ubuntu2004-amd-1 (U2004_2224 in particular, with this job currently running) are using a lot of CPU, so they potentially need to be properly capped. The failures being seen above may well only be occurring on those systems.

When the systems are quiesced tomorrow (since we're running the weekend pipelines for JDK16 again due to https://github.com/AdoptOpenJDK/ci-jenkins-pipelines/pull/87) I can look at adjusting the capping of the tests.

Related to @lumpfish's jdk_time_1 failure, I have one pass at https://ci.adoptopenjdk.net/job/Grinder/7515/ on build-docker-ubuntu1804-armv8-2 but all other attempts on the machine failed.

sxa commented 3 years ago

OK, I've brought the following offline for now while investigations occur, as some of these have shown problems with jdk_time_1: build-docker--armv8- nodes hosted on build-packet-ubuntu1804-armv8-1 and docker-packet-ubuntu2004-intel-1

jdk_time_1 has passed on the alibaba arm node and also on test-docker-fedora-x64-1 (it failed at 7531 though, but at least it's not a recurring problem on all Fedora systems as it passed at 7506!)

sxa commented 3 years ago

sanity.openjdk on JDK 8 (Hotspot) seems to randomly fail for these tests:

java/util/Arrays/TimSortStackSize2.java.TimSortStackSize2
java.lang.OutOfMemoryError: Java heap space
  at TimSortStackSize2.createArray(TimSortStackSize2.java:164)
  at TimSortStackSize2.doTest(TimSortStackSize2.java:59)
  at TimSortStackSize2.main(TimSortStackSize2.java:43)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
  at java.lang.Thread.run(Thread.java:748)

java/util/ResourceBundle/Bug4168625Test.java.Bug4168625Test 
14:10:19  ACTION: main -- Error. Agent communication error: java.io.EOFException; check console log for any additional details

java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java.LFSingleThreadCachingTest 
Unexpected exit from test [exit code: 137]

See: https://ci.adoptopenjdk.net/view/Test_upstream/job/Test_openjdk8_hs_sanity.openjdk_x86-64_linux_upstream/75/

Especially LFSingleThreadCachingTest.java looks like an OOM kill. Would be nice to overlay that failure with the kernel OOM kill logs.

This looks to be the same issue that's covered in https://github.com/AdoptOpenJDK/openjdk-tests/issues/2310 and not specific to docker

sxa commented 3 years ago

With the merging of https://github.com/AdoptOpenJDK/openjdk-tests/pull/2345 I've brought most systems back online - I've left build-docker-fedora33-armv8-5, build-docker-ubuntu1804-5 and build-docker-ubuntu1804-6 offline.

[EDIT: Load on the machine during the nightly testing is sitting at under 16 and there are 64 cores so I have re-enabled these three remaining executors]
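The reasoning here is simply load average divided by core count: a load of 16 on 64 cores is 0.25 per core, so there is headroom. A sketch of that check is below; the function name is made up and the script part assumes a Unix-like host where os.getloadavg() is available.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalise a load average by the number of cores on the machine."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


if __name__ == "__main__":
    try:
        one_minute, _, _ = os.getloadavg()
    except (OSError, AttributeError):
        print("load average is not available on this platform")
    else:
        cores = os.cpu_count() or 1
        print(f"load per core: {load_per_core(one_minute, cores):.2f}")
```

Note that inside a capped container the host load average and core count may not reflect what the container is actually allowed to use, so this is best run on the host.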

sophia-guo commented 3 years ago

Another one https://github.com/adoptium/adoptium/issues/63#issuecomment-894501202

sxa commented 3 years ago

@sophia-guo That looks like the tests have a dependency on the fakeroot tool which I wasn't aware we required. Can you supply a Grinder re-run link for that problem, as I'm not sure it'll be specific to docker - we do not have fakeroot available on all of our systems at present.

smlambert commented 3 years ago

Example run in Grinder: https://ci.adoptopenjdk.net/job/Grinder/1203

Rerun in Grinder on same machine link

sophia-guo commented 3 years ago

@sxa If I log in to the test machine I can run fakeroot, which probably means it is installed by default on Linux. aarch64 has the same issue though, so I will open an issue in infra for it. https://github.com/adoptium/infrastructure/issues/2291
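A quick way to check each machine for the tool without logging in interactively might be something like this sketch. The function name is made up, and it only looks at the PATH of the user running it, which may differ from the PATH the Jenkins agent uses.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if it is absent."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_tool("fakeroot")
    print("fakeroot:", location if location else "not found on PATH")
```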

sophia-guo commented 2 years ago

On arm JDK11 the following tests consistently passed on non-docker machines and failed on docker ones:

java/beans/PropertyChangeSupport/Test4682386.java.Test4682386
java/beans/XMLEncoder/Test4631471.java.Test4631471
java/beans/XMLEncoder/Test4903007.java.Test4903007
java/beans/XMLEncoder/javax_swing_DefaultCellEditor.java.javax_swing_DefaultCellEditor
java/beans/XMLEncoder/javax_swing_JTree.java.javax_swing_JTree
javax/imageio/plugins/shared/ImageWriterCompressionTest.java.ImageWriterCompressionTest

https://github.com/adoptium/aqa-tests/issues/2989#issuecomment-947114275

https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_arm_linux_testList_2/9/

sophia-guo commented 2 years ago

java/beans/PropertyEditor/TestFontClassJava.java.TestFontClassJava
java/beans/PropertyEditor/TestFontClassValue.java.TestFontClassValue
java/beans/XMLEncoder/Test4631471.java.Test4631471
java/beans/XMLEncoder/Test4903007.java.Test4903007
java/beans/XMLEncoder/javax_swing_DefaultCellEditor.java.javax_swing_DefaultCellEditor
java/beans/XMLEncoder/javax_swing_JTree.java.javax_swing_JTree
javax/imageio/plugins/shared/ImageWriterCompressionTest.java.ImageWriterCompressionTest

error message:

Stacktrace
Execution failed: `main' threw exception: java.lang.NullPointerException: Cannot load from short array because "sun.awt.FontConfiguration.head" is null    
Standard Output
Property class: class java.awt.Font
PropertyEditor class: class com.sun.beans.editors.FontEditor

Standard Error
java.lang.NullPointerException: Cannot load from short array because "sun.awt.FontConfiguration.head" is null
    at java.desktop/sun.awt.FontConfiguration.getVersion(FontConfiguration.java:1262)
    at java.desktop/sun.awt.FontConfiguration.readFontConfigFile(FontConfiguration.java:224)

https://ci.adoptopenjdk.net/job/Test_openjdk18_hs_extended.openjdk_x86-64_linux_testList_2/26/
