adoptium / infrastructure

This repo contains all information about machine maintenance.

Problem machines for release #2662

Open Haroon-Khel opened 2 years ago

Haroon-Khel commented 2 years ago

test-docker-fedora34-x64-1 and (newly created) test-docker-fedora34-x64-2 ref https://github.com/adoptium/infrastructure/issues/2631 JDK8

The following tests are failing on both -1 and -2. Links are for -2:

java/nio/file/Files/probeContentType/Basic.java https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5133/console

java/net/Inet6Address/B6206527.java.B6206527 https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5135/console

java/net/ipv6tests/B6521014.java https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5136/console

test-osuosl-centos74-ppc64le-1/ and test-osuosl-centos74-ppc64le-2/ ref https://github.com/adoptium/infrastructure/issues/2625 JDK8

On test-osuosl-centos74-ppc64le-1

sun/security/pkcs11/fips/TestTLS12.java https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5092/console

sun/tools/jinfo/Basic.sh https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5094/console

On test-osuosl-centos74-ppc64le-2

sun/security/pkcs11/fips/TestTLS12.java https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5095/console

~sun/tools/jinfo/Basic.sh~ resolved https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5103/console

test-azure-win2012r2-x64-3 and test-azure-win2019-x64-1 ref https://github.com/adoptium/infrastructure/issues/2645 JDK11

sophia-guo commented 1 year ago

ERROR: Cannot delete workspace :Malformed input or input contains unmappable characters https://github.com/adoptium/infrastructure/issues/2630

sophia-guo commented 1 year ago

test-azure-win2012r2-x64-1

ERROR: Cannot delete workspace :Unable to delete 'D:\jenkins\workspace\Test_openjdk11_hs_sanity.openjdk_x86-64_windows\openjdkbinary\j2sdk-image\lib\modules'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.

The two most recent runs: https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_sanity.openjdk_x86-64_windows/647/console https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_sanity.openjdk_x86-64_windows/647/console

Haroon-Khel commented 1 year ago

That directory is being used by leftover jcmd.exe processes

https://github.com/adoptium/infrastructure/issues/2635 is related. It is surprising to find that this is occurring on a different machine this time

Haroon-Khel commented 1 year ago

sun/tools/jinfo/Basic.sh on the 2 linux ppc64le machines has been resolved, https://github.com/adoptium/infrastructure/issues/2625#issuecomment-1181699740

Haroon-Khel commented 1 year ago

Stewart has added jcmd to the list of processes to kill in https://ci.adoptopenjdk.net/view/Tooling/job/SXA-processCheck/, see https://github.com/adoptium/infrastructure/issues/2635#issuecomment-1184611082
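For anyone reading along, the cleanup idea can be pictured with a small sketch. This is not the SXA-processCheck job itself, whose contents are not shown in this thread. It only assumes that leftover processes are matched by image name and ended with the standard Windows `taskkill` command. `LEFTOVER_IMAGES`, `build_kill_command` and `kill_leftovers` are hypothetical names.

```python
import subprocess

# Hypothetical list of image names to clean up; only jcmd.exe is named in this thread.
LEFTOVER_IMAGES = ["jcmd.exe"]


def build_kill_command(image_name):
    """Return the taskkill argument list that force-ends every process with this image name."""
    return ["taskkill", "/F", "/IM", image_name]


def kill_leftovers(image_names, runner=subprocess.run):
    """Run taskkill once per image name and return the exit codes keyed by name.

    taskkill exits non-zero when no matching process exists, so a non-zero
    code is recorded rather than treated as an error.
    """
    results = {}
    for name in image_names:
        completed = runner(build_kill_command(name), capture_output=True, text=True)
        results[name] = completed.returncode
    return results
```

On a Windows agent, `kill_leftovers(LEFTOVER_IMAGES)` would be called before the workspace cleanup step. The `runner` parameter exists only so the function can be exercised without actually ending processes.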

sophia-guo commented 1 year ago

test-azure-win2012r2-x64-1

https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_sanity.openjdk_x86-64_windows/648/console https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_sanity.openjdk_x86-64_windows/647/console https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_sanity.openjdk_x86-64_windows/646/console

[WS-CLEANUP] Deleting project workspace...
[WS-CLEANUP] Deferred wipeout is disabled by the job configuration...
ERROR: Cannot delete workspace :Unable to delete 'D:\jenkins\workspace\Test_openjdk11_hs_sanity.openjdk_x86-64_windows\openjdkbinary\j2sdk-image\lib\modules'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
[Pipeline] }
[Pipeline] // timeout
[Pipeline] echo
Exception: hudson.AbortException: Cannot delete workspace: Unable to delete 'D:\jenkins\workspace\Test_openjdk11_hs_sanity.openjdk_x86-64_windows\openjdkbinary\j2sdk-image\lib\modules'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.

All three recent jobs were assigned to this machine, and all three failed on it.
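The log shows the workspace cleanup giving up after 3 attempts spaced 0.1 sec apart. As a rough illustration of the same retry pattern with a growing wait between attempts, here is a minimal sketch. It is not the WS-CLEANUP plugin's implementation, and the function name and default values are assumptions chosen to mirror the numbers in the log.

```python
import shutil
import time
from pathlib import Path


def delete_with_retry(path, attempts=3, delay=0.1, backoff=2.0):
    """Try to delete a directory tree, waiting longer after each failed attempt.

    Returns the number of attempts used. Re-raises the last OSError if every
    attempt fails, for example when another process still holds a file open.
    """
    target = Path(path)
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            if target.exists():
                shutil.rmtree(target)
            return attempt
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(wait)
            wait *= backoff
```

Retries only help when the lock is transient. If a live process keeps the file open, as turned out to be the case here, every attempt fails no matter how long the wait.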

sxa commented 1 year ago

Fixed as per https://github.com/adoptium/infrastructure/issues/2209#issuecomment-1185341489

Haroon-Khel commented 1 year ago

Machines that are still problematic:

Any Fedora dockerstatic container, ref https://github.com/adoptium/infrastructure/issues/2631. Fedora containers on https://ci.adoptopenjdk.net/computer/docker-packet-ubuntu2004-intel-1/ pass the IPv6 tests, while those on https://ci.adoptopenjdk.net/computer/docker-packet-ubuntu2004-amd-1/ fail them. The difference needs to be investigated. I can't get java/nio/file/Files/probeContentType/Basic.java to pass on any Fedora container, see https://github.com/adoptium/infrastructure/issues/2631#issuecomment-1185690200

test-osuosl-centos74-ppc64le-1 and -2: sun/tools/jinfo/Basic.sh now passes, but sun/security/pkcs11/fips/TestTLS12.java still fails. See https://github.com/adoptium/infrastructure/issues/2625#issuecomment-1181699740

test-azure-win2012r2-x64-3 and test-azure-win2019-x64-1: see https://github.com/adoptium/infrastructure/issues/2645#issuecomment-1177334175. Failures are intermittent, but there are more failures than passes.

If these issues are not resolved by Monday, I'll take the Jenkins nodes offline for the release.
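One quick way to compare the Fedora containers on the two Docker hosts is to check whether an IPv6 socket can be bound inside each container at all. The sketch below is only a first-pass probe under that assumption. It does not reproduce what the failing jdk_net tests do, and the function name is hypothetical.

```python
import socket


def ipv6_loopback_usable():
    """Return True if an IPv6 TCP socket can be bound to the loopback address ::1."""
    if not socket.has_ipv6:
        # The Python build itself has no IPv6 support.
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.bind(("::1", 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        # Typical when IPv6 is disabled in the kernel or the container network.
        return False
```

If this returns different values on containers hosted on the intel-1 and amd-1 machines, the difference is likely in the host or Docker network configuration rather than in the container image.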

Haroon-Khel commented 1 year ago

I was able to get java/nio/file/Files/probeContentType/Basic.java to pass on our Fedora boxes, see https://github.com/adoptium/infrastructure/issues/2631#issuecomment-1188992683. However, I have not solved the failing IPv6 tests on Fedora containers hosted on https://ci.adoptopenjdk.net/computer/docker-packet-ubuntu2004-amd-1/.

And sun/security/pkcs11/fips/TestTLS12.java continues to fail on test-osuosl-centos74-ppc64le-1 and -2, see https://github.com/adoptium/infrastructure/issues/2625

I have temporarily taken the following nodes offline for this release:

https://ci.adoptopenjdk.net/computer/test-docker-fedora34-x64-1/
https://ci.adoptopenjdk.net/computer/test-docker-fedora34-x64-2/
https://ci.adoptopenjdk.net/computer/test-docker-fedora36-x64-1/
https://ci.adoptopenjdk.net/computer/test-osuosl-centos74-ppc64le-1/
https://ci.adoptopenjdk.net/computer/test-osuosl-centos74-ppc64le-2/
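The thread does not say how the nodes were taken offline (most likely through the Jenkins UI). For scripting the same thing, a minimal sketch could build the request URLs as below. It assumes the Jenkins `toggleOffline` node action with an `offlineMessage` parameter, sent as an authenticated POST. Please verify this against the Jenkins instance before relying on it. The helper name is hypothetical, and the sketch only builds URLs and sends nothing.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the links in this thread.
JENKINS_URL = "https://ci.adoptopenjdk.net"


def toggle_offline_url(node_name, message, base_url=JENKINS_URL):
    """Build the URL for a node's toggleOffline action with a reason attached."""
    query = urlencode({"offlineMessage": message})
    return f"{base_url}/computer/{quote(node_name, safe='')}/toggleOffline?{query}"
```

Note that the action is a toggle: posting it to a node that is already offline would bring it back online, so a script should check the node's current state first.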

Haroon-Khel commented 1 year ago

https://ci.adoptopenjdk.net/computer/test-docker-fedora34-x64-1/
https://ci.adoptopenjdk.net/computer/test-docker-fedora34-x64-2/
https://ci.adoptopenjdk.net/computer/test-docker-fedora36-x64-1/
https://ci.adoptopenjdk.net/computer/test-osuosl-centos74-ppc64le-1/
https://ci.adoptopenjdk.net/computer/test-osuosl-centos74-ppc64le-2/

I've turned these machines back online

sxa commented 1 year ago

@Haroon-Khel Can you give a status update on the systems that were problematic? Have they all now been resolved, or is there still work to do here? I need to know whether this can be closed or whether it needs to move to October.

Haroon-Khel commented 1 year ago

Since sun/security/pkcs11/fips/TestTLS12.java continues to fail on test-osuosl-centos74-ppc64le-1 and -2, this issue should be kept open.

sxa commented 1 year ago

Related: https://github.com/adoptium/infrastructure/issues/2815

Haroon-Khel commented 1 year ago

IPv6 failures on new ppc64le machine, https://github.com/adoptium/infrastructure/issues/2883:
test-docker-ubuntu2204-ppc64le-1
test-docker-debian11-ppc64le-1

Could also affect:
test-docker-ubuntu2204-ppc64le-2
test-docker-debian11-ppc64le-2
test-docker-debian11-ppc64le-3

https://github.com/adoptium/infrastructure/blob/952a0bddf784ddcae519661daf975f1abb693ec4/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/DockerStatic/tasks/main.yml#L6 has run on the machines during setup. Annoyingly, it isn't fixing the problem.

https://github.com/adoptium/infrastructure/issues/2884 affects the same machines

Haroon-Khel commented 1 year ago

ref https://github.com/adoptium/infrastructure/issues/2886

Taking test-docker-centos8-x64-2 offline

Haroon-Khel commented 1 year ago

Taking offline the following machines due to https://github.com/adoptium/infrastructure/issues/2884

test-docker-ubuntu2204-ppc64le-1 test-docker-debian11-ppc64le-1 test-docker-ubuntu2204-ppc64le-2 test-docker-debian11-ppc64le-2

Haroon-Khel commented 1 year ago

test-docker-ubi8-x64-2 and test-docker-fedora35-x64-1 both offline ref https://github.com/adoptium/infrastructure/issues/2882

Haroon-Khel commented 1 year ago

ref https://github.com/adoptium/infrastructure/issues/2885 test-ibmcloud-win2012r2-x64-1 offline

Haroon-Khel commented 1 year ago

These need to be addressed https://github.com/adoptium/adoptium/issues/200#issuecomment-1402131107

Haroon-Khel commented 1 year ago

Closed off https://github.com/adoptium/infrastructure/issues/2882

Haroon-Khel commented 1 year ago

A quick summary

https://ci.adoptium.net/computer/test-docker-centos8-x64-2/ is offline due to https://github.com/adoptium/infrastructure/issues/2886

https://ci.adoptium.net/computer/test-docker-ubuntu2204-ppc64le-1/ and https://ci.adoptium.net/computer/test-docker-ubuntu2204-ppc64le-2/ are offline due to failing ipv6 tests, https://github.com/adoptium/infrastructure/issues/2949 and https://github.com/adoptium/infrastructure/issues/2884

https://ci.adoptium.net/computer/test-docker-ubuntu2204-x64-2 is offline due to https://github.com/adoptium/infrastructure/issues/2894#issuecomment-1467953191

I've kept https://ci.adoptium.net/computer/test-docker-ubi8-x64-1 and https://ci.adoptium.net/computer/test-docker-fedora35-x64-1 online, as only one jdk_net test fails on both: https://github.com/adoptium/infrastructure/issues/3010

https://ci.adoptium.net/computer/test-ibmcloud-win2012r2-x64-1/ is offline due to https://github.com/adoptium/infrastructure/issues/2885#issuecomment-1385912369

sxa commented 1 year ago

@Haroon-Khel There seem to be quite a few test-docker machines that are in Jenkins but not live, for example https://ci.adoptium.net/manage/computer/test%2Ddocker%2Dfedora37%2Darmv8%2D1/. Should they be removed from Jenkins now? That one in particular seems to be the latest Fedora version, so I'm a little surprised if it has been removed.

sxa commented 10 months ago

s390x issues:

sxa commented 10 months ago

Also we've been having some inconsistencies on test issues in https://github.com/adoptium/infrastructure/issues/2536 across different mac machines.

sxa commented 8 months ago

extended.perf dacapo-xalan-0 success varies depending on machine: https://github.com/adoptium/aqa-tests/issues/3122#issuecomment-1787036636

Haroon-Khel commented 7 months ago

Summary of AQA triage on s390x jdk-21.0.1+12.1 https://github.com/temurin-compliance/temurin-compliance/issues/431#issuecomment-1810092968 (ongoing)

MiniMix_aot_5m_0, DBBLoadTest_5m_0, DBBLoadTest_5m_1 intermittently pass on all machines, but fail consistently on test-marist-sles12-s390x-2 and test-marist-sles15-s390x-2

java/foreign/TestLargeSegmentCopy.java from jdk_foreign fails on test-marist-rhel8-s390x-2, test-marist-rhel7-s390x-2, test-marist-sles15-s390x-2

The following sanity system tests fail intermittently on all machines, but seem to fail consistently on test-marist-sles15-s390x-2

TestJlmRemoteClassAuth_1
TestJlmRemoteClassAuth_0
TestJlmRemoteClassNoAuth_0
TestJlmRemoteClassNoAuth_1
TestJlmRemoteMemoryAuth_0
TestJlmRemoteMemoryAuth_1
TestJlmRemoteMemoryNoAuth_0
TestJlmRemoteMemoryNoAuth_1
TestJlmRemoteNotifierProxyAuth_0
TestJlmRemoteNotifierProxyAuth_1
TestJlmRemoteThreadAuth_0
TestJlmRemoteThreadAuth_1
TestJlmRemoteThreadNoAuth_0
TestJlmRemoteThreadNoAuth_1
NioLoadTest_5m_0 
NioLoadTest_5m_1
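The distinction made above, between tests that are intermittent everywhere and tests that fail every time on one machine, can be computed from raw run results. This is a minimal sketch assuming the results are available as (test, machine, passed) tuples. That input format and the function name are hypothetical and are not how the AQA triage data is actually stored.

```python
from collections import defaultdict


def consistently_failing_machines(results):
    """Map each test to the machines where it never passed, given it passed somewhere else.

    `results` is an iterable of (test_name, machine_name, passed) tuples.
    """
    outcomes = defaultdict(lambda: defaultdict(list))
    for test, machine, passed in results:
        outcomes[test][machine].append(passed)

    flagged = {}
    for test, by_machine in outcomes.items():
        passed_somewhere = any(any(runs) for runs in by_machine.values())
        if not passed_somewhere:
            # Fails everywhere: more likely a test or product issue than one machine.
            continue
        bad = sorted(m for m, runs in by_machine.items() if not any(runs))
        if bad:
            flagged[test] = bad
    return flagged
```

A machine with only one or two runs will be flagged easily, so the output is a starting point for reruns rather than proof of a bad machine.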

The remaining failures below, from extended openjdk, are being run on all machines (grinders 8060 to 8067)

jdk_other_0 jdk_net_0 jdk_net_1 jdk_nio_0 jdk_nio_1 jdk_security3_0 jdk_security3_1 jdk_management_0 jdk_jmx_1 jdk_tools_0 jdk_tools_1 jdk_jfr_0 jdk_rmi_0 jdk_jdi_0

| Grinder | Machine | Time | Status |
|---|---|---|---|
| 8060 | test-ubuntu2004-1 | 18h32 | |
| 8061 | test-sles15-2 | ABORTED after 40h (jdk_security_x = 7h each) | Rerun 8077 (next line) |
| 8077 | test-sles15-2 | No jdk_security_x | 345 failed [*] |
| 8062 | test-rhel7-2 | ABORTED after 40h (jdk_security_x = 7h each) | Rerun 8078 (next line) |
| 8078 | test-rhel7-2 | No jdk_security_x | 340 failures [*] |
| 8063 | test-ubuntu2204-1 | 28 hours | 14 failures (mostly timeouts). Re-run failed targets: 13 failures inc. multicast |
| 8064 | docker-sles12-1 | 17h11 | 1 fail: com.sun.jdi.FinalizerTest (re-run jdk_jdi_0 - same) |
| 8065 | test-rhel8-2 | 15h25 | 2 failures, both in java.net.HttpClient (re-run jdk_net-0/1 - 1 fail UdpSocket) |
| 8066 | test-sles12-2 | ABORTED after 40h (jdk_security_x = 7h each) | Rerun 8079 (next line) |
| 8079 | test-sles12-2 | No jdk_security_x | 345 failures [*] |
| 8067 | docker-sles15-1 | 17h09 | 1 fail: sun.security.ssl.SSLSocketImpl (re-run jdk_security3_0: PASS) |

[*] The 340/345 failing tests include many that fail with something similar to this: Exception creating connection to: 148.100.74.92; nested exception is: java.net.NoRouteToHostException: No route to host
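A `NoRouteToHostException` usually points at the network path from the machine, not at the test. A small probe like the sketch below can be run from each machine to compare them. The port is not given in the thread, so it is left as a parameter, and the function name is hypothetical.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt a TCP connection and return a short description of the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        # Covers "No route to host" (EHOSTUNREACH) and similar network errors.
        return f"error: {exc.strerror or exc}"
```

A "refused" result still means the host is reachable, whereas a "No route to host" error from only some machines would suggest a routing or firewall difference between them.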

jiekang commented 7 months ago

Data from the October CPU AQA triage can be found here: https://docs.google.com/spreadsheets/d/16vAQvYzL_-azDoD5OhQ6lObD3-suJwqKfjtABuWoIkc/edit#gid=1601438678

This has a summary sheet, and a sheet for each JDK version with a list of: suite failures, action taken, and, if applicable, problematic machine and failure type.

This list should be used to help drive individual actions to improve test infrastructure and reduce the number of re-runs due to machine configuration related issues. The rows that have a 'Bad Machine' and 'Failure Type' listed should be investigated first.
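As a sketch of that prioritisation, the rows to look at first could be pulled out of a CSV export of one sheet as shown below. Only the 'Bad Machine' and 'Failure Type' column names come from the description above. The CSV export step, any other columns, and the function name are assumptions.

```python
import csv
import io


def rows_to_investigate_first(csv_text):
    """Return the rows whose 'Bad Machine' and 'Failure Type' cells are both filled in."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if (row.get("Bad Machine") or "").strip() and (row.get("Failure Type") or "").strip()
    ]
```

Grouping the returned rows by 'Bad Machine' afterwards would show which machines account for the most re-runs.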

There is also a list of 'To Investigate' topics in each JDK Version sheet that may not necessarily be machine configuration issues, but look promising to me to understand and resolve. When I get more cycles, I intend to open separate, individual issues for these in the appropriate repos.

Haroon-Khel commented 6 months ago

JDK17

test-docker-ubuntu2004-armv8l-3

Installed fontconfig, rerunning https://ci.adoptium.net/view/Test_grinder/job/Grinder/8281/console. Passes ✅

Need to install fontconfig everywhere.
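To find which machines still lack fontconfig, one cheap check is whether its command-line tools are on the PATH. This is a minimal sketch assuming that the presence of `fc-list` is an acceptable proxy for the package being installed. The function name is hypothetical.

```python
import shutil


def missing_commands(commands, which=shutil.which):
    """Return the commands from `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]
```

Running `missing_commands(["fc-list"])` on each test machine returns an empty list where the tool is present. The `which` parameter is only there so the check can be tested without touching the real PATH.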

test-docker-ubuntu2010-armv8l-2

Unable to install fontconfig on Ubuntu 20.10:

Err:1 http://ports.ubuntu.com/ubuntu-ports groovy/main arm64 fonts-dejavu-core all 2.37-2
  404  Not Found [IP: 185.125.190.39 80]
E: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/pool/main/f/fonts-dejavu/fonts-dejavu-core_2.37-2_all.deb  404  Not Found [IP: 185.125.190.39 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
...
root@93d2b4e13a22:~# apt-get update
Ign:1 http://ports.ubuntu.com/ubuntu-ports groovy InRelease
Ign:2 http://ports.ubuntu.com/ubuntu-ports groovy-updates InRelease
Ign:3 http://ports.ubuntu.com/ubuntu-ports groovy-backports InRelease
Ign:4 http://ports.ubuntu.com/ubuntu-ports groovy-security InRelease
Err:5 http://ports.ubuntu.com/ubuntu-ports groovy Release
  404  Not Found [IP: 185.125.190.39 80]
Err:6 http://ports.ubuntu.com/ubuntu-ports groovy-updates Release
  404  Not Found [IP: 185.125.190.39 80]
Err:7 http://ports.ubuntu.com/ubuntu-ports groovy-backports Release
  404  Not Found [IP: 185.125.190.39 80]
Err:8 http://ports.ubuntu.com/ubuntu-ports groovy-security Release
  404  Not Found [IP: 185.125.190.39 80]
Reading package lists... Done
E: The repository 'http://ports.ubuntu.com/ubuntu-ports groovy Release' no longer has a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: The repository 'http://ports.ubuntu.com/ubuntu-ports groovy-updates Release' no longer has a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: The repository 'http://ports.ubuntu.com/ubuntu-ports groovy-backports Release' no longer has a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: The repository 'http://ports.ubuntu.com/ubuntu-ports groovy-security Release' no longer has a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

It looks like the repo is no longer there, likely because Ubuntu 20.10 is EOL.

Update: This machine has been replaced with https://ci.adoptium.net/computer/test-docker-ubuntu2310-armv8l-1/. AQA test pipeline running on this machine: https://ci.adoptium.net/job/AQA_Test_Pipeline/202/console

test-docker-sles12-s390x-1

Installed fontconfig-devel, rerunning https://ci.adoptium.net/view/Test_grinder/job/Grinder/8293/

test-marist-ubuntu2204-s390x-1

test-docker-fedora33-ppc64le-1

test-skytap-ubuntu2004-ppc64le-1

Haroon-Khel commented 6 months ago

JDK21

test-docker-centos8-x64-1

test-docker-debian11-ppc64le-2

test-docker-debian11-ppc64le-1

sxa commented 6 months ago

List of likely machine-related tests which I'm going to stop bumping between iterations so we can track them based on their attachment to this issue:

RHEL/CentOS*:

AIX:

Linux/s390x:

sxa commented 5 months ago

List of test failures on JDK8/arm32 (at a minimum) including the perf test suites which are failing in the containerised environments on the arm64 hosts, but are ok on the two physical ODROID machines: https://github.com/adoptium/infrastructure/issues/3043

sxa commented 2 months ago

Stuff identified during April 2024 dry runs: