adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Look at moving macstadium machines to orka #2536

Open sxa opened 2 years ago

sxa commented 2 years ago

I need to request a new machine:

Please explain what this machine is needed for:

sxa commented 1 year ago

Following the discussion a few weeks ago, where the action to progress this fell to me, George and I will look at this migration together.

sxa commented 1 year ago

Related: https://github.com/adoptium/temurin-build/issues/3354

sxa commented 1 year ago

Our orka systems have been deprovisioned due to inactivity - currently having negotiations to determine a way forward.

sxa commented 1 year ago

Discussions with MacStadium have indicated that an orka-based solution (which would not be sponsored at present) would be approximately twice the cost of the static systems which we have at present so we are looking at alternative options.

Here is a breakdown of the number of systems and their types we have at macstadium:

| Use | x64 | aarch64 |
| --- | --- | --- |
| Build | 2xG3 (4core) | 2xG5G |
| Test | 6xG3B (4core), 1xG4B (6core) | 2xG5A |
| TCK | 2xC3D (sml), 1xG4D (lge) | 2xG5E |

So that's a total of 4+9+5 = 18 systems. We currently have two hosted with MacInCloud, with a potential option to increase that, particularly for x64 capacity.
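The total above can be cross-checked with a quick tally. This is only a minimal sketch: the counts are copied from the breakdown above, while the dictionary layout and variable names are illustrative and not part of any Adoptium tooling.

```python
# Machine counts copied from the MacStadium breakdown above.
# Keys are (use, architecture); the structure itself is purely illustrative.
inventory = {
    ("Build", "x64"): {"G3": 2},
    ("Build", "aarch64"): {"G5G": 2},
    ("Test", "x64"): {"G3B": 6, "G4B": 1},
    ("Test", "aarch64"): {"G5A": 2},
    ("TCK", "x64"): {"C3D": 2, "G4D": 1},
    ("TCK", "aarch64"): {"G5E": 2},
}

# Sum the machines for each use (Build / Test / TCK) across both architectures.
per_use = {}
for (use, _arch), models in inventory.items():
    per_use[use] = per_use.get(use, 0) + sum(models.values())

total = sum(per_use.values())
print(per_use)  # {'Build': 4, 'Test': 9, 'TCK': 5}
print(total)    # 18
```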

sxa commented 1 year ago

Looking at the performance of various systems, here are some runs of the JDK8/x64 extended.openjdk suite on the different machines:

| System | Time | Failures? |
| --- | --- | --- |
| TC G4D [*] | 2h28 | 17 (hostname issues) 3702 |
| TC G3D - i5/2C/8G [*] | 6h51 | Same hostname issues as G4D 3701 |
| G3B - i7/4C/16G | 3h03 | All passed |
| G3B - i7/4C/16G [*] | 3h38 | Three failures in java.nio |
| aarch64 (Rosetta) [*] | 2h24 | 14 failures |
| MacInCloud i7-8700B | 3h15 | 1 failure in com/sun/jndi/ldap |
| G4B i7/6C/32G [*] | 1h46 | 10 failures in net/nio/rmi |

[*] - These machines have not typically been used for running the openjdk suites in the past so these may be newly visible failures. The second G3B machine was one of the build machines rather than one tagged for test.

So with the exception of the second line, the performance of these for running the full extended.openjdk suite looks reasonable. It should be noted that it is between 2x and 2.5x slower to run the same tests on JDK21 so around 8h for a G3B and 3h30 for a G4B.
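To make the JDK21 estimate easier to reproduce, here is a small sketch that applies the 2x to 2.5x slowdown to the JDK8 timings in the table. The helper names are hypothetical and the only inputs are the run times quoted above.

```python
def parse_hm(text):
    """Convert a duration such as '3h03' into minutes."""
    hours, minutes = text.split("h")
    return int(hours) * 60 + int(minutes)


def format_hm(minutes):
    """Format a number of minutes as e.g. '6h06' (truncated to whole minutes)."""
    whole = int(minutes)
    return f"{whole // 60}h{whole % 60:02d}"


def jdk21_estimate(jdk8_time, low=2.0, high=2.5):
    """Return the (low, high) projected JDK21 run time for a JDK8 run time."""
    base = parse_hm(jdk8_time)
    return format_hm(base * low), format_hm(base * high)


# JDK8 timings taken from the table above.
for system, jdk8_time in [("G3B", "3h03"), ("G3B (build machine)", "3h38"), ("G4B", "1h46")]:
    print(system, jdk21_estimate(jdk8_time))
```

For the G4B this gives a range of 3h32 to 4h25, so the 3h30 figure corresponds to the 2x end of the range.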

sxa commented 1 year ago

Some other pieces of note:

sxa commented 1 year ago

Noting that JDK8 will not build on macos12 with Xcode 13:

checking for xcodebuild... /usr/bin/xcodebuild
configure: error: Xcode 6, 9-12 is required to build JDK 8, the version found was 13.1. Use --with-xcode-path to specify the location of Xcode or make Xcode active by using xcode-select.
No configurations found for /Users/jenkins/sxa/temurin-build/build-farm/workspace/build/src/! Please run configure to create a configuration.
Makefile:55: *** Cannot continue.  Stop.
OpenJDK make failed, archiving make failed logs

If I try a cross-compile from macos11/aarch64 with Xcode 12, I need to make a couple of other changes. It can be made to attempt a build by adjusting mac.sh to ensure xcode-select -switch / is run, and by using --openjdk-target=x86_64-apple-darwin in the configure args. However, for JDK8 the build fails with some more errors:

error: use of undeclared identifier 'finite'; did you mean 'isfinite'?

Which seems to have been deprecated and then removed in earlier Xcode versions (Possible backport?)

Error: value size does not match register size specified by the constraint and modifier [-Werror,-Wasm-operand-widths]

may be more problematic

Haroon-Khel commented 1 year ago

Just to be rigorous, I've kicked off the AQA test pipeline on all of our mac machines: JDK8 and 11 for x64, just 11 for arm. The focus is the build and test-macstadium machines; the other machines can be used as a 'control'.

test-macstadium-macos1014-x64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/158/console
test-macstadium-macos1014-x64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/157/console
test-macstadium-macos11-arm64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/162/console
test-macstadium-macos11-arm64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/161/console
test-macstadium-macos1014-x64-3 https://ci.adoptium.net/job/AQA_Test_Pipeline/163/console
test-macstadium-macos1014-x64-4 https://ci.adoptium.net/job/AQA_Test_Pipeline/164/console
test-macstadium-macos1015-x64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/165/console
build-macstadium-macos11-arm64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/166/console
build-macstadium-macos11-arm64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/167/console
build-macstadium-macos1014-x64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/168/console
build-macstadium-macos1014-x64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/169/console
test-macincloud-macos1201-x64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/170/console
test-macincloud-macos1201-x64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/171/console

Haroon-Khel commented 1 year ago

Bit of a bad idea to run all of them at the same time. Some of the test jobs have expired even after 1 day.

Sifting through the tests that have finished and not expired, avoiding duplicates (i.e. if jdk_security1_0 and jdk_security1_1 have the same failed tests, only jdk_security1_0 is shown):

test-macstadium-macos11-arm64-1 jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_net_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,jdk_jdi_0,jdk_security_infra_0

test-macstadium-macos11-arm64-2 (same failures as -1) jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_net_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,jdk_jdi_0,jdk_security_infra_0

build-macstadium-macos11-arm64-2 jdk_math_1,jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_security3_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,jdk_jdi_0 ,jdk_security_infra_0

build-macstadium-macos11-arm64-1 (same failures as -2) jdk_math_1,jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_security3_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,jdk_jdi_0 ,jdk_security_infra_0
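The two sets of failing targets above are long enough that the difference is hard to spot by eye. The following is a rough sketch of how they could be compared. The target lists are copied from this comment, and the helper name `targets` is made up for illustration.

```python
def targets(csv):
    """Split a comma-separated target list into a set, ignoring stray spaces."""
    return {name.strip() for name in csv.split(",") if name.strip()}


# Failing targets reported on test-macstadium-macos11-arm64-1 and -2.
test_arm64 = targets(
    "jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,"
    "jdk_io_0,jdk_other_0,jdk_net_0,jdk_net_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,"
    "jdk_jdi_0,jdk_security_infra_0"
)

# Failing targets reported on build-macstadium-macos11-arm64-1 and -2.
build_arm64 = targets(
    "jdk_math_1,jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,"
    "jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_security3_0,jdk_time_0,"
    "jdk_tools_0,jdk_jfr_0,jdk_jdi_0 ,jdk_security_infra_0"
)

only_build = sorted(build_arm64 - test_arm64)
only_test = sorted(test_arm64 - build_arm64)
print(only_build)  # ['jdk_math_1', 'jdk_security3_0']
print(only_test)   # []
```

So the build machines fail everything the test machines fail, plus jdk_math_1 and jdk_security3_0.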

sxa commented 1 year ago

So the failures you've got are only from the arm64 ones? And are all those targets from the openjdk suite - were the other targets all good? I'm a bit surprised we're seeing issues on arm64 when using the arm64 builds - I would expect some issues when trying to run the x64 ones on arm64 but it looks like you've run those with the real arm64 build - is that correct? I'm particularly interested in test-macstadium-macos1014-x64-4 and the build-x64 ones, so if those results have got lost we should get those re-run.

Haroon-Khel commented 1 year ago

| Machine | Xcode version | JDK11 x64 build | JDK17 x64 build | JDK20 x64 build |
| --- | --- | --- | --- | --- |
| build-macstadium-macos11-arm64-1 | Apple clang version 12.0.0 (clang-1200.0.32.29) | build | build | build |
| build-macstadium-macos11-arm64-2 | Apple clang version 12.0.0 (clang-1200.0.32.29) | build | build | build |
| test-macstadium-macos11-arm64-1 | Apple clang version 12.0.0 (clang-1200.0.32.29) | build | build | build |
| test-macstadium-macos11-arm64-2 | Apple clang version 12.0.0 (clang-1200.0.32.29) | build | build | build |
| test-macincloud-macos1201-x64-1 | Apple clang version 13.0.0 (clang-1300.0.29.3) | build | build | build |
| test-macincloud-macos1201-x64-2 | Apple clang version 13.1.6 (clang-1316.0.21.2.3) | build | build | build |

Can only kick off one build job at a time and on one machine at a time 😅 , this will take a while

sxa commented 1 year ago

A couple of other things to add to this list - see if we can build ok on clang13 on macos12 (The two macincloud machines) but also see if we can install the older version of xcode (The one used for JDK8) on a newer macos version.

Haroon-Khel commented 11 months ago

Notes from building x64 jdk8 on my m1 mac

Install xcode11.7. I can do this on my own mac (with GUI), need to find a way to do this headless

Switch to xcode 11.7:

xcode-select -switch 'path to Xcode11.7'

Install 'intel' homebrew into /usr/local/Homebrew, requires a new Rosetta bash shell

arch -x86_64 /usr/bin/env bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Back to a non Rosetta shell:

export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"

Install intel libpng (for freetype):

arch -x86_64 brew install libpng

Command to run build

arch -x86_64 ./makejdk-any-platform.sh --clean-git-repo --jdk-boot-dir 'path to x64 jdk8 mac binary'/Contents/Home --configure-args '--with-toolchain-type=clang --openjdk-target=x86_64-apple-darwin --with-cups=/opt/homebrew/opt/cups/' --target-file-name jdk8_x64.tar.gz --build-variant temurin jdk8u

If there are still errors with the freetype compilation, install intel freetype and rerun the build:

arch -x86_64 brew install freetype
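For anyone wanting to script the build step, this sketch assembles the same makejdk-any-platform.sh invocation as an argument list. It does not run anything. The function name, its parameters and the boot JDK path are placeholders I have assumed for illustration; only the flags themselves come from the command above.

```python
import shlex


def makejdk_command(boot_jdk_home,
                    cups_dir="/opt/homebrew/opt/cups/",
                    target_file="jdk8_x64.tar.gz"):
    """Build the argument list for the x64 JDK8 cross-build on an arm64 mac.

    boot_jdk_home is the Contents/Home directory of an x64 JDK8 mac binary.
    """
    configure_args = " ".join([
        "--with-toolchain-type=clang",
        "--openjdk-target=x86_64-apple-darwin",
        f"--with-cups={cups_dir}",
    ])
    return [
        "arch", "-x86_64", "./makejdk-any-platform.sh",
        "--clean-git-repo",
        "--jdk-boot-dir", boot_jdk_home,
        "--configure-args", configure_args,
        "--target-file-name", target_file,
        "--build-variant", "temurin",
        "jdk8u",
    ]


# Hypothetical boot JDK location; substitute the real path on the machine.
cmd = makejdk_command("/path/to/x64-jdk8/Contents/Home")
print(shlex.join(cmd))
```

Keeping the configure arguments as a single list element mirrors the quoting in the original command, where they are passed as one quoted string.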

Haroon-Khel commented 11 months ago

I built another x64 jdk8 binary on build-macstadium-macos11-arm64-1 and uploaded it to jenkins here

I kicked off the aqa test pipeline, https://ci.adoptium.net/job/AQA_Test_Pipeline/173/console. Only sanity openjdk failed https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/883/

jdk_jdi_jdk8_0
 com/sun/jdi/RedefineCrossEvent.java.RedefineCrossEvent
 com/sun/jdi/PrivateTransportTest.sh.PrivateTransportTest
Haroon-Khel commented 11 months ago

In the interest of seeing how x64 mac tests run on arm64 mac, I kicked off https://ci.adoptium.net/job/AQA_Test_Pipeline/174/console (jdk11 aqa tests on test-macstadium-macos11-arm64-1).

Most tests passed. Failing ones are:

Jlink_ReqMod

MathLoadTest_all_5m

jdk_io
   java/io/Serializable/serialFilter/GlobalFilterTest.java

jdk_time
   java/time/test/java/time/format/TestUTCParse.java

jdk_jfr_0 44 failed tests

jdk_jdi
   com/sun/jdi/JdbOptions.java

jdk_security_infra
   security/infra/java/security/cert/CertPathValidator/certification/GoogleCA.java

jdk_svc_sanity
   jdk/jfr/jcmd/TestJcmdStartStopDefault.java
Haroon-Khel commented 11 months ago

Ref https://github.com/adoptium/infrastructure/issues/2536#issuecomment-1714401394

com/sun/jdi/RedefineCrossEvent.java.RedefineCrossEvent is excluded on openj9, https://github.com/adoptium/aqa-tests/blob/80e978693163b65ce6d3caabeb823ba594766167/openjdk/excludes/ProblemList_openjdk8-openj9.txt#L333

Known issue https://github.com/adoptium/aqa-tests/issues/227, it fails the same way

Execution failed: `main' threw exception: com.sun.jdi.VMDisconnectedException: connection is closed    

Rerunning com/sun/jdi/PrivateTransportTest.sh.PrivateTransportTest on test-macstadium-macos1014-x64-2 https://ci.adoptium.net/job/Grinder/7564/console. Test passed ✅

So a cross compiled x64 jdk8 binary passes the tests in the AQA pipeline. Excellent news

Haroon-Khel commented 11 months ago

Ref https://github.com/adoptium/infrastructure/issues/2536#issuecomment-1721206291

Rerunning the failing tests on different arm64 mac machines to rule out infra related failure

Jlink_ReqMod, MathLoadTest_all_5m https://ci.adoptium.net/view/Test_grinder/job/Grinder/7568/console on build-macstadium-macos11-arm64-2

MathLoadTest_all_5m passed, rerunning Jlink_ReqMod on build-macstadium-macos11-arm64-1 https://ci.adoptium.net/view/Test_grinder/job/Grinder/7574/console

On build-macstadium-macos11-arm64-1:

java/io/Serializable/serialFilter/GlobalFilterTest.java https://ci.adoptium.net/job/Grinder/7569/console
java/time/test/java/time/format/TestUTCParse.java https://ci.adoptium.net/job/Grinder/7570/console
com/sun/jdi/JdbOptions.java https://ci.adoptium.net/job/Grinder/7571/console
security/infra/java/security/cert/CertPathValidator/certification/GoogleCA.java https://ci.adoptium.net/job/Grinder/7572/console
jdk/jfr/jcmd/TestJcmdStartStopDefault.java https://ci.adoptium.net/job/Grinder/7573/console

security/infra/java/security/cert/CertPathValidator/certification/GoogleCA.java rerun https://ci.adoptium.net/job/Grinder/7575/console on build-macstadium-macos11-arm64-2

java/time/test/java/time/format/TestUTCParse.java rerun https://ci.adoptium.net/job/Grinder/7576/console on build-macstadium-macos11-arm64-2

Haroon-Khel commented 11 months ago

I've modified https://github.com/adoptium/infrastructure/blob/6dff77f14bab907d90d2f16b61ac8f0e96b60b3a/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/Xcode/tasks/main.yml#L76 to install Xcode11.7 onto arm64 macs, but when I run the playbook it hangs at that task for a considerable amount of time (so far it's been 30 minutes with no change).

On the remote machine I can see the xcversion process running

Haroon-Khel commented 11 months ago

If I try to install Xcode11.7 in an ssh session using the ansible commands linked above I get this error

%xip: error: The archive “Xcode_11.7.xip” is damaged and can’t be expanded.
No `Xcode.app(or Xcode-beta.app)` found in XIP. Please remove /Users/administrator/Library/Caches/XcodeInstall/Xcode_11.7.xip if you suspect a corrupted download or run `xcversion update` to see if the version you tried to install has been pulled by Apple. If none of this is true, please open a new GH issue. 
administrator@test-macstadium-macos11-arm64-1 ~ % 

I've tried an xcversion update but it still fails

Haroon-Khel commented 11 months ago

It seems xcversion is no longer supported https://github.com/xcpretty/xcode-install/blob/master/MIGRATION.md

I'm trying out the suggested alternative https://github.com/XcodesOrg/xcodes but am hitting library errors:

administrator@test-macstadium-macos11-arm64-1 ~ % xcodes --help
dyld: lazy symbol binding failed: can't resolve symbol _swift_task_create in /opt/homebrew/bin/xcodes because dependent dylib @rpath/libswift_Concurrency.dylib could not be loaded
dyld: can't resolve symbol _swift_task_create in /opt/homebrew/bin/xcodes because dependent dylib @rpath/libswift_Concurrency.dylib could not be loaded
zsh: abort      xcodes --help

I think @rpath/libswift_Concurrency.dylib comes with xcode 13 which is not yet on the machine

Haroon-Khel commented 10 months ago

ref https://github.com/adoptium/ci-jenkins-pipelines/pull/825#issuecomment-1759488201

I have temporarily added x64 labels to build-macstadium-macos11-arm64-1 and build-macstadium-macos11-arm64-2 to allow x64 mac jdk build jobs to run on them.

The PATH variable in their jenkins config has (temporarily) been changed from /usr/local/bin/:$PATH:/opt/homebrew/bin to /opt/homebrew/Cellar/git/2.42.0/bin:/usr/local/bin/:$PATH:/opt/homebrew/bin

sxa commented 10 months ago

Work on the orka setup is still ongoing at macstadium - will update this ticket when things move forward

gdams commented 9 months ago

I have managed to get an Arm64 Orka VM to successfully compile JDK8u using Haroon's changes to the playbook:

https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-mac-x64-temurin/436/

gdams commented 9 months ago

Now onto JDK11+ which will require a different version of XCode to be installed

gdams commented 9 months ago

Intel tests are passing on the intel VM image. I'm going to take the fixed ones offline to see if Orka can cope

gdams commented 9 months ago

JDK11 build completed using XCode command line tools (same as before) https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-mac-x64-temurin/327/

gdams commented 9 months ago

JDK17 x64 build: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-mac-x64-temurin/405/
JDK17 aarch64 build: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-mac-aarch64-temurin/351/

gdams commented 9 months ago

Trying again with Xcode 15.0.1:

JDK21 x64: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/jdk21u-mac-x64-temurin/37/
JDK21 aarch64: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/jdk21u-mac-aarch64-temurin/35/
JDK17 x64: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-mac-x64-temurin/406/
JDK17 aarch64: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-mac-aarch64-temurin/352/

gdams commented 9 months ago

Right now the main issues I'm seeing are with the VPN expiring after a certain amount of time. This should be resolved once the firewall is configured to allow Jenkins in.

sxa commented 8 months ago

@gdams Not sure it's been explicitly mentioned in here, but since it came up in the PMC this week can you clarify the reason for moving to XCode 15? The openjdk build matrix lists 12 as the Oracle-supported compiler, with 13.1 as "known good" too. It seems possible that this is the cause of a lot of warnings showing in the build: https://github.com/adoptium/temurin-build/issues/3562 so we should consider how to handle this.

smlambert commented 8 months ago

Still seeing some Terminated failures, for example, from https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/938/console:

14:17:45  TESTING:
14:17:47  Directory "/Users/admin/workspace/workspace/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/../TKG/output_17023199822147/jdk_lang_1/work" not found: creating
14:17:47  Directory "/Users/admin/workspace/workspace/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/../TKG/output_17023199822147/jdk_lang_1/report" not found: creating
14:17:49  XML output with verification to /Users/admin/workspace/workspace/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/output_17023199822147/jdk_lang_1/work
14:36:34  make[1]: *** [sanity.openjdk-..] Terminated: 15
14:36:34  make: *** [_sanity.openjdk] Terminated: 15
14:36:34  /Users/admin/workspace/workspace/Test_openjdk8_hs_sanity.openjdk_x86-64_mac@tmp/durable-bbc054c3/script.sh: line 1:  1778 Terminated: 15          $MAKE _sanity.openjdk
14:36:34  make[2]: *** [sanity.openjdk-openjdk] Terminated: 15
14:36:34  make[3]: *** [jdk_lang_1] Terminated: 15
[Pipeline] sh
sxa commented 8 months ago

> Still seeing some Terminated failures, for example, from https://ci.adoptium.net

Grinding away to see how reproducible this is and if there's any consistency:

Noting that "Worked?" is going to indicate:

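| Grinder | arch | machine | Worked? | Comment |
| --- | --- | --- | --- | --- |
| 8244 | x64 | cloud-2 | | jdi failures |
| 8247 | x64 | j4dtq | | |
| 8248 | x64 | cloud-2 | | jdi failures |
| 8249 | x64 | cloud-1 | | jdi failures |
| 8250 | x64 | bnxp5 | | |
| 8251 | x64 | 6zdxr | | |
| 8252 | x64 | 4jxrn | | |
| 8253 | aarch64 | ckvq7 | | |
| 8254 | aarch64 | lwrdg | | |
| 8255 | aarch64 | 7gffk | | |
| 8256 | aarch64 | gnvw4 | | |
| 8257 | aarch64 | fm88f | | |
| 8258 | aarch64 | d9tcd | | |
| 8259 | aarch64 | q5bmc | | |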

Noting that the macincloud x64 machines completed sanity.openjdk in about 40 minutes, while the orka ones took about 1h30.

None of them seemed to have any of the unexpected termination problems, although since these were all kicked off in parallel they were unlikely to have re-used any existing machines...

sxa commented 8 months ago

Looking at some recently failing jobs on macos:

JDK8 extended.system#929 - Failed during the setup phase

18:35:37  Uncompressing file: OpenJDK8U-jdk_x64_mac_hotspot_2023-12-13-18-05.tar.gz ...
18:35:44  Cannot contact test-orka-macos14-x64-5p7nd: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
04:35:12  Cancelling nested steps due to timeout

JDK8 extended.functional#572 - Failed a few minutes after the start

18:17:34  TESTING:
18:17:35  Directory "/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/output_17024914523857/CryptoTests_0/work" not found: creating
18:17:35  Directory "/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/output_17024914523857/CryptoTests_0/report" not found: creating
18:17:35  XML output  to /home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/output_17024914523857/CryptoTests_0/work
18:17:42  make[1]: *** [settings.mk:356: extended.functional-..] Terminated
18:17:42  make: *** [makefile:65: _extended.functional] Terminated
18:17:42  Terminated
18:17:42  make[2]: *** [/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/settings.mk:356: extended.functional-functional] Terminated
18:17:42  make[3]: *** [/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/settings.mk:356: extended.functional-security] Terminated
18:17:42  make[4]: *** [/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/settings.mk:356: extended.functional-Crypto] Terminated
18:17:42  make[5]: *** [autoGen.mk:31: CryptoTests_0] Terminated
[Pipeline] sh

JDK17 extended.functional - 2-3 minutes after the start

00:04:49  TESTING:
00:04:50  Directory "/Users/admin/workspace/workspace/Test_openjdk17_hs_extended.functional_x86-64_mac/aqa-tests/TKG/../TKG/output_17024258838618/CryptoTests_0/work" not found: creating
00:04:50  Directory "/Users/admin/workspace/workspace/Test_openjdk17_hs_extended.functional_x86-64_mac/aqa-tests/TKG/../TKG/output_17024258838618/CryptoTests_0/report" not found: creating
00:04:51  XML output  to /Users/admin/workspace/workspace/Test_openjdk17_hs_extended.functional_x86-64_mac/aqa-tests/TKG/output_17024258838618/CryptoTests_0/work
00:06:46  Cannot contact test-orka-macos14-x64-jpxkh: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3c1aacd7:test-orka-macos14-x64-jpxkh": Remote call on test-orka-macos14-x64-jpxkh failed. The channel is closing down or has closed down
10:01:31  Cancelling nested steps due to timeout
10:01:31  Could not connect to test-orka-macos14-x64-jpxkh to send interrupt signal to process

JDK17 sanity.openjdk - very early failure

JDK11 sanity.openjdk - within five minutes of job start

18:43:50  Directory "/Users/admin/workspace/workspace/Test_openjdk11_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/../TKG/output_17024066213950/jdk_lang_0/work" not found: creating
18:43:50  Directory "/Users/admin/workspace/workspace/Test_openjdk11_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/../TKG/output_17024066213950/jdk_lang_0/report" not found: creating
18:44:23  XML output with verification to /Users/admin/workspace/workspace/Test_openjdk11_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/output_17024066213950/jdk_lang_0/work
18:52:45  Cannot contact test-orka-macos14-x64-zr7cl: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@51d72d7:test-orka-macos14-x64-zr7cl": Remote call on test-orka-macos14-x64-zr7cl failed. The channel is closing down or has closed down
04:39:57  Cancelling nested steps due to timeout
04:39:57  Could not connect to test-orka-macos14-x64-zr7cl to send interrupt signal to process
[Pipeline] sh

This looks like it might be the Orka system decommissioning the machine because it thinks it is no longer in use after being provisioned in a previous run, but that is not immediately clear. Looking at the last one, there is an entry in the jenkins log from two minutes later about it being deleted (subject to time sync being correct):

[12/12/23 19:39:24] SSH Launch of test-orka-macos14-x64-zr7cl on xxx.yyy.zz.aa completed in 30,723 ms
jenkins.log.1:2023-12-13 04:41:44.163+0000 [id=107] INFO    h.slaves.CloudRetentionStrategy#check: Disconnecting test-orka-macos14-x64-zr7cl
jenkins.log.1:2023-12-13 04:41:44.163+0000 [id=107] INFO    i.j.p.orka.OrkaProvisionedAgent#_terminate: Terminating agent. VM id: test-orka-macos14-x64-zr7cl
jenkins.log.1:2023-12-13 04:41:44.201+0000 [id=107] INFO    i.jenkins.plugins.orka.OrkaCloud#deleteVM: VM test-orka-macos14-x64-zr7cl is successfully deleted.
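As a quick check on the "two minutes later" observation, the gap between the pipeline timeout and the agent being disconnected can be computed from the two timestamps. This is only a sketch and rests on a loud assumption: the console output shows no date or timezone, so the code assumes the 04:39:57 timeout happened on 2023-12-13 in the same timezone as jenkins.log.

```python
from datetime import datetime

# ASSUMPTION: the console timestamp (04:39:57, "Cancelling nested steps due to
# timeout") is on 2023-12-13 and in the same timezone as the jenkins.log entry.
timeout_at = datetime.strptime("2023-12-13 04:39:57", "%Y-%m-%d %H:%M:%S")

# From jenkins.log.1: CloudRetentionStrategy#check disconnecting the agent.
disconnected_at = datetime.strptime("2023-12-13 04:41:44.163", "%Y-%m-%d %H:%M:%S.%f")

gap = disconnected_at - timeout_at
minutes, seconds = divmod(int(gap.total_seconds()), 60)
print(f"{minutes}m{seconds:02d}s")  # 1m47s
```

If the assumption holds, the VM was deleted a little under two minutes after the job timed out, which would point at the deletion following the timeout rather than causing the original channel loss at 18:52:45.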
sxa commented 7 months ago

@gdams has raised the disconnect issues with MacStadium. Awaiting a response.

smlambert commented 7 months ago

The frequency of x64 mac test jobs being terminated / disconnected seems to have increased (4/9 of the dry run jobs fail to run).

jdk17 dry run pipeline

[Screenshot 2024-01-11 at 9:09:49 PM]

aarch64 mac test jobs seem not to suffer from this problem (as frequently, if at all)

[Screenshot 2024-01-11 at 9:13:22 PM]
smlambert commented 7 months ago

jdk8 dry run pipeline

[Screenshot 2024-01-11 at 10:35:57 PM]
smlambert commented 4 months ago

4 test cases in jdk_net and jdk_nio related to multicasting do not pass on Orka machines; details here: https://github.com/adoptium/aqa-tests/issues/5156#issuecomment-2008018350

jdk_net
TEST: java/net/DatagramSocket/DatagramSocketExample.java
TEST: java/net/DatagramSocket/DatagramSocketMulticasting.java

jdk_nio
TEST: java/nio/channels/DatagramChannel/AdaptorMulticasting.java
TEST: java/nio/channels/DatagramChannel/BasicMulticastTests.java

sxa commented 4 months ago

@gdams as discussed - here are some examples of the errors I'm seeing in the jenkins log as a result of Orka:

Unable to make field private static final long java.nio.channels.ClosedChannelException.serialVersionUID accessible

2024-04-09 22:00:28.171+0000 [id=2688702] WARNING jenkins.util.Listeners#lambda$notify$0
java.lang.reflect.InaccessibleObjectException: Unable to make field private static final long java.nio.channels.ClosedChannelException.serialVersionUID accessible: module java.base does not "opens java.nio.channels" to unnamed module @6be968ce
    at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
    at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
    at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
    at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
    at com.thoughtworks.xstream.converters.reflection.FieldDictionary.buildDictionaryEntryForClass(FieldDictionary.java:176)
    at com.thoughtworks.xstream.converters.reflection.FieldDictionary.buildMap(FieldDictionary.java:142)
    at com.thoughtworks.xstream.converters.reflection.FieldDictionary.fieldsFor(FieldDictionary.java:80)
    at com.thoughtworks.xstream.converters.reflection.PureJavaReflectionProvider.visitSerializableFields(PureJavaReflectionProvider.java:167)
    at hudson.util.RobustReflectionConverter.doMarshal(RobustReflectionConverter.java:206)
    at hudson.util.RobustReflectionConverter.marshal(RobustReflectionConverter.java:163)
    at com.thoughtworks.xstream.converters.extended.ThrowableConverter.marshal(ThrowableConverter.java:62)
    at com.thoughtworks.xstream.core.AbstractReferenceMarshaller.convert(AbstractReferenceMarshaller.java:68)
    at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:59)
    at com.thoughtworks.xstream.core.AbstractReferenceMarshaller$1.convertAnother(AbstractReferenceMarshaller.java:83)
    at hudson.util.RobustReflectionConverter.marshallField(RobustReflectionConverter.java:283)
    at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:270)
Caused: java.lang.RuntimeException: Failed to serialize hudson.slaves.OfflineCause$ChannelTermination#cause for class hudson.slaves.OfflineCause$ChannelTermination
    at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:274)
    at hudson.util.RobustReflectionConverter$2.visit(RobustReflectionConverter.java:241)
    at com.thoughtworks.xstream.converters.reflection.PureJavaReflectionProvider.visitSerializableFields(PureJavaReflectionProvider.java:174)
    at hudson.util.RobustReflectionConverter.doMarshal(RobustReflectionConverter.java:226)
    at hudson.util.RobustReflectionConverter.marshal(RobustReflectionConverter.java:163)
    at com.thoughtworks.xstream.core.AbstractReferenceMarshaller.convert(AbstractReferenceMarshaller.java:68)
    at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:59)
    at com.thoughtworks.xstream.core.AbstractReferenceMarshaller$1.convertAnother(AbstractReferenceMarshaller.java:83)
    at hudson.util.RobustReflectionConverter.marshallField(RobustReflectionConverter.java:283)
    at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:270)
Caused: java.lang.RuntimeException: Failed to serialize hudson.model.Node#temporaryOfflineCause for class hudson.slaves.DumbSlave
    at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:274)
    at hudson.util.RobustReflectionConverter$2.visit(RobustReflectionConverter.java:241)
    at com.thoughtworks.xstream.converters.reflection.PureJavaReflectionProvider.visitSerializableFields(PureJavaReflectionProvider.java:174)
    at hudson.util.RobustReflectionConverter.doMarshal(RobustReflectionConverter.java:226)
    at hudson.util.RobustReflectionConverter.marshal(RobustReflectionConverter.java:163)
    at com.thoughtworks.xstream.core.AbstractReferenceMarshaller.convert(AbstractReferenceMarshaller.java:68)
    at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:59)
    at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:44)
    at com.thoughtworks.xstream.core.TreeMarshaller.start(TreeMarshaller.java:83)
    at com.thoughtworks.xstream.core.AbstractTreeMarshallingStrategy.marshal(AbstractTreeMarshallingStrategy.java:37)
    at com.thoughtworks.xstream.XStream.marshal(XStream.java:1303)
    at com.thoughtworks.xstream.XStream.marshal(XStream.java:1292)
    at com.thoughtworks.xstream.XStream.toXML(XStream.java:1265)
    at com.thoughtworks.xstream.XStream.toXML(XStream.java:1252)
    at hudson.plugins.jobConfigHistory.FileHistoryDao.hasDuplicateHistory(FileHistoryDao.java:1299)
    at hudson.plugins.jobConfigHistory.ComputerHistoryListener.onChange(ComputerHistoryListener.java:117)
    at hudson.plugins.jobConfigHistory.ComputerHistoryListener.onConfigurationChange(ComputerHistoryListener.java:69)
    at jenkins.util.Listeners.lambda$notify$0(Listeners.java:59)
    at jenkins.util.Listeners.notify(Listeners.java:70)
    at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:278)
    at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1705)
    at jenkins.model.Nodes$5.run(Nodes.java:279)
    at hudson.model.Queue._withLock(Queue.java:1401)
    at hudson.model.Queue.withLock(Queue.java:1275)
    at jenkins.model.Nodes.removeNode(Nodes.java:270)
    at jenkins.model.Jenkins.removeNode(Jenkins.java:2266)
    at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
    at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:61)
    at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:45)
    at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:970)
    at hudson.model.Queue._withLock(Queue.java:1401)
    at hudson.model.Queue.withLock(Queue.java:1275)
    at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:967)
    at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:147)
    at hudson.model.AbstractCIBase$1.run(AbstractCIBase.java:255)
    at hudson.model.Queue._withLock(Queue.java:1401)
    at hudson.model.Queue.withLock(Queue.java:1275)
    at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:238)
    at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1705)
    at jenkins.model.Nodes$5.run(Nodes.java:279)
    at hudson.model.Queue._withLock(Queue.java:1401)
    at hudson.model.Queue.withLock(Queue.java:1275)
    at jenkins.model.Nodes.removeNode(Nodes.java:270)
    at jenkins.model.Jenkins.removeNode(Jenkins.java:2266)
    at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
    at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:61)
    at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:45)
    at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:970)
    at hudson.model.Queue._withLock(Queue.java:1401)
    at hudson.model.Queue.withLock(Queue.java:1275)
    at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:967)
    at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:147)
    at hudson.model.AbstractCIBase$1.run(AbstractCIBase.java:255)
    at hudson.model.Queue._withLock(Queue.java:1401)
    at hudson.model.Queue.withLock(Queue.java:1275)
    at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:238)
    at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1705)
    at jenkins.model.Nodes$5.run(Nodes.java:279)
    at hudson.model.Queue._withLock(Queue.java:1401)
    at hudson.model.Queue.withLock(Queue.java:1275)
    at jenkins.model.Nodes.removeNode(Nodes.java:270)
    at jenkins.model.Jenkins.removeNode(Jenkins.java:2266)
    at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
    at io.jenkins.plugins.orka.WaitSSHLauncher.deleteAgent(WaitSSHLauncher.java:58)
    at io.jenkins.plugins.orka.WaitSSHLauncher.launch(WaitSSHLauncher.java:45)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-09 22:00:28.174+0000 [id=2688702] INFO o.j.p.cloudstats.CloudStatistics#getIdFor: No support for cloud-stats-plugin by class io.jenkins.plugins.orka.OrkaProvisionedAgent
2024-04-09 22:00:28.207+0000 [id=2688702] WARNING jenkins.util.Listeners#lambda$notify$0
Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: Requested CPU is not available in the cluster

_Note: I'm not sure if the exception underneath it in the log is directly related to the Orka message_

2024-04-10 00:32:35.570+0000 [id=2701897] WARNING i.j.plugins.orka.AgentTemplate#provision: Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: Requested CPU is not available in the cluster
No available nodes with sufficient memory
No node in `READY` state is available to deploy to. Run `orka3 nodes list --namespace orka-default` to check nodes state
2024-04-10 00:32:35.596+0000 [id=2701646] WARNING i.j.plugins.orka.AgentTemplate#provision: Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: Requested CPU is not available in the cluster
No available nodes with sufficient memory
No node in `READY` state is available to deploy to. Run `orka3 nodes list --namespace orka-default` to check nodes state
2024-04-10 00:32:40.865+0000 [id=2701607] WARNING h.i.i.InstallUncaughtExceptionHandler#handleException
java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:170)
at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:112)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Caused: java.io.IOException

There's also this:

2024-04-09 22:00:28.129+0000 [id=2688702] INFO o.j.p.cloudstats.CloudStatistics#getIdFor: No support for cloud-stats-plugin by class io.jenkins.plugins.orka.OrkaProvisionedAgent

For the second one above, I guess it's possible that it's being generated as a result of us hitting capacity on the cluster, but it might be good to verify whether such a condition has happened today. Since we've been kicking off five release runs in parallel, it's entirely possible this is a fairly unusual condition :-)
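One rough way to check whether we hit capacity is to count the `Deploying VM failed` warnings per hour in the Jenkins log and compare them against when the release runs were started. A minimal Python sketch follows. It assumes the log records look like the ones quoted above; the function name and the sample text are just for illustration, and the regex has not been validated against other log layouts.

```python
import re
from collections import Counter

# Matches WARNING records shaped like the ones quoted in this comment, e.g.
# 2024-04-10 00:32:35.570+0000 [id=2701897] WARNING i.j.plugins.orka.AgentTemplate#provision: Deploying VM failed ...
# Assumption: timestamp, thread id, level and logger always appear in this order.
RECORD = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}\.\d{3}\+\d{4} "
    r"\[id=\d+\]\s+WARNING\s+\S*AgentTemplate#provision: Deploying VM failed"
)


def count_provision_failures(log_text):
    """Return a Counter mapping 'YYYY-MM-DD HH' to the number of failed VM deployments."""
    counts = Counter()
    for match in RECORD.finditer(log_text):
        counts[f"{match['date']} {match['hour']}"] += 1
    return counts


if __name__ == "__main__":
    # Sample built from records quoted above; a real run would read the Jenkins log instead.
    sample = (
        "2024-04-09 22:00:28.174+0000 [id=2688702] INFO o.j.p.cloudstats.CloudStatistics#getIdFor: "
        "No support for cloud-stats-plugin by class io.jenkins.plugins.orka.OrkaProvisionedAgent\n"
        "2024-04-10 00:32:35.570+0000 [id=2701897] WARNING i.j.plugins.orka.AgentTemplate#provision: "
        "Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: "
        "Requested CPU is not available in the cluster\n"
        "2024-04-10 00:32:35.596+0000 [id=2701646] WARNING i.j.plugins.orka.AgentTemplate#provision: "
        "Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: "
        "Requested CPU is not available in the cluster\n"
    )
    for hour, failures in sorted(count_provision_failures(sample).items()):
        print(hour, failures)
```

If the failures cluster in the hours when the parallel release runs were launched, that would support the capacity theory. It would not prove it, since the script only counts log lines and knows nothing about the actual node state.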

sxa commented 4 weeks ago

@andrew-m-leonard @smlambert Are we still seeing the issues mentioned in the previous comment?

andrew-m-leonard commented 4 weeks ago

@adamfarley Have you seen any Mac problems, or is it all good now?

adamfarley commented 4 weeks ago

For the nio and net failures:

Looks like we're infrequently struggling with some of the multicast tests (only one instance of failure on JDK23, none elsewhere).

07:26:42 TEST: java/nio/channels/DatagramChannel/BasicMulticastTests.java
07:26:42 TEST: java/nio/channels/DatagramChannel/AdaptorMulticasting.java

https://ci.adoptium.net/job/Test_openjdk23_hs_extended.openjdk_x86-64_mac_testList_1/7/console

Both issues seem to be linked to this error: java.net.SocketException: Resource busy (setsockopt failed)

The reruns of those unit tests passed.

jdk_net tests seem to pass consistently across all JDK versions.