adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Replace Ubuntu 23.10 Scaleway machines with 24.04 LTS #3598

Open sxa opened 2 weeks ago

sxa commented 2 weeks ago

The existing machines will be out of support soon, so moving to the LTS release is preferable now that Scaleway can provide Ubuntu 24.04.

Part of https://github.com/adoptium/infrastructure/issues/3589

sxa commented 2 weeks ago

First machine is set up and tests are being run.

Once the results are verified as good we can start migrating the others (currently 10 in total). Note that Scaleway now also offers Debian (unstable) and Fedora 37, so it may be worth provisioning one of each for testing too.

sxa commented 2 weeks ago

Added https://ci.adoptium.net/computer/test-rise-fedora37-riscv64-1 and running the playbook now (various issues, covered in https://github.com/adoptium/infrastructure/issues/3599). AQA pipeline: https://ci.adoptium.net/job/AQA_Test_Pipeline/293/

sxa commented 2 weeks ago

Ubuntu 24.04 / JDK21 / riscv64 (Nightly)

Test suite Result ✅⚠️❌
sanity.functional
extended.functional Unrecognized VM option 'EnableExtendedHCR' from IllegalAccessProtectedMethodTest_0 suite. Issue raised: https://github.com/adoptium/aqa-tests/issues/5393 - regrind after potential fix. Re-run of full test run at https://ci.adoptium.net/job/Test_openjdk21_hs_extended.functional_riscv64_linux/53/ (which will hopefully exclude the target): YES, PASS
special.functional
sanity.openjdk java_lang VarHandleTestAccessShort Grinder*100#10389 PASSED 98/100 Failure is test VarHandleTestAccessShort.testAccess("VarHandle -> Array", VarHandle -> Array):: java.lang.AssertionError: success weakCompareAndSetPlain short expected [true] but found [false]
extended.openjdk
sanity.system
extended.system
sanity.perf
extended.perf dacapo-fop_0 failure - core dump Fatal glibc error: pthread_mutex_lock.c:94 (___pthread_mutex_lock): assertion failed: mutex->__data.__owner == 0 Grinder*10#10390 PASSED 10/10

Fedora37 / JDK21 / riscv64 (nightly)

Test suite Result ✅⚠️❌
sanity.functional
extended.functional IllegalAccessProtectedMethodTest_0 Unrecognized VM option 'EnableExtendedHCR' Issue raised
special.functional
sanity.openjdk java/lang/Error in java_util MultipleProducersSingleConsumerLoops.java Grinder*100#10391 Passed 99/100
extended.openjdk 39 failures Grinder*10#10393 Only ran first test with hotspot_custom - running with jdk_custom at G#10407
sanity.system
extended.system
sanity.perf
extended.perf

Most of the extended.openjdk failures on F37 are Datagram/Multicast tests, so they are likely related to the network configuration on the deployed system (maybe IPv6-related in some cases?).

Ubuntu 24.04 / JDK17 / riscv64 (Nightly)

Test suite Result ✅⚠️❌
sanity.functional ⚠️ cmdLineTester_libpathTestRtfChild_0 failure due to missing libawt_xawt.so (Headless build) (xml file) Grinder re-run link. Related issue Grinder*10@10414 FAILED 10/10
extended.functional
special.functional
sanity.openjdk
extended.openjdk 14 jpackage failures (existing issue) plus failure in SSLSocketAlpnTest Grinder*100#10428 PASS 99/100
sanity.system
extended.system Failed LockingLoadTest_0 (Hung process) Re-grind*100@10412 PASSED 99/100
sanity.perf
extended.perf Failed renaissance-dec-tree_0 Crash - fatal error: refcount underflow Internal Error (symbol.cpp:335) Regrind*10@10413 PASSED 10/10

Fedora37 / JDK17 / riscv64 (Nightly)

Test suite Result ✅⚠️❌
sanity.functional Same cmdLineTester_libpathTestRtfChild_0 failure as Ubuntu JDK17
extended.functional IllegalAccessProtectedMethodTest_0 (J9 test failure) but didn't fail on Ubuntu? Maybe fixed via https://github.com/adoptium/aqa-tests/issues/5393
special.functional
sanity.openjdk ❌ Failed java/lang PublicMethodsTest.java (crash), java/util CurrencyTests.java and java/util SpinedBufferTest.java - Re-grinding*100#10429 ALL THREE PASSED 100/100
extended.openjdk ❌ 38 failures - similar to JDK21
sanity.system
extended.system
sanity.perf
extended.perf ❌ Terminating after 17h Running again Failed dacapo_jython_0 and renaissance-gauss-mix_0 - Re-running with 20 iterations

sxa commented 5 days ago

Four new machines have been provisioned and used to replace the 6-10 numbered ubuntu2310 systems. They were installed via playbooks with a hosts entry test-rise-ubuntu2410-riscv64-[1:4]. Two extras were also provisioned for temurin-compliance and will be set up in parallel.
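As an illustration of what the bracketed range in that hosts entry stands for, here is a minimal Python sketch that expands an Ansible-style `[start:end]` pattern into individual host names. The function name `expand_host_range` is hypothetical and this is not Ansible's own parser. It assumes a single numeric range with inclusive bounds, which is how Ansible inventory ranges behave.

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand one numeric [start:end] range (inclusive) into host names.

    Hypothetical helper for illustration only; Ansible does this internally
    when it reads the inventory.
    """
    match = re.search(r"\[(\d+):(\d+)\]", pattern)
    if match is None:
        # No range present: the pattern is already a single host name.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    # Preserve zero padding such as [01:10] by reusing the width of "start".
    width = len(match.group(1))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(start, end + 1)]


for host in expand_host_range("test-rise-ubuntu2410-riscv64-[1:4]"):
    print(host)
```

With the entry quoted above this yields four names, ending in -1 through -4.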

Noting:

sxa commented 4 days ago

Most of the problems above were resolved by switching to an Ubuntu 24.04 base with the versions of Python and Ansible installed with the OS. The underlying message was "An unknown error occurred: HTTPSConnection.__init__() got an unexpected keyword argument 'cert_file'", which was introduced in ansible-core 2.12 and rendered it incompatible with Ubuntu 20.04's Python 3.8. I had been using ansible-core 2.13.3 installed via pip on Ubuntu 20.04.

From https://docs.python.org/3.12/library/http.client.html#http.client.HTTPSConnection Changed in version 3.12: The deprecated key_file, cert_file and check_hostname parameters have been removed.

Ansible reference: https://github.com/ansible/ansible/issues/83213#issuecomment-2100960459
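To make the Python side of that error concrete, here is a minimal sketch of the form the Python documentation describes for 3.12 and later: the client certificate goes into an `ssl.SSLContext`, which is passed via `context=`, instead of using the removed `cert_file`/`key_file` keywords. The helper name `make_https_connection` and the host used are illustrative assumptions. This is not Ansible's code, and no network connection is opened until a request is made.

```python
import http.client
import inspect
import ssl
import sys
from typing import Optional


def make_https_connection(
    host: str,
    certfile: Optional[str] = None,
    keyfile: Optional[str] = None,
) -> http.client.HTTPSConnection:
    """Build an HTTPSConnection without the removed cert_file/key_file keywords.

    Hypothetical helper: the certificate, if any, is loaded into the context.
    """
    context = ssl.create_default_context()
    if certfile is not None:
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    # The constructor only stores its settings; it does not connect yet.
    return http.client.HTTPSConnection(host, context=context)


# Check whether this interpreter still accepts the old keyword.
params = inspect.signature(http.client.HTTPSConnection.__init__).parameters
accepts_cert_file = "cert_file" in params

conn = make_https_connection("ci.adoptium.net")
print(sys.version_info[:2], "accepts cert_file:", accepts_cert_file)
```

On an interpreter where `accepts_cert_file` is false, any caller that still passes `cert_file=` fails with the `unexpected keyword argument` error quoted above.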

sxa commented 4 days ago

All four Ubuntu 24.04 machines are now live in Jenkins and will be used from now on. I have marked all of the 23.10 ones offline for now, with the intention of running a full AQA test run on the 24.04 ones over the weekend and then decommissioning the older ones on Monday, replacing them with more 24.04 ones.

Full list of the machines

sxa commented 4 days ago

aqa_test_pipelines submitted for -3 and -4:

sxa commented 2 hours ago

New 24.04 machines 5-7 created to replace 23.10 machines 3-5. Added to the PR at https://github.com/adoptium/infrastructure/pull/3627/commits/2cc5cf82dc03533d2f1ce56d4b9593a56fc13562