adoptium / infrastructure

This repo contains all information about machine maintenance.

New Machine requirement: Solaris/x64 systems (Equinix replacement) #3347

Closed sxa closed 8 months ago

sxa commented 9 months ago

I need to request a new machine:

Please explain what this machine is needed for:

sxa commented 8 months ago

Noting that the licensing for ESXi was recently changed by Broadcom so it is likely that it will not be possible to utilise that for the replacement.

sxa commented 8 months ago

@steelhead31 @Haroon-Khel Have either of you used Solaris VMs with the libvirt/kvm provider in vagrant instead of virtualbox?

Haroon-Khel commented 8 months ago

I have not

steelhead31 commented 8 months ago

Nor have I, though there are some libvirt vagrant boxes available on vagrantup.
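
For what it's worth, switching providers would look roughly like the following; the box name is purely a placeholder (no published Solaris libvirt box has been verified here):

```sh
# Sketch only: requires the vagrant-libvirt plugin and a libvirt-capable box.
vagrant plugin install vagrant-libvirt
vagrant box add --provider libvirt <publisher>/<solaris10-libvirt-box>   # placeholder name
vagrant up --provider=libvirt
```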

sxa commented 8 months ago
Ubuntu EFI secure boot warning with Azure trusted VMs:

> UEFI Secure Boot requires additional configuration to work with third-party drivers.
>
> The system will assist you in configuring UEFI Secure Boot. To permit the use of third-party drivers, a new Machine-Owner Key (MOK) has been generated. This key now needs to be enrolled in your system's firmware.
>
> To ensure that this change is being made by you as an authorized user, and not by an attacker, you must choose a password now and then confirm the change after reboot using the same password, in both the "Enroll MOK" and "Change Secure Boot state" menus that will be presented to you when this system reboots.
>
> If you proceed but do not confirm the password upon reboot, Ubuntu will still be able to boot on your system but any hardware that requires third-party drivers to work correctly may not be usable.

If you try to bring up a VM without additional work, then you'll get this error:

> Error while connecting to Libvirt: Error making a connection to libvirt URI qemu:///system: Call to virConnectOpen failed: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory

In theory this can be mitigated by:

`/usr/src/linux-headers-6.5.0-1015-azure/scripts/sign-file sha256 /var/lib/shim-signed/mok/MOK.der /var/lib/shim-signed/mok/MOK.priv /var/lib/dkms/virtualbox/6.1.50/6.5.0-1015-azure/x86_64/module/vboxdrv.ko`

but I haven't got that to work yet, probably because the MOK password hasn't been entered on startup (you're prompted to set up MOK during the install of virtualbox).
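
In principle the MOK enrolment can be re-requested from the running system and then completed in the "Enroll MOK" firmware menu on the next boot; a minimal sketch, assuming the same shim-signed key location used in the command above:

```sh
# Check whether Secure Boot is actually enforcing on this VM.
sudo mokutil --sb-state
# Queue the Machine-Owner Key for enrolment; mokutil asks for a one-time password
# which has to be re-entered in the "Enroll MOK" menu at the next reboot.
sudo mokutil --import /var/lib/shim-signed/mok/MOK.der
```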

Using a "standard" VM of a D4 specification (which supports nested virtualisation) allows vagrant to work successfully without a reboot loop. Note that a d3as-V4 or B4ls_V2 will not work and gives the message Stderr: VBoxManage: error: AMD-V is not available (VERR_SVM_NO_SVM) when attempting to start the VM from Vagrant. Standard D16ds v4 (16 vcpus, 64 GiB memory) works ok.

To connect, the default ssh configuration on the Ubuntu client will not work, so you need to connect with additional options (see the sketch below):
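
This is the same style of invocation as the one documented later in this issue; a sketch, with the port, algorithm options and key path taken from that later comment:

```sh
# Connect to the Vagrant-managed Solaris guest directly, enabling the older
# host key / public key algorithms that the Solaris 10 sshd offers.
ssh vagrant@127.0.0.1 -p 2222 \
  -o HostKeyAlgorithms=ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519 \
  -o PubKeyAcceptedKeyTypes=ssh-rsa \
  -i .vagrant/machines/adoptopenjdkSol10/virtualbox/private_key
```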

Noting that `vagrant ssh` by default also uses `-o LogLevel=FATAL -o Compression=yes -o IdentitiesOnly=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null`, but those aren't mandatory.

Steps to recreate

scp 150.239.60.120:/home/will/solaris10_homemade_v2.box . 
sudo apt-get -y update && sudo apt install -y joe vagrant virtualbox
vagrant box add --name solaris10 solaris10_homemade_v2.box
wget -O Vagrantfile https://raw.githubusercontent.com/adoptium/infrastructure/master/ansible/vagrant/Vagrantfile.Solaris10
vagrant up
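
To sanity-check the result, something like the following should show the box registered and the guest running (generic vagrant commands, not from the original notes):

```sh
vagrant box list     # should list "solaris10"
vagrant status       # should report the guest as "running (virtualbox)"
vagrant ssh-config   # shows the forwarded port and private key used by the ssh examples
```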

Working system types (Numbers in brackets are cores/memGB):

Failing system types (reboot loop in the VM)

Failures with no VMX/SVM:

I'd ideally have an 8/16 or 16/32, but these seem only to be available in configurations that don't work, from the ones I've found so far :-(

sxa commented 8 months ago

Created new system on dockerhost-azure-ubuntu2204-x64-1 which has had vagrant and virtualbox installed from the adoptium repositories. This machine has had ssh exposed via port 2200 on the host, although the algorithm requirements mean there are issues connecting to it. I have set it up in jenkins using JNLP for now.

A build ran locally completed in about 20 minutes.

The AQA pipeline job has been run at https://ci.adoptium.net/job/AQA_Test_Pipeline/220/ although that may need a re-run since it was running during today's jenkins update. The "Second run" table below from job 221 is after the /etc/hosts fix and after the jenkins upgrade was fully complete:

| Job | First run | Second run |
| --- | --- | --- |
| sanity.openjdk | link 😢 [1] | link |
| extended.openjdk | link 😢 [1] | link 😢 (10 failures) |
| sanity.perf | link | link |
| extended.perf | link | link |
| sanity.system | link 😢 [2] | link 😢 [2] |
| sanity.functional | link | link |
| extended.functional | link | link |
| special.functional | link | link |

| Key | Description |
| --- | --- |
|  | Job passed |
| 😢 | Job completed with failures |
|  | Job fell over and didn't run to completion |

[1] - Many of these were "unable to resolve hostname" errors - I have manually added azsol10b to /etc/hosts, although this may well get resolved on a reboot.
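
For reference, the manual workaround amounts to adding a hosts entry for the guest's own hostname; a sketch in which mapping it to the loopback address is an assumption (only the hostname azsol10b comes from the comment above):

```sh
# Run as root on the Solaris guest so that its own hostname resolves locally,
# avoiding the "unable to resolve hostname" test failures.
echo "127.0.0.1 azsol10b" >> /etc/hosts
```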

[2] - Message (Noting that /export/home is a 22Gb file system with 90% free at the start of a test job):

11:41:54  There is 2499 Mb free
11:41:54  
11:41:54  Test machine has only 2499 Mb free on drive containing /export/home/jenkins/workspace/
11:41:54  There must be at least 3Gb (3072Mb) free to be sure of capturing diagnostics
11:41:54  files in the event of a test failure.

Re-queuing extended.system after creating a dummy 1Gb file to fix the buggy space detection: https://ci.adoptium.net/job/Test_openjdk8_hs_extended.system_x86-64_solaris/376/console PASSED ✅
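
For reference, creating the placeholder file on the Solaris guest is a one-liner; the path and filename here are assumptions, only the 1Gb size comes from the comment above:

```sh
# Reserve a 1Gb dummy file on /export/home as the workaround described above
# for the buggy free-space detection.
mkfile 1g /export/home/jenkins/dummy-1gb
```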

So we're left with the ten failures from extended.openjdk.

sxa commented 8 months ago

So we're left with the ten failures from extended.openjdk. Re-running the appropriate targets in Grinder:

| Grinder | Machine | Build | Time | Result |
| --- | --- | --- | --- | --- |
| 9047 | azure-1 | release | 2h36m | 9 failures |
| 9048 | esxi-bld-1 | release | 1h39m | 2 failures: jdk_security3_0, jdk_tools_0 |
| 9049 | esxi-test-1 | release | 1h42m | 2 failures: jdk_security3_0, jdk_tools_0 |
| 9050 | esxi-test-1 | nightly | n/a | |
| 9051 | azure-1 | nightly | 1h42m | |
| 9052 | esxi-test-1 | nightly | 3h23 | 5 failures |
| 9053 | esxi-test-1 | nightly | - | Repeat for good measure |

sxa commented 8 months ago

Starting over with a cleaner setup now that we have prototyped this. Both of the dockerhost machines have had a /home/solaris file system created, alongside an appropriate user with enough space to host the VMs. The Vagrantfile is under a subdirectory of solaris' home with the same name as the machine, and the vagrant processes will run as that user (a rough sketch of this layout follows the table below):

| Host | Guest |
| --- | --- |
| dockerhost-skytap-ubuntu2204-x64-1 | build-skytap-solaris10-x64-1 |
| dockerhost-azure-ubuntu2204-x64-1 | test-skytap-solaris10-x64-1 |
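
A minimal sketch of that host-side layout; only the user name and the per-guest subdirectory convention come from the comment above, the individual commands are assumptions:

```sh
# Example for dockerhost-azure-ubuntu2204-x64-1; the skytap host follows the
# same pattern with its own guest name. /home/solaris is assumed to already be
# mounted on the larger filesystem mentioned above.
sudo useradd --home-dir /home/solaris --shell /bin/bash solaris
sudo chown solaris:solaris /home/solaris
sudo -u solaris mkdir /home/solaris/test-skytap-solaris10-x64-1
# That subdirectory holds the guest's Vagrantfile (and its .vagrant state), and
# all vagrant commands are run from it as the solaris user.
```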

The setup process uses the box we defined previously (this is a repeat of the steps from an earlier comment in this issue):

scp 150.239.60.120:/home/will/solaris10_homemade_v2.box . 
sudo apt-get -y update && sudo apt install -y joe vagrant virtualbox
vagrant box add --name solaris10 solaris10_homemade_v2.box
wget -O Vagrantfile https://raw.githubusercontent.com/adoptium/infrastructure/master/ansible/vagrant/Vagrantfile.Solaris10
vagrant up

Noting that I started getting issues with the audio driver:

Stderr: VBoxManage: error: Failed to construct device 'ichac97' instance #0 (VERR_CFGM_NOT_ENOUGH_SPACE)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component ConsoleWrap, interface IConsole

This can be solved by disabling audio support in the VirtualBox UI for the machine (it's unclear why it started happening when it was previously OK on the Azure machine).
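
The same change should also be possible from the command line while the VM is powered off; a sketch using the VirtualBox 6.1 syntax, where the VM name is an assumption (check `VBoxManage list vms` for the real one):

```sh
# Disable the emulated ichac97 audio device for the guest (VM must be powered off).
VBoxManage modifyvm "adoptopenjdkSol10" --audio none
```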

To connect to the machine use the following, after which you can enable an appropriate key for the root user via sudo, and adjust /etc/ssh/sshd_config to allow root logins `without-password`:

`ssh vagrant@127.0.0.1 -p 2222 -o HostKeyAlgorithms=ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519 -o PubKeyAcceptedKeyTypes=ssh-rsa -i .vagrant/machines/adoptopenjdkSol10/virtualbox/private_key`
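
A sketch of those follow-up steps on the Solaris 10 guest; the key path and SMF restart are assumptions (on Solaris 10 root's home directory is /), and the key itself is elided:

```sh
# Run on the guest after logging in as the vagrant user.
sudo mkdir -p /.ssh && sudo chmod 700 /.ssh
echo "ssh-rsa AAAA...your-key... infra" | sudo tee -a /.ssh/authorized_keys
# Allow key-based (but not password) root logins, then restart sshd via SMF.
sudo sh -c 'echo "PermitRootLogin without-password" >> /etc/ssh/sshd_config'
sudo svcadm restart svc:/network/ssh:default
```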

Until we get jenkins able to ssh to these machines, I am starting them with the following script:

#!/bin/sh
# Keep the Jenkins agent on the Solaris guest alive: if it exits or loses its
# connection, wait five minutes and reconnect.
PATH=/usr/local/bin:/opt/csw/bin:/usr/lib/jvm/bell-jdk-11.0.18/bin:$PATH; export PATH
# Preload the fallocate shim for 64-bit processes (workaround used on these Solaris guests).
LD_PRELOAD_64=/usr/lib/jvm/fallocate.so; export LD_PRELOAD_64
while true; do
  # JNLP/inbound agent connection to ci.adoptium.net (secret and name redacted).
  java -jar agent.jar -url https://ci.adoptium.net/ -secret XXXXX -name "XXXXX" -workDir "/export/home/jenkins"
  sleep 300
done
sxa commented 8 months ago

Systems are live and operating as expected. Note to infra team: you can switch to the solaris user on the machine and, from the machine's subdirectory, use the ssh command in the previous comment to connect to the Solaris guest. I've added the team's keys onto the machine too so you can get to it as the root user.

/etc/hosts had to be updated manually to have an entry for the hostname output - we should have the playbooks doing that if we can - hopefully it won't disappear on restart since I've adjusted /etc/hostname accordingly.

This could do with being documented somewhere else, but since the systems are operational (other than https://github.com/adoptium/aqa-tests/issues/5127, which is being tracked in that issue) I'm closing this issue.

steelhead31 commented 3 months ago

When creating an Azure VM that supports nested virtualization, the following restrictions are in place:

- Must be a TYPE D or TYPE E machine, of Version 3.
- Must only use the "Standard" security model; trusted launch should not be used/enabled.
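
For reference, a sketch of creating such a VM with the Azure CLI; the resource group, name, and image are placeholders, and the size is just one example of a v3 D-family size (the key point is `--security-type Standard`, i.e. no trusted launch):

```sh
az vm create \
  --resource-group <rg> \
  --name <vm-name> \
  --image Ubuntu2204 \
  --size Standard_D4s_v3 \
  --security-type Standard \
  --admin-username azureuser \
  --generate-ssh-keys
```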