Noting that the licensing for ESXi was recently changed by Broadcom, so it is likely that it will not be possible to utilise it for the replacement.
@steelhead31 @Haroon-Khel Have either of you used Solaris VMs with the libvirt/kvm provider in vagrant instead of virtualbox?
I have not
Nor have I; there are some libvirt vagrant boxes available on vagrantup, though.
Using a "standard" VM of a D4 specification (which supports nested virtualisation) allows vagrant to work successfully without a reboot loop. Note that a d3as-V4 or B4ls_V2 will not work and gives the following message when attempting to start the VM from Vagrant:
```
Stderr: VBoxManage: error: AMD-V is not available (VERR_SVM_NO_SVM)
```
A Standard D16ds v4 (16 vcpus, 64 GiB memory) works OK.
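As a quick pre-check (my addition, not from the original notes), you can confirm from the Ubuntu host that the Azure VM actually exposes hardware virtualisation before installing anything:
```sh
# A non-zero count means hardware virtualisation (vmx = Intel VT-x,
# svm = AMD-V) is visible to the OS, so VirtualBox can run 64-bit guests.
egrep -c '(vmx|svm)' /proc/cpuinfo
```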
The default ssh configuration on the Ubuntu client will not work to connect to the guest, so you need to connect with:
```sh
ssh vagrant@127.0.0.1 -p 2222 -o HostKeyAlgorithms=ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519 -o PubKeyAcceptedKeyTypes=ssh-rsa -i .vagrant/machines/adoptopenjdkSol10/virtualbox/private_key
```
(Run this from the directory containing the Vagrantfile, since the key path is relative to it. We should probably see if some of these algorithm options can be set in the Vagrantfile.) Note that `vagrant ssh` by default also uses `-o LogLevel=FATAL -o Compression=yes -o IdentitiesOnly=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null`, but those aren't mandatory.
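Until the Vagrantfile option is investigated, one workaround is to persist the algorithm settings in the ssh client configuration. A hedged sketch; the `sol10-vagrant` alias and the key path are hypothetical and need adjusting per checkout:
```sh
# Append a host alias so a plain `ssh sol10-vagrant` picks up the legacy
# algorithms; IdentityFile should be the absolute path to the Vagrant key.
cat >> ~/.ssh/config <<'EOF'
Host sol10-vagrant
  HostName 127.0.0.1
  Port 2222
  User vagrant
  HostKeyAlgorithms ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519
  PubkeyAcceptedKeyTypes ssh-rsa
  IdentityFile /path/to/vagrant-dir/.vagrant/machines/adoptopenjdkSol10/virtualbox/private_key
EOF
```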
Steps to recreate
```sh
scp 150.239.60.120:/home/will/solaris10_homemade_v2.box .
sudo apt-get -y update && sudo apt install -y joe vagrant virtualbox
vagrant box add --name solaris10 solaris10_homemade_v2.box
wget -O Vagrantfile https://raw.githubusercontent.com/adoptium/infrastructure/master/ansible/vagrant/Vagrantfile.Solaris10
vagrant up
```
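A couple of optional sanity checks (my addition, not part of the original steps) to confirm the box imported and the VM came up:
```sh
vagrant box list    # should list the solaris10 box added above
vagrant status      # should report the machine as "running" after vagrant up
```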
Working system types (numbers in brackets are cores/memory in GB):
Failing system types (reboot loop in the VM)
Failures with no VMX/SVM:
I'd ideally have an 8/16 or 16/32, but those seem to be available only in configurations that don't work, among the ones I've found so far :-(
Created a new system on dockerhost-azure-ubuntu2204-x64-1, which has had vagrant and virtualbox installed from the adoptium repositories. This machine has had ssh exposed via port 2200 on the host, although the algorithm requirements mean there are issues connecting to it, so I have set it up in jenkins using JNLP for now.
A build ran locally completed in about 20 minutes.
The AQA pipeline job has been run at https://ci.adoptium.net/job/AQA_Test_Pipeline/220/ although that may need a re-run since it was running during today's jenkins update. The "Second run" table below from job 221 is after the /etc/hosts fix and after the jenkins upgrade was fully complete:
Job | First run | Second run |
---|---|---|
sanity.openjdk | link 😢 [1] | link ✅ |
extended.openjdk | link 😢 [1] | link 😢 (10 failures) |
sanity.perf | link ✅ | link ✅ |
extended.perf | link ❌ | link ✅ |
sanity.system | link ❌ | link ✅ |
extended.system | link 😢 [2] | link 😢 [2] |
sanity.functional | link ❌ | link ✅ |
extended.functional | link ❌ | link ✅ |
special.functional | link ✅ | link ✅ |
Key | Description |
---|---|
✅ | Job passed |
😢 | Job completed with failures |
❌ | Job fell over and didn't run to completion |
[1] - Many of these were "unable to resolve hostname" errors - I have manually added `azsol10b` to `/etc/hosts`, although this may well get resolved on a reboot.
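For illustration, the manual fix amounts to a one-line hosts entry of this shape (the address shown is hypothetical; use the VM's real one):
```sh
# Map the expected hostname to the VM's address inside the guest
echo "10.1.2.3   azsol10b" >> /etc/hosts
```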
[2] - Message (noting that `/export/home` is a 22Gb file system with 90% free at the start of a test job):
```
11:41:54 There is 2499 Mb free
11:41:54
11:41:54 Test machine has only 2499 Mb free on drive containing /export/home/jenkins/workspace/
11:41:54 There must be at least 3Gb (3072Mb) free to be sure of capturing diagnostics
11:41:54 files in the event of a test failure.
```
Re-queuing extended.system after creating a dummy 1Gb file to fix the buggy space detection: https://ci.adoptium.net/job/Test_openjdk8_hs_extended.system_x86-64_solaris/376/console PASSED ✅
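For reference, Solaris has mkfile(1M) for exactly this kind of padding file; this is a guess at the command used, with an illustrative path:
```sh
# Create a 1Gb placeholder file on the file system the tests run from
mkfile 1g /export/home/jenkins/dummy-1g
```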
So we're left with the ten failures from extended.openjdk. Re-running the appropriate targets in Grinder:
Grinder | Machine | Time | Result |
---|---|---|---|
9047 | azure-1 release | 2h36m | 9 failures |
9048 | esxi-bld-1 release | 1h39m | 2 failures: `jdk_security3_0`, `jdk_tools_0` |
9049 | esxi-test-1 release | 1h42m | 2 failures: `jdk_security3_0`, `jdk_tools_0` |
9050 | esxi-test-1 nightly | n/a | ✅ |
9051 | azure-1 nightly | 1h42m | ✅ |
9052 | esxi-test-1 nightly | 3h23m | 5 failures |
9053 | esxi-test-1 nightly | - | Repeat for good measure |
Starting over with a cleaner setup now that we have prototyped this. Both of the dockerhost machines have had a /home/solaris file system created, along with an appropriate user, with enough space to host the VMs. The Vagrantfile is under a subdirectory of the solaris user's home with the same name as the machine, and the vagrant processes will run as that user (a rough sketch of the layout follows the table):
Host | Guest |
---|---|
dockerhost-skytap-ubuntu2204-x64-1 | build-skytap-solaris10-x64-1 |
dockerhost-azure-ubuntu2204-x64-1 | test-skytap-solaris10-x64-1 |
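A hedged sketch of how that layout might be recreated on a dockerhost; the block device name is hypothetical and the real machines may have been set up differently:
```sh
sudo mkdir -p /home/solaris
sudo mount /dev/sdc1 /home/solaris        # dedicated file system for the VMs
sudo useradd -d /home/solaris -s /bin/bash solaris
sudo chown solaris:solaris /home/solaris
# One subdirectory per guest, named after it, holding that guest's Vagrantfile
sudo -iu solaris mkdir build-skytap-solaris10-x64-1
```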
The setup process uses the box we defined previously (this is a repeat of the steps from an earlier comment in this issue):
```sh
scp 150.239.60.120:/home/will/solaris10_homemade_v2.box .
sudo apt-get -y update && sudo apt install -y joe vagrant virtualbox
vagrant box add --name solaris10 solaris10_homemade_v2.box
wget -O Vagrantfile https://raw.githubusercontent.com/adoptium/infrastructure/master/ansible/vagrant/Vagrantfile.Solaris10
vagrant up
```
Noting that I started getting issues with the audio driver:
```
Stderr: VBoxManage: error: Failed to construct device 'ichac97' instance #0 (VERR_CFGM_NOT_ENOUGH_SPACE)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component ConsoleWrap, interface IConsole
```
This can be solved by disabling audio support in the VirtualBox UI for the machine (unclear why it started happening when it was previously OK on the Azure machine).
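If UI access is awkward, the same change can probably be made from the CLI; a hedged example assuming a pre-7.x VirtualBox, where the VM name shown is illustrative (vagrant generates its own names) and the VM must be powered off:
```sh
# Find the generated VM name first, then disable the audio device entirely
VBoxManage list vms
VBoxManage modifyvm "adoptopenjdkSol10" --audio none   # name is illustrative
```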
To connect to the machine use the following, after which you can enable an appropriate key for the root user via sudo, and adjust /etc/ssh/sshd_config to allow root logins (`PermitRootLogin without-password`):
```sh
ssh vagrant@127.0.0.1 -p 2222 -o HostKeyAlgorithms=ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519 -o PubKeyAcceptedKeyTypes=ssh-rsa -i .vagrant/machines/adoptopenjdkSol10/virtualbox/private_key
```
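The guest-side steps are roughly as follows; a hedged sketch assuming the stock Solaris 10 sshd and SMF service name:
```sh
# Run inside the guest as the vagrant user. Rewrite PermitRootLogin so root
# can log in with a key but not a password (Solaris sed has no -i option).
sudo sh -c 'cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak &&
  sed "s/^PermitRootLogin.*/PermitRootLogin without-password/" \
      /etc/ssh/sshd_config.bak > /etc/ssh/sshd_config'
sudo svcadm restart ssh   # restart sshd via SMF to pick up the change
```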
Until we get jenkins able to ssh to these machines, I am starting them with the following script:
```sh
#!/bin/sh
PATH=/usr/local/bin:/opt/csw/bin:/usr/lib/jvm/bell-jdk-11.0.18/bin:$PATH; export PATH
LD_PRELOAD_64=/usr/lib/jvm/fallocate.so; export LD_PRELOAD_64
while true; do
  java -jar agent.jar -url https://ci.adoptium.net/ -secret XXXXX -name "XXXXX" -workDir "/export/home/jenkins"
  sleep 300
done
```
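The loop is typically launched detached inside the guest so it survives the controlling session ending, along these lines (the script name and log path are hypothetical):
```sh
# Run the agent loop in the background, immune to hangups, logging to a file
nohup /export/home/jenkins/start-agent.sh > /export/home/jenkins/agent.log 2>&1 &
```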
Systems are live and operating as expected. Note to infra team: you can switch to the solaris user on the machine and, from the machine's subdirectory, use the ssh command in the previous comment to connect to the VM. I've added the team's keys onto the machine too, so you can get to it as the root user.
`/etc/hosts` had to be updated manually to have an entry for the `hostname` output - we should have the playbooks doing that if we can - hopefully it won't disappear on restart since I've adjusted `/etc/hostname` accordingly.
This could do with being documented somewhere else, but since the machines are operational (other than https://github.com/adoptium/aqa-tests/issues/5127, which is being tracked in that issue) I'm closing this issue.
When creating an Azure VM that supports nested virtualization, the following restrictions are in place:
Must be a Type D or Type E machine, of version 3. Must only use the "Standard" security model; trusted launch should not be used/enabled.
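For future requests, a hedged Azure CLI example that satisfies those restrictions; the resource group, name, and image are placeholders, and `--security-type` requires a recent azure-cli:
```sh
# Request a v3 D-series size with the Standard (non-trusted-launch) model
az vm create \
  --resource-group adopt-infra \
  --name dockerhost-azure-ubuntu2204-x64-2 \
  --image Ubuntu2204 \
  --size Standard_D4s_v3 \
  --security-type Standard
```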
I need to request a new machine:
Please explain what this machine is needed for: