adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Move Marist machines to the self-service provisioning #2673

Closed sxa closed 2 years ago

sxa commented 2 years ago

To avoid having to go through support for any requests on our Marist systems, they have been trialling a self-service interface for their machines and it is ready to be used as the primary method for provisioning our machines. #2267 has machines which have been provisioned through the new interface and we should start migrating our existing systems across to this too.

The first step will be to ensure we have capacity in the system (at the moment the account I'm using only has 4 machine slots available) and then start duplicating the existing machines in it, followed by decommissioning the existing ones. We will likely look at having at least one dockerhost system in order to have a wider range of distributions tested for Linux/s390x (subject to availability...)

Systems ready for installation:

sxa commented 2 years ago

I'm going to use this as a conclusive verification of a number of other infrastructure PRs that we have in flight just now, so I won't run the playbooks until after they are merged:

Haroon-Khel commented 2 years ago

For future reference: before syncing inventories in AWX you have to update the project source first, so that AWX has the latest inventory file. I had assumed that syncing the inventory pulled the latest inventory file automatically.

Running https://awx2.adoptopenjdk.net/#/jobs/playbook/137?job_search=page_size:20;order_by:-finished;not__launch_type:sync on test-marist-rhel8-s390x-2 as a prelim playbook run

Haroon-Khel commented 2 years ago

Failed at the installation of systemtap-sdt-devel

I've created a new job in AWX which I can use for debugging/testing. It deploys my own branch, https://github.com/Haroon-Khel/openjdk-infrastructure/tree/awx.debug, in which the only change so far is that systemtap-sdt-devel is commented out

https://awx2.adoptopenjdk.net/#/jobs/playbook/143?job_search=page_size:20;order_by:-finished;not__launch_type:sync

Haroon-Khel commented 2 years ago

test-marist-rhel8-s390x-2 is actually a SLES15 machine

test-marist-rhel8-s390x-2:~ # cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"

And test-marist-sles15-s390x-2 is RHEL 8

[root@testrhel8 ~]# cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"
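
A swap like this can be caught by comparing the distribution named in the hostname with the ID field of /etc/os-release. The following is only a sketch: the function name check_os_matches_name is hypothetical, and it assumes hostnames follow the test-marist-<distro><version>-s390x-N pattern seen in this issue.

```shell
#!/bin/sh
# Hypothetical helper: compare the distro named in a hostname with the ID
# field of an os-release file. Assumes names like test-marist-rhel8-s390x-2.
check_os_matches_name() {
    host_name="$1"
    os_release_file="${2:-/etc/os-release}"

    # Third dash-separated field with the trailing version digits removed:
    # test-marist-rhel8-s390x-2 -> rhel8 -> rhel
    expected=$(printf '%s\n' "$host_name" | cut -d- -f3 | sed 's/[0-9]*$//')

    # ID="sles" -> sles (the quotes are optional in os-release)
    actual=$(sed -n 's/^ID="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "$os_release_file")

    if [ "$expected" = "$actual" ]; then
        echo "OK: $host_name runs $actual"
    else
        echo "MISMATCH: $host_name runs $actual, expected $expected"
        return 1
    fi
}
```

Run on the machine itself as `check_os_matches_name test-marist-rhel8-s390x-2`; the second argument is only needed to point at a copy of the file.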

Haroon-Khel commented 2 years ago

Failed at downloading Ant

TASK [ant : Download Apache Ant binaries] **************************************
fatal: [test-marist-rhel8-s390x-2]: FAILED! => {"changed": false, "dest": "/tmp/", "elapsed": 0, "gid": 0, "group": "root", "mode": "01777", "msg": "Request failed: <urlopen error unknown url type: https>", "owner": "root", "size": 255, "state": "directory", "uid": 0, "url": "https://archive.apache.org/dist/ant/binaries/apache-ant-1.10.5-bin.zip"}
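
The "unknown url type: https" message usually means that the Python interpreter running the Ansible module was built without a usable ssl module. That is an assumption about this failure, not something confirmed in the log. A minimal check, with a hypothetical function name, could look like this:

```shell
#!/bin/sh
# Hypothetical check: report whether a given Python interpreter can import
# the ssl module. Defaults to python3 if no interpreter is given.
python_has_ssl() {
    interpreter="${1:-python3}"
    if "$interpreter" -c 'import ssl' 2>/dev/null; then
        echo "$interpreter: ssl module available"
    else
        echo "$interpreter: ssl module missing or interpreter not found"
        return 1
    fi
}
```

Calling it for each interpreter on the host, for example `python_has_ssl /usr/bin/python` and `python_has_ssl /usr/bin/python3`, shows which one is safe to use for downloads over https.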

sxa commented 2 years ago

> Failed at the installation of systemtap-sdt-devel

Presumably that's only on a subset of the OSs?

sxa commented 2 years ago

* Tried deploying to just the RHEL79 build machines - hit https://github.com/adoptium/infrastructure/issues/2700
* Tried deploying to test-marist-ubuntu2204 system - failed because gcc7 PR has not yet been merged
* Tried deploying to the RHEL79 build machines skipping the docker tag
* Redeploy to RHEL79 after removing /etc/yum.repos.d/docker.repo as that was already in place and preventing yum update - PASSED
* Deploying to all test-marist systems (with docker bypassed to be safe for now)

PLAY RECAP *********************************************************************
test-marist-rhel7-s390x-1  : ok=221  changed=100  unreachable=0    failed=0    skipped=307  rescued=0    ignored=1   
test-marist-rhel7-s390x-2  : ok=218  changed=99   unreachable=0    failed=0    skipped=303  rescued=0    ignored=1   
test-marist-rhel8-s390x-1  : ok=18   changed=5    unreachable=0    failed=1    skipped=34   rescued=0    ignored=0   
test-marist-rhel8-s390x-2  : ok=19   changed=1    unreachable=0    failed=1    skipped=27   rescued=0    ignored=0   
test-marist-sles12-s390x-1 : ok=12   changed=2    unreachable=0    failed=1    skipped=25   rescued=0    ignored=0   
test-marist-sles12-s390x-2 : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   
test-marist-sles15-s390x-1 : ok=135  changed=18   unreachable=0    failed=0    skipped=377  rescued=0    ignored=0   
test-marist-sles15-s390x-2 : ok=18   changed=7    unreachable=0    failed=1    skipped=34   rescued=0    ignored=0   
test-marist-ubuntu1604-s390x-1 : ok=162  changed=28   unreachable=0    failed=0    skipped=349  rescued=0    ignored=0   
test-marist-ubuntu1804-s390x-1 : ok=12   changed=1    unreachable=0    failed=1    skipped=24   rescued=0    ignored=0   
test-marist-ubuntu1804-s390x-2 : ok=12   changed=1    unreachable=0    failed=1    skipped=24   rescued=0    ignored=0   
test-marist-ubuntu1804-s390x-3 : ok=111  changed=18   unreachable=0    failed=1    skipped=268  rescued=0    ignored=0   
test-marist-ubuntu1804-s390x-4 : ok=194  changed=85   unreachable=0    failed=0    skipped=317  rescued=0    ignored=0   
test-marist-ubuntu2004-s390x-1 : ok=186  changed=74   unreachable=0    failed=0    skipped=325  rescued=0    ignored=0   
test-marist-ubuntu2204-s390x-1 : ok=22   changed=1    unreachable=0    failed=1    skipped=32   rescued=0    ignored=0   

Failures in Ubuntu 22.04 (will be gcc-7 - PR ready), Ubuntu 18, the new SLES15 and the old SLES12, and RHEL8. Those will need further investigation. I'm pausing for now so someone else can take over, as it's the build machines I really needed :-) But we haven't hit any problems due to the intrusion prevention on those systems, which is promising.
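
The hosts needing follow-up can be pulled out of a saved PLAY RECAP with a short filter. This is a sketch only; the function name failed_hosts is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print hosts whose recap line reports a failed or
# unreachable count above zero. Reads Ansible PLAY RECAP text on stdin.
failed_hosts() {
    awk '/failed=[1-9]/ || /unreachable=[1-9]/ { print $1 }'
}
```

With the recap saved to a file (the name recap.txt is just an example), `failed_hosts < recap.txt` prints one hostname per line.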

sxa commented 2 years ago

> Failed at the installation of systemtap-sdt-devel

This is specific to SLES15. It is installed on the -1 sles15 machine so it's not entirely clear why this message is appearing on the other machines, unless it was bypassed. libc.so.6 is on the machine:

test-marist-sles15-s390x-2:~ # ls -l /lib64/libc.so.6
lrwxrwxrwx 1 root root 12 Nov  5  2021 /lib64/libc.so.6 -> libc-2.26.so
test-marist-sles15-s390x-2:~ # zypper install systemtap-sdt-devel
Refreshing service 'SMT-http_lxslsmt'.
Loading repository data...
Reading installed packages...
Resolving package dependencies...

Problem: nothing provides 'libc.so.6(GLIBC_2.27)(64bit)' needed by the to be installed systemtap-4.6-151.d_t.3.s390x
 Solution 1: do not install systemtap-sdt-devel-4.6-151.d_t.3.s390x
 Solution 2: break systemtap-4.6-151.d_t.3.s390x by ignoring some of its dependencies

Choose from above solutions by number or cancel [1/2/c/d/?] (c): c
test-marist-sles15-s390x-2:~ # 
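
The output above suggests a version gap: the symlink points at libc-2.26.so while the package asks for GLIBC_2.27. A rough way to script that comparison is sketched below. Both function names are hypothetical, and `sort -V` assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: extract the glibc version from a libc.so.6 symlink
# target, e.g. libc-2.26.so -> 2.26
glibc_version_from_link() {
    readlink "$1" | sed -n 's/^libc-\(.*\)\.so$/\1/p'
}

# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on version sort (sort -V) placing the lower version first.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_at_least "$(glibc_version_from_link /lib64/libc.so.6)" 2.27 || echo "glibc too old"` reports the mismatch before zypper is run.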

sxa commented 2 years ago

test-marist-ubuntu1804-s390x-1 and test-marist-ubuntu1804-s390x-2 had these entries in /etc/hosts:

91.189.95.85 ppa.launchpad.net
91.189.88.142 ports.ubuntu.com

This was preventing them from updating themselves - presumably implemented to bypass a temporary problem at some point - the date stamp on the file was:

-rw-r--r-- 1 root root 487 Apr 22  2021 /etc/hosts

I've now commented those lines out on both machines, which should avoid this problem:

root@test-marist-ubuntu1804-s390x-2:~# apt-get update
Err:1 http://ports.ubuntu.com/ubuntu-ports bionic InRelease
  Could not connect to ports.ubuntu.com:80 (91.189.88.142), connection timed out
Err:2 http://ports.ubuntu.com/ubuntu-ports bionic-updates InRelease
  Unable to connect to ports.ubuntu.com:http:
Err:3 http://ports.ubuntu.com/ubuntu-ports bionic-backports InRelease
  Unable to connect to ports.ubuntu.com:http:
Err:4 http://ports.ubuntu.com/ubuntu-ports bionic-security InRelease
  Unable to connect to ports.ubuntu.com:http:
Reading package lists... Done                      
W: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/bionic/InRelease  Could not connect to ports.ubuntu.com:80 (91.189.88.142), connection timed out
W: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/bionic-updates/InRelease  Unable to connect to ports.ubuntu.com:http:
W: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/bionic-backports/InRelease  Unable to connect to ports.ubuntu.com:http:
W: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/bionic-security/InRelease  Unable to connect to ports.ubuntu.com:http:
W: Some index files failed to download. They have been ignored, or old ones used instead.
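
Commenting out pinned entries like these can be scripted. The sketch below uses a hypothetical function name and only handles lines where the hostname is the last field; it keeps a .bak copy of the file it edits.

```shell
#!/bin/sh
# Hypothetical helper: comment out hosts-file lines that pin the given
# hostname to a fixed address. Only matches lines ending in that hostname.
unpin_host() {
    pinned_name="$1"
    hosts_file="${2:-/etc/hosts}"
    # Escape dots so they match literally in the sed pattern.
    pattern=$(printf '%s' "$pinned_name" | sed 's/\./\\./g')
    # Skip lines that are already comments, then prefix matches with "# ".
    sed -i.bak "/^[^#].*[[:space:]]${pattern}\$/s/^/# /" "$hosts_file"
}
```

Run as root, `unpin_host ppa.launchpad.net` and `unpin_host ports.ubuntu.com` would cover the two entries listed above; running it a second time leaves the file unchanged.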

sxa commented 2 years ago

sles12-2 was missing the AWX ssh key - now fixed so that should work now. RHEL8 looks to be trying to install some of the 31-bit (s390) packages which we probably don't need.

steelhead31 commented 2 years ago

@sxa want me to pick up the systemtap-sdt-devel on test-marist-sles15-s390x-2 ?

sxa commented 2 years ago

Sure - please co-ordinate with Haroon in slack.

Haroon-Khel commented 2 years ago

That would be helpful @steelhead31 Thanks

sxa commented 2 years ago

Ubuntu 22.04 looking happier now that https://github.com/adoptium/infrastructure/pull/2691 is merged.

steelhead31 commented 2 years ago

The sles15 playbooks run better using python 3 as the ansible_python_interpreter (which can be specified in the inventory). Also, an issue with the ipv6 configuration on test-marist-sles15-s390x-2 has been resolved by disabling ipv6 as shown below.

1. Edit the file sysctl.conf by executing the command sudo vi /etc/sysctl.conf
2. Add the below 2 lines to the file
  net.ipv6.conf.all.disable_ipv6 = 1
  net.ipv6.conf.default.disable_ipv6 = 1
3. Save and execute the command "sudo sysctl -p". This reloads the settings and disables the ipv6 addresses.
4. Execute the command ip a | grep inet - this should only show ipv4 addresses
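
Steps 1 and 2 above can also be done without an editor. The following is a sketch with a hypothetical function name; it appends the two settings only if they are not already present, so it is safe to run more than once.

```shell
#!/bin/sh
# Hypothetical helper: append the two ipv6 settings to a sysctl configuration
# file unless an identical line is already there.
disable_ipv6_in() {
    conf_file="${1:-/etc/sysctl.conf}"
    for setting in \
        'net.ipv6.conf.all.disable_ipv6 = 1' \
        'net.ipv6.conf.default.disable_ipv6 = 1'
    do
        # -x matches the whole line, -F treats the setting as a fixed string
        grep -qxF "$setting" "$conf_file" 2>/dev/null ||
            printf '%s\n' "$setting" >> "$conf_file"
    done
}
```

After calling `disable_ipv6_in` as root, steps 3 and 4 (`sysctl -p` and `ip a | grep inet`) apply as written.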

sxa commented 2 years ago

From Marist: "Let me know when fully migrated and I can remove the old servers as we are targeting end of September to power off the old storage servers."

sxa commented 2 years ago

@Haroon-Khel Looks like there may be some problems that need addressing: https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk11_hs_sanity.openjdk_s390x_linux/651

Certainly a subset of them are in the compression code (we've seen issues there elsewhere - at least on Ubuntu 20.04 - that run was on 22.04) and if all the failures are related to that it will be good to confirm which distributions and versions it happens on, as there will be implications elsewhere.

Haroon-Khel commented 2 years ago

Nagios should be working on all of the new marist machines except for test-marist-rhel8-s390x-2, which fails with "No package nagios-plugins-all available". Should have a quick solution. @steelhead31 Can you check if the marist machines appear in that view you showed earlier?

sxa commented 2 years ago

Added docker tag onto test-marist-ubuntu2204-s390x-1 as openjdk_build_docker_multiarch builds were getting stuck due to lack of suitable labels. The dockerhost-marist machine is currently unsuitable as despite being in jenkins it appears that it cannot run docker as the jenkins user (See this log from when I tried to add the tag to that machine)

sxa commented 2 years ago

Request for Eclipse to set up two machines for Temurin Compliance:

https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/1917

sxa commented 2 years ago

NOTE: I've brought docker-marist-ubuntu1604-s390x-1 back online in jenkins for now since that one (why not others?) was causing 'temporarily offline in jenkins' messages to appear in the bot channel, but I've switched the docker label to dockerX

We'll need to understand as part of #1716 why the other marist machines which we have disabled (marked offline in jenkins) are not giving the same notifications e.g. https://ci.adoptopenjdk.net/computer/build%2Dmarist%2Drhel77%2Ds390x%2D1/ and https://ci.adoptopenjdk.net/computer/test%2Dmarist%2Dubuntu1804%2Ds390x%2D1/ (and all the other "old" ones)

sxa commented 2 years ago

Temurin Compliance systems still awaiting setup, but otherwise this is complete. Old machines will need to be deprovisioned, but that is due to be done later.

sxa commented 2 years ago

@Haroon-Khel @steelhead31 Can we remove the old machines from Nagios, Jenkins and the inventory files please as they have now been deprovisioned. Full list as follows (Some of these were temporary systems so if you can't find them, that's not a problem):

steelhead31 commented 2 years ago

Will do. Has the ansible inventory been updated with the new IPs / hostnames? I'm starting work on fixing the discrepancies between nagios and ansible today.

sxa commented 2 years ago

> Will do, has the ansible inventory been updated with the new ip's / hostnames ?

Yep the new ones have been live for a few weeks: https://github.com/adoptium/infrastructure/pull/2690/files

In theory removing the ones listed above should only leave the s390x ones added in that PR.

steelhead31 commented 2 years ago

All have now been removed from nagios.

sxa commented 2 years ago

That'll clear up the slack channel a bit then ;-)

sxa commented 2 years ago

The old machines have all been relieved of their duties and returned to Marist.

There is still some more work required to fix some issues that have shown up during this release cycle under #2807 but those can be covered under that issue. The old TCK machines will be decommissioned this week too.

Haroon-Khel commented 1 year ago

Removing the following machines from inventory.yml and jenkins as they've been decommissioned

* https://ci.adoptopenjdk.net/computer/test-marist-sles15-s390x-1/
* https://ci.adoptopenjdk.net/computer/build-marist-rhel77-s390x-1/
* https://ci.adoptopenjdk.net/computer/build-marist-rhel77-s390x-2/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1604-s390x-1/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1804-s390x-1/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1804-s390x-2/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1804-s390x-3/
* https://ci.adoptopenjdk.net/computer/test-marist-ubuntu1804-s390x-4/
* https://ci.adoptopenjdk.net/computer/docker-marist-ubuntu1604-s390x-1/