adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Clean up jenkins nodes which are not contactable over ssh #3486

Closed sxa closed 6 months ago

sxa commented 7 months ago

On an ssh failure, jenkins is trying to reconnect to machines about once every half hour. We should analyse the list and ensure we know why each is not contactable, and determine whether to remove it, or remediate it, or whether it is a known temporary outage. There are quite a few, particularly in the test-docker set, so I'm going to tag @Haroon-Khel on this one. This was identified through other work to clear up the jenkins system logs.

Machines which have been non-contactable over ssh by jenkins today

  36 build-alibaba-ubuntu1804-armv8-1
  36 build-alibaba-ubuntu1804-armv8-2
  57 build-spearhead-freebsd12-x64-1
  57 C3jenkins
   4 dockerhost-azure-ubuntu2204-x64-2
  13 dockerhost-marist-ubuntu2204-s390x-1
  56 dockerhost-skytap-ubuntu2204-x64-1
  57 test-alibaba-ubuntu1804-armv8-1
  36 test-alibaba-ubuntu1804-armv8-2
  36 test-aws-ubuntu2004-x64-1
   1 test-docker-alpine314-armv8-3
   1 test-docker-alpine314-x64-1
   1 test-docker-alpine314-x64-2
   1 test-docker-alpine317-x64-1
   1 test-docker-alpine317-x64-2
   1 test-docker-alpine319-armv8-1
  56 test-docker-alpine319-armv8-2
  51 test-docker-alpine319-armv8-3
  52 test-docker-alpine319-armv8-4
  36 test-docker-alpine319-x64-1
  36 test-docker-alpine319-x64-2
  55 test-docker-alpine319-x64-3
  36 test-docker-centos7-x64-1
   1 test-docker-centos8-armv8-1
   1 test-docker-centos8-x64-1
  62 test-docker-centos8-x64-2
   1 test-docker-debain12-armv8l-1
   1 test-docker-debian11-x64-1
   1 test-docker-debian11-x64-2
  36 test-docker-debian12-x64-1
  36 test-docker-debian12-x64-2
  21 test-docker-debian12-x64-3
   1 test-docker-fedora35-x64-1
  62 test-docker-fedora35-x64-2
   1 test-docker-fedora37-x64-1
   1 test-docker-fedora37-x64-2
   1 test-docker-fedora37-x64-3
   1 test-docker-fedora39-armv8l-1
  36 test-docker-fedora39-x64-1
  13 test-docker-sles12-s390x-1
   1 test-docker-sles15-armv8l-1
  13 test-docker-sles15-s390x-1
   1 test-docker-ubi8-x64-1
  62 test-docker-ubi8-x64-2
  36 test-docker-ubi8-x64-3
   1 test-docker-ubuntu1804-armv8l-4
   1 test-docker-ubuntu2004-armv7l-1
   1 test-docker-ubuntu2004-armv7l-2
   1 test-docker-ubuntu2004-armv7l-3
   1 test-docker-ubuntu2004-armv7l-4
   1 test-docker-ubuntu2004-armv7l-5
   1 test-docker-ubuntu2004-armv7l-6
   6 test-docker-ubuntu2004-armv8l-1
  55 test-docker-ubuntu2004-armv8l-2
  55 test-docker-ubuntu2004-armv8l-3
   1 test-docker-ubuntu2004-x64-1
   1 test-docker-ubuntu2004-x64-2
  36 test-docker-ubuntu2004-x64-3
   2 test-docker-ubuntu2004-x64-4
   1 test-docker-ubuntu2204-armv8-1
   1 test-docker-ubuntu2204-armv8-2
  55 test-docker-ubuntu2204-armv8-3
   1 test-docker-ubuntu2204-armv8-4
   6 test-docker-ubuntu2204-armv8l-2
   1 test-docker-ubuntu2204-x64-1
  62 test-docker-ubuntu2204-x64-2
   1 test-docker-ubuntu2204-x64-3
  36 test-docker-ubuntu2204-x64-4
  36 test-docker-ubuntu2204-x64-5
  40 test-docker-ubuntu2204-x64-6
  52 test-docker-ubuntu2310-armv8l-1
  57 test-equinix_esxi-ubuntu2204-x64-2
  57 test-ibmcloud-rhel6-x64-1
  43 test-macincloud-macos1201-x64-1
  43 test-macincloud-macos1201-x64-2
  57 test-osuosl-aix72-ppc64-5
  51 test-osuosl-ubuntu1604-ppc64le-3
  57 test-osuosl-ubuntu1604-ppc64le-4
  36 test-packet-ubuntu1604-armv8-2-OFF
  51 test-rise-debian12-riscv64-4
  57 test-rise-debian12-riscv64-9
  35 trss-node

sxa commented 7 months ago

A number of these are the ones hosted on the equinix machines which are to be decommissioned as part of https://github.com/adoptium/infrastructure/issues/3292:

I guess the docker images have been shut down on those hosts as well as being marked offline which is why jenkins is still trying to connect to them.

sxa commented 7 months ago

It looks like many of the ones with just one entry in the log are ones that have been marked offline in the jenkins UI.
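
As a rough way to apply that observation, the sketch below splits a list of per-node counts into single-entry nodes (worth checking against the offline flag in the UI) and recurring ones. It assumes the list is in "count name" order, one pair per line, as `sort | uniq -c` would produce; the function names and the sample data subset are illustrative only, and a single entry is a hint rather than proof that a node was marked offline.

```python
def parse_counts(text):
    """Parse 'count name' lines (the shape of `sort | uniq -c` output)."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines and anything not shaped like 'count name'
        counts[parts[1]] = int(parts[0])
    return counts


def split_by_recurrence(counts, threshold=1):
    """Return (single, recurring) node-name lists, each sorted by name."""
    single = sorted(n for n, c in counts.items() if c <= threshold)
    recurring = sorted(n for n, c in counts.items() if c > threshold)
    return single, recurring


# A small subset of the counts from the first comment, for illustration.
SAMPLE = """
  57 build-spearhead-freebsd12-x64-1
   1 test-docker-alpine314-x64-1
  62 test-docker-centos8-x64-2
   1 test-docker-fedora37-x64-1
"""

counts = parse_counts(SAMPLE)
single, recurring = split_by_recurrence(counts)
print("single entry (check whether marked offline in the UI):", single)
print("recurring failures (need investigation):", recurring)
```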

The following are on dockerhost.dockerhost-equinix-ubuntu2004-x64-1 and can now be decommissioned. The machines on the ubuntu2204 host have all been removed already:

These have all now been removed from jenkins.

sxa commented 7 months ago

Remaining test-docker machines that are not contactable:

@Haroon-Khel Do you know why the ones marked Altra (dockerhost-equinix Arm64 systems) and the Azure ones here are offline - is that expected?

sxa commented 7 months ago

I've removed the alibaba machines from jenkins. They are still in the inventory file for now.

Jenkins agent node definitions have been backed up to alibababnodes.tar.gz in the nodes directory on the server in case their information is required in the future. Ditto for the trss-node, which was pointing to the old server on AWS.

sxa commented 7 months ago

Remaining test-docker machines that are not contactable:

Noting that these try to connect about once every 20 minutes in a failure case, and take a varying amount of time to fail the connection, up to 825s

Haroon-Khel commented 7 months ago

Of the offline machines in https://github.com/adoptium/infrastructure/issues/3486#issuecomment-2031514284 I'm seeing a lot of "bash: line 1: /usr/lib/jvm/jdk17/bin/java: No such file or directory" errors.

It seems on the dockerhosts, the ports have been changed?

root@dockerhost-azure-ubuntu2204-x64-2:~# docker ps | grep 32771
e60c862b5614   aqa_alp319     "/usr/sbin/sshd -D"   2 weeks ago   Up 6 days   0.0.0.0:32768->22/tcp, :::32768->22/tcp           ALP319.32771
2d2c4cbe2944   aqa_u2004      "/usr/sbin/sshd -D"   2 weeks ago   Up 6 days   0.0.0.0:32771->22/tcp, :::32771->22/tcp           U2004.32768
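
The swap visible above (ALP319.32771 published on 32768, U2004.32768 published on 32771) could be detected across a whole dockerhost with something like the following sketch. It assumes the container name suffix encodes the port the Jenkins agent expects, which is inferred from the names here rather than confirmed, and that the input is a "name ports" listing such as `docker ps --format '{{.Names}} {{.Ports}}'` would give. The function name is hypothetical.

```python
import re

# Captures the host port of a published sshd mapping such as "0.0.0.0:32768->22/tcp".
SSH_MAPPING = re.compile(r"0\.0\.0\.0:(\d+)->22/tcp")


def find_port_mismatches(listing):
    """Return (name, expected_port, actual_port) for each container whose
    name suffix disagrees with the host port published for sshd."""
    mismatches = []
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, ports = line.partition(" ")
        _, dot, suffix = name.rpartition(".")
        if not dot or not suffix.isdigit():
            continue  # name does not follow the assumed NAME.PORT convention
        match = SSH_MAPPING.search(ports)
        if match is None:
            continue  # no sshd port published on 0.0.0.0
        actual = int(match.group(1))
        if actual != int(suffix):
            mismatches.append((name, int(suffix), actual))
    return mismatches


# The two containers from the output above, in "name ports" form.
LISTING = """\
ALP319.32771 0.0.0.0:32768->22/tcp, :::32768->22/tcp
U2004.32768 0.0.0.0:32771->22/tcp, :::32771->22/tcp
"""

for name, expected, actual in find_port_mismatches(LISTING):
    print(f"{name}: name says {expected}, sshd is published on {actual}")
```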

I wonder what caused this

sxa commented 7 months ago

The one I've just looked at seemed to be trying to use jdk21 instead of jdk17 which is on the machine so maybe there is some inconsistency there. If it's not that on your machine, maybe just double check it's got the JDK for the correct architecture e.g.

root@dockerhost-azure-ubuntu2204-x64-2:~# docker exec U2004.32768 file /usr/lib/jvm/jdk17/bin/java
/usr/lib/jvm/jdk17/bin/java: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.18, not stripped
root@dockerhost-azure-ubuntu2204-x64-2:~# 
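
Where `file` is not installed in a container, the same architecture check can be approximated by reading the ELF header directly. This is a minimal sketch covering only a handful of `e_machine` values from the ELF specification; the function names are illustrative, and it does not check the interpreter or libc the binary was linked against.

```python
import struct

# e_machine values from the ELF specification for a few architectures.
ELF_MACHINES = {
    21: "ppc64",
    22: "s390",
    40: "arm",
    62: "x86-64",
    183: "aarch64",
    243: "riscv",
}


def elf_machine(header):
    """Return the architecture name from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is the 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown ({machine})")


def java_arch(path):
    """Read just enough of the file at `path` to identify its architecture."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20))
```

Calling `java_arch("/usr/lib/jvm/jdk17/bin/java")` inside the container should name the same architecture that `file` reports above.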
Haroon-Khel commented 7 months ago

The ports were rearranged such that alpine nodes became debian/ubuntu/fedora nodes. Alpine uses jdk21 while the others use 17 hence the confusion in jenkins. We use 21 on x64 and arm64 alpine because there is no arm64 alpine jdk17 binary

sxa commented 7 months ago

Also noting that we're getting "Attempting to reconnect test-ibmcloud-rhel6-x64-1" for a machine which has just been marked offline in the jenkins UI. That is "somewhat unexpected" since there are no obvious connection issues in the log, so I assume this is just a jenkins oddity ...

EDIT: Noting that the "SSH Launch of" message does NOT come up for these machines, so that is the better message to look for.
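
A per-node tally of such messages could be produced with a sketch like the one below. The layout of the log lines is an assumption: it supposes the node name is the first token after the marker text, as in the reconnect message quoted above. The full text of the SSH launch message is not shown in this thread, so the marker is left as a parameter, and the sample lines are hypothetical.

```python
import re
from collections import Counter


def count_nodes(log_lines, marker="Attempting to reconnect"):
    """Count, per node, the lines that contain `marker` followed by a node name."""
    pattern = re.compile(re.escape(marker) + r"\s+(\S+)")
    counts = Counter()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt; real Jenkins log lines will carry more context.
SAMPLE_LOG = [
    "INFO Attempting to reconnect test-ibmcloud-rhel6-x64-1",
    "INFO Attempting to reconnect trss-node",
    "INFO some unrelated message",
    "INFO Attempting to reconnect test-ibmcloud-rhel6-x64-1",
]

for node, n in sorted(count_nodes(SAMPLE_LOG).items()):
    print(f"{n:4d} {node}")
```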

sxa commented 7 months ago

The ports were rearranged such that alpine nodes became debian/ubuntu/fedora nodes. Alpine uses jdk21 while the others use 17 hence the confusion in jenkins. We use 21 on x64 and arm64 alpine because there is no arm64 alpine jdk17 binary

👍🏻

We should consider a migration of everything up to 21 where possible (arm32 and Solaris being the exceptions, although arm32 could have an ea-beta build but I'd rather leave those at 17). Ref https://github.com/adoptium/infrastructure/issues/3442#issuecomment-1994221717

sxa commented 7 months ago

Noting that as per https://github.com/adoptium/infrastructure/issues/1843#issuecomment-765288207 the machine test-aws-ubuntu2004-x64-1 has been decommissioned so I'll remove that from jenkins too. Similarly https://github.com/adoptium/infrastructure/pull/2150/files removed test-osuosl-ubuntu2004-ppc64le-[34] so they are now removed too.

sxa commented 7 months ago

Other than the RISE ones which are offline due to the administrator being away last week, we are left with just two systems showing recurring problems today:

sxa commented 6 months ago

test-docker-ubuntu2004-x64-4 has been rebuilt and now works.

I'm seeing four in the log now, but these are the containers on the Skytap x64 dockerhost, whose credits have expired again despite the reduction in size of that system which was put in place for this month:

sxa commented 6 months ago

Since the skytap machine is down to 6 cores, I'm deleting all of the above agents other than debian12 and UBI8 from the machine.

sxa commented 6 months ago

Closing on the basis that all of these have been resolved other than the Skytap x64 node which is a "known issue"