Closed sxa closed 6 months ago
A number of these are the ones hosted on the equinix machines which are to be decommissioned as part of https://github.com/adoptium/infrastructure/issues/3292:
I guess the docker images have been shut down on those hosts as well as being marked offline which is why jenkins is still trying to connect to them.
It looks like many of the ones with just one entry in the log are ones that have been marked offline in the jenkins UI.
The following are on dockerhost.dockerhost-equinix-ubuntu2004-x64-1 and can now be decommissioned. The machines on the ubuntu2204 host have all been removed already:
These have all now been removed from jenkins.
Remaining test-docker machines that are not contactable:
@Haroon-Khel Do you know why the ones marked Altra (dockerhost-equinix Arm64 systems) and the Azure ones here are offline - is that expected?
I've removed the alibaba machines from jenkins. They are still in the inventory file for now.
Jenkins agent node definitions have been backed up to alibababnodes.tar.gz in the nodes directory on the server in case their information is required in the future.
Ditto for the trss-node which was pointing to the old server on AWS.
Remaining test-docker machines that are not contactable:
Noting that these try to connect about once every 20 minutes in a failure case, and take a varying amount of time to fail the connection, up to 825s.
Of the offline machines in https://github.com/adoptium/infrastructure/issues/3486#issuecomment-2031514284 I'm seeing a lot of errors like this: bash: line 1: /usr/lib/jvm/jdk17/bin/java: No such file or directory
It seems on the dockerhosts, the ports have been changed?
root@dockerhost-azure-ubuntu2204-x64-2:~# docker ps | grep 32771
e60c862b5614 aqa_alp319 "/usr/sbin/sshd -D" 2 weeks ago Up 6 days 0.0.0.0:32768->22/tcp, :::32768->22/tcp ALP319.32771
2d2c4cbe2944 aqa_u2004 "/usr/sbin/sshd -D" 2 weeks ago Up 6 days 0.0.0.0:32771->22/tcp, :::32771->22/tcp U2004.32768
I wonder what caused this
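A rough way to spot this kind of drift is to compare the port in each container name with the host port actually published for sshd. The sketch below is only an illustration: the function name check_port_names is made up here, and it assumes every container follows the NAME.PORT naming seen above and publishes container port 22.

```shell
#!/bin/bash
# Hypothetical helper (not part of the infrastructure repository).
# Reads lines of "<container name> <ports>" on stdin, as produced by:
#   docker ps --format '{{.Names}} {{.Ports}}'
# and reports containers whose name suffix differs from the host port
# published for container port 22.
check_port_names() {
  awk '{
    name = $1
    # Port claimed by the container name, e.g. 32771 in ALP319.32771
    n = split(name, parts, ".")
    expected = parts[n]
    # First host port mapped to container port 22, e.g. 0.0.0.0:32768->22/tcp
    actual = ""
    for (i = 2; i <= NF; i++) {
      if (match($i, /:[0-9]+->22\/tcp/)) {
        actual = substr($i, RSTART + 1, RLENGTH - 1)
        sub(/->.*/, "", actual)
        break
      }
    }
    if (actual != "" && actual != expected)
      printf "MISMATCH %s: name says %s, sshd published on %s\n", name, expected, actual
  }'
}
```

On a dockerhost it could be run as docker ps --format '{{.Names}} {{.Ports}}' | check_port_names. Containers with no published port 22 are skipped silently, which may or may not be what is wanted.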
The one I've just looked at seemed to be trying to use jdk21 instead of jdk17, which is on the machine, so maybe there is some inconsistency there. If it's not that on your machine, maybe just double check it's got the JDK for the correct architecture, e.g.
root@dockerhost-azure-ubuntu2204-x64-2:~# docker exec U2004.32768 file /usr/lib/jvm/jdk17/bin/java
/usr/lib/jvm/jdk17/bin/java: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.18, not stripped
root@dockerhost-azure-ubuntu2204-x64-2:~#
The ports were rearranged such that alpine nodes became debian/ubuntu/fedora nodes. Alpine uses jdk21 while the others use 17 hence the confusion in jenkins. We use 21 on x64 and arm64 alpine because there is no arm64 alpine jdk17 binary
Also noting that we're getting Attempting to reconnect test-ibmcloud-rhel6-x64-1 for a machine which has just been marked offline in the jenkins UI. This is "somewhat unexpected" since there are no obvious connection issues in the log, so I assume this is just a jenkins oddity ...
EDIT: Noting that the SSH Launch of message does NOT come up for these machines, so that is the better message to look for.
The ports were rearranged such that alpine nodes became debian/ubuntu/fedora nodes. Alpine uses jdk21 while the others use 17 hence the confusion in jenkins. We use 21 on x64 and arm64 alpine because there is no arm64 alpine jdk17 binary
👍🏻
We should consider a migration of everything up to 21 where possible (arm32 and Solaris being the exceptions, although arm32 could have an ea-beta build but I'd rather leave those at 17) Ref https://github.com/adoptium/infrastructure/issues/3442#issuecomment-1994221717
Noting that as per https://github.com/adoptium/infrastructure/issues/1843#issuecomment-765288207 the machine test-aws-ubuntu2004-x64-1 has been decommissioned so I'll remove that from jenkins too. Similarly https://github.com/adoptium/infrastructure/pull/2150/files removed test-osuosl-ubuntu2004-ppc64le-[34] so they are now removed too.
Other than the RISE ones which are offline due to the administrator being away last week, we are left with just two systems showing recurring problems today:
test-docker-ubuntu2004-x64-4 as described in https://github.com/adoptium/infrastructure/issues/3352#issuecomment-2039983929
test-packet-ubuntu1604-armv8-2-OFF which looks to have been hosted on one of the ThunderX systems decommissioned as part of https://github.com/adoptium/infrastructure/issues/1897#issuecomment-774261646. Its description is 96 x ARMv8 | 32GB RAM | 240 Gb SSD | hosted by packet.net (XJ) - NOW RUNNING Ubuntu 18.04 as per infra1897. It was defined at 147.75.193.234. I've removed it from jenkins now.
test-docker-ubuntu2004-x64-4 has been rebuilt and now works.
I'm seeing four in the log now, but these are the containers on the Skytap x64 dockerhost, which has used up its credits again despite the reduction in size of that system which was put in place for this month:
Since the skytap machine is down to 6 cores I'm deleting all of the above agents other than debian12 and UBI8 from the machine
Closing on the basis that all of these have been resolved other than the Skytap x64 node which is a "known issue"
On an ssh failure, jenkins is trying to reconnect to machines about once every half hour. We should analyse the list and ensure we know why each is not contactable, and determine whether to remove it, remediate it, or treat it as a known temporary outage. There are quite a few, particularly in the test-docker set, so I'm going to tag @Haroon-Khel on this one. This was identified through other work to clear up the jenkins system logs.
Machines which have been non-contactable over ssh by jenkins today (number of log entries, then machine name):
36 build-alibaba-ubuntu1804-armv8-1
36 build-alibaba-ubuntu1804-armv8-2
57 build-spearhead-freebsd12-x64-1
57 C3jenkins
4 dockerhost-azure-ubuntu2204-x64-2
13 dockerhost-marist-ubuntu2204-s390x-1
56 dockerhost-skytap-ubuntu2204-x64-1
57 test-alibaba-ubuntu1804-armv8-1
36 test-alibaba-ubuntu1804-armv8-2
36 test-aws-ubuntu2004-x64-1
1 test-docker-alpine314-armv8-3
1 test-docker-alpine314-x64-1
1 test-docker-alpine314-x64-2
1 test-docker-alpine317-x64-1
1 test-docker-alpine317-x64-2
1 test-docker-alpine319-armv8-1
56 test-docker-alpine319-armv8-2
51 test-docker-alpine319-armv8-3
52 test-docker-alpine319-armv8-4
36 test-docker-alpine319-x64-1
36 test-docker-alpine319-x64-2
55 test-docker-alpine319-x64-3
36 test-docker-centos7-x64-1
1 test-docker-centos8-armv8-1
1 test-docker-centos8-x64-1
62 test-docker-centos8-x64-2
1 test-docker-debain12-armv8l-1
1 test-docker-debian11-x64-1
1 test-docker-debian11-x64-2
36 test-docker-debian12-x64-1
36 test-docker-debian12-x64-2
21 test-docker-debian12-x64-3
1 test-docker-fedora35-x64-1
62 test-docker-fedora35-x64-2
1 test-docker-fedora37-x64-1
1 test-docker-fedora37-x64-2
1 test-docker-fedora37-x64-3
1 test-docker-fedora39-armv8l-1
36 test-docker-fedora39-x64-1
13 test-docker-sles12-s390x-1
1 test-docker-sles15-armv8l-1
13 test-docker-sles15-s390x-1
1 test-docker-ubi8-x64-1
62 test-docker-ubi8-x64-2
36 test-docker-ubi8-x64-3
1 test-docker-ubuntu1804-armv8l-4
1 test-docker-ubuntu2004-armv7l-1
1 test-docker-ubuntu2004-armv7l-2
1 test-docker-ubuntu2004-armv7l-3
1 test-docker-ubuntu2004-armv7l-4
1 test-docker-ubuntu2004-armv7l-5
1 test-docker-ubuntu2004-armv7l-6
6 test-docker-ubuntu2004-armv8l-1
55 test-docker-ubuntu2004-armv8l-2
55 test-docker-ubuntu2004-armv8l-3
1 test-docker-ubuntu2004-x64-1
1 test-docker-ubuntu2004-x64-2
36 test-docker-ubuntu2004-x64-3
2 test-docker-ubuntu2004-x64-4
1 test-docker-ubuntu2204-armv8-1
1 test-docker-ubuntu2204-armv8-2
55 test-docker-ubuntu2204-armv8-3
1 test-docker-ubuntu2204-armv8-4
6 test-docker-ubuntu2204-armv8l-2
1 test-docker-ubuntu2204-x64-1
62 test-docker-ubuntu2204-x64-2
1 test-docker-ubuntu2204-x64-3
36 test-docker-ubuntu2204-x64-4
36 test-docker-ubuntu2204-x64-5
40 test-docker-ubuntu2204-x64-6
52 test-docker-ubuntu2310-armv8l-1
57 test-equinix_esxi-ubuntu2204-x64-2
57 test-ibmcloud-rhel6-x64-1
43 test-macincloud-macos1201-x64-1
43 test-macincloud-macos1201-x64-2
57 test-osuosl-aix72-ppc64-5
51 test-osuosl-ubuntu1604-ppc64le-3
57 test-osuosl-ubuntu1604-ppc64le-4
36 test-packet-ubuntu1604-armv8-2-OFF
51 test-rise-debian12-riscv64-4
57 test-rise-debian12-riscv64-9
35 trss-node
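For anyone wanting to regenerate a per-machine tally like the one above, something along these lines might work. It is a sketch under assumptions, not the command actually used for this issue: the function name count_reconnects is invented here, and it assumes each relevant log line contains the text "Attempting to reconnect" followed by the node name, as in the message quoted elsewhere in this thread. The log file location is not known and is left to the caller.

```shell
#!/bin/bash
# Hypothetical helper: tally reconnect attempts per node.
# Reads the named log files, or stdin if none are given.
# Assumes lines of the form "... Attempting to reconnect <node-name> ...".
count_reconnects() {
  # Keep only the node name that follows the message, then count repeats.
  sed -n 's/.*Attempting to reconnect \([^ ]*\).*/\1/p' "$@" | sort | uniq -c
}
```

It could be used as count_reconnects /path/to/jenkins.log, where the path is a placeholder. Machines marked offline in the UI may still show up with this message, so the counts probably need a manual sanity check.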