Closed by sxa 4 years ago
Two new machines added, build-marist-rhel77-s390x-3 and build-marist-rhel77-s390x-4, with the same kernel level as the failing machine. It's also worth noting that the failing machine hasn't crashed since it was disabled in Jenkins. I've re-enabled it to see if it falls over tonight.
New machine 148.100.245.197 (-4) has just crashed during a build: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk14/job/jdk14-linux-s390x-openj9/lastFailedBuild/console
build-marist-rhel77-s390x-2 crashed during a jdk14 build: https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk14/job/jdk14-linux-s390x-hotspot/39
The swapfile is set to be disabled on the next reboot on machines 2 to 4. I've rebooted -2 so that will take effect immediately there. We'll see if that makes any difference tonight.
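For anyone wanting to confirm that swap really is off after a reboot, here is a minimal sketch. It assumes the standard Linux `/proc/swaps` layout (one header line, then one line per active swap area); the function name `active_swap_devices` is made up for illustration and is not something used on these machines.

```python
def active_swap_devices(proc_swaps_text):
    """Return the names of the active swap areas listed in /proc/swaps content.

    Hypothetical helper: takes the file's text so it can be checked
    against captured output as well as a live machine.
    """
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header; every later line is one swap area,
    # and its first whitespace-separated field is the device or file name.
    return [line.split()[0] for line in lines[1:] if line.strip()]
```

On a host, passing in the contents of `/proc/swaps` and getting back an empty list would suggest no swap is active.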
-3 has been upgraded to have 16GB of RAM so I've re-enabled it alongside -1 and we'll see if it's any more stable.
-3 worked ok yesterday, although it only seems to have run one of the jobs: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-linux-s390x-openj9-linuxXL/103/consoleFull
Rerunning another pipeline and the following are running on -3:
I think I'll mark -1 offline tonight to force everything to -3 and see what happens
The 16GB system failed again today. I'm going to start logging the failures: -2 https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-linux-s390x-openj9/536/consoleFull
The latest kernel update from Red Hat appears to have resolved this on all machines regardless of memory/swap setup. It was installed on the 19th March and none of the machines have crashed in the last week. OpenJ9 (via @jdekonin) is reporting the same success so I'm going to close this :-)
[root@adoptopenjdk01 ~]# rpm -qi kernel-3.10.0-1062.18.1.el7.s390x
Name : kernel
Version : 3.10.0
Release : 1062.18.1.el7
Architecture: s390x
Install Date: Thu 19 Mar 2020 01:00:29 EDT
Group : System Environment/Kernel
For completeness, the original machine was on this kernel:
[linux1@localhost ~]$ uname -a
Linux localhost.adoptopenjdk.net 3.10.0-957.21.3.el7.s390x #1 SMP Fri Jun 14 02:52:25 EDT 2019 s390x s390x s390x GNU/Linux
The failing ones were on:
[linux1@adoptopenjdk01 ~]$ uname -a
Linux adoptopenjdk01.novalocal 3.10.0-1062.12.1.el7.s390x #1 SMP Thu Dec 12 06:45:30 EST 2019 s390x s390x s390x GNU/Linux
And the new ones are:
[root@adoptopenjdk03 ~]# uname -a
Linux adoptopenjdk03.novalocal 3.10.0-1062.18.1.el7.s390x #1 SMP Wed Feb 12 09:11:02 EST 2020 s390x s390x s390x GNU/Linux
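To compare the three kernel levels above mechanically, something like the following rough sketch could be used. It assumes the release string is the third field of `uname -a` and that RHEL releases look like `version-build.elN.arch`; the names `release_from_uname`, `release_key`, `at_or_past_fixed_level` and `FIXED_RELEASE` are hypothetical. Note that it only says whether a host is at or past the level reported as stable in this thread. It does not prove an older kernel will crash (the original machine on the 957 kernel was not the one falling over).

```python
import re

# Kernel level reported in this thread as stable (assumption: used here
# purely as a reference point for the comparison).
FIXED_RELEASE = "3.10.0-1062.18.1.el7.s390x"


def release_from_uname(line):
    """Return the kernel release, i.e. the third field of `uname -a` output."""
    return line.split()[2]


def release_key(release):
    """Turn '3.10.0-1062.18.1.el7.s390x' into (3, 10, 0, 1062, 18, 1)."""
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version, build = match.groups()
    # Join the upstream version and the distribution build number into one
    # tuple of integers so that 1062.18.1 sorts after 1062.12.1 and 957.21.3.
    return tuple(int(part) for part in (version + "." + build).split("."))


def at_or_past_fixed_level(release, fixed=FIXED_RELEASE):
    """True if `release` is the same as, or newer than, the reference level."""
    return release_key(release) >= release_key(fixed)
```

Fed the three `uname -a` lines quoted above, `at_or_past_fixed_level(release_from_uname(line))` is false for the 957.21.3 and 1062.12.1 hosts and true for the 1062.18.1 host.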
The first build machine seems ok, but the second one (148.100.86.218) is repeatedly falling over:
Looks like a kernel crash as follows: