sxa opened 2 years ago
Time to execute the above job was not unduly affected by the machine load.
There are a few leftover Jenkins processes from yesterday running as the jenkins user, although they are not using significant amounts of CPU time:
```
root@test-osuosl-ubuntu1604-ppc64le-2:/var/log# ps augwwx | grep jenkins
jenkins 1126 0.0 0.0 10560 7872 ? Ss Sep12 0:21 /lib/systemd/systemd --user
jenkins 1130 0.0 0.0 160448 5760 ? S Sep12 0:00 (sd-pam)
jenkins 3086 0.0 0.0 3072 1344 ? S Oct05 0:00 sh -c ulimit -c unlimited && /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -ea -esa -Xmx512m --enable-preview -Xint -XX:+CreateCoredumpOnCrash -Djava.library.path=/home/jenkins/workspace/Grinder/openjdkbinary/openjdk-test-image/hotspot/jtreg/native -cp /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/serviceability/sa/ClhsdbFindPC_no-xcomp-core.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/test/lib jdk.test.lib.apps.LingeredApp f51a16ce-8db7-4dd1-9bae-47f397683477.lck forceCrash
jenkins 3088 0.0 0.5 3023424 49600 ? Dl Oct05 0:03 /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -ea -esa -Xmx512m --enable-preview -Xint -XX:+CreateCoredumpOnCrash -Djava.library.path=/home/jenkins/workspace/Grinder/openjdkbinary/openjdk-test-image/hotspot/jtreg/native -cp /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/serviceability/sa/ClhsdbFindPC_no-xcomp-core.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16649842363953/hotspot_serviceability_1/work/classes/0/test/lib jdk.test.lib.apps.LingeredApp f51a16ce-8db7-4dd1-9bae-47f397683477.lck forceCrash
jenkins 3106 0.0 0.3 38656 30272 ? S Oct05 0:00 /usr/bin/python3 /usr/share/apport/apport 3088 6 18446744073709551615 1 3088 !home!jenkins!workspace!Grinder!openjdkbinary!j2sdk-image!bin!java
root 6087 0.0 0.0 9984 1920 pts/0 S+ 13:38 0:00 grep --color=auto jenkins
root 19468 0.0 0.1 18176 13312 ? Ss Sep13 0:00 sshd: jenkins [priv]
jenkins 19521 0.0 0.1 18816 10304 ? S Sep13 1:57 sshd: jenkins@notty
jenkins 19562 0.0 0.0 10688 2112 ? Ss Sep13 0:00 bash -c cd "/home/jenkins" && java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
jenkins 19563 0.5 4.5 4590656 381440 ? Sl Sep13 175:24 java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 -jar remoting.jar -workDir /home/jenkins -jar-cache /home/jenkins/remoting/jarCache
root@test-osuosl-ubuntu1604-ppc64le-2:/var/log#
```
Interestingly, those hung Grinder processes are likely the ones I was using for replicating/testing https://github.com/adoptium/aqa-tests/issues/4006#issuecomment-1268330876 - the java process in the above output (3088) is not responding to a `kill -KILL`. I might leave it for a while to see if it disappears before triggering a reboot, given that it doesn't seem to be disrupting any other execution at the moment.
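Worth noting from the `ps` output above: process 3088 is in state `Dl`, i.e. uninterruptible sleep, which would explain the ignored SIGKILL - the signal cannot be acted on until the task returns from the kernel. A minimal sketch for confirming where it is blocked, assuming the stuck PID is still 3088:

```sh
# Confirm the process state and the kernel function it is sleeping in
# (assumes the stuck PID is still 3088, as in the ps output above)
ps -o pid,stat,wchan:30,cmd -p 3088

# Kernel-side stack of the task; needs root (which we have on this box)
# and is available on stock Ubuntu kernels
cat /proc/3088/stack

# A task in 'D' (uninterruptible sleep) cannot take SIGKILL until it
# leaves the kernel, consistent with kill -KILL having no effect here
```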
Rebooted.
Generated some more alerts overnight so reopening. The -1 machine is running the same kernel (4.4.0-210-generic) and is not having the same problem. At the time of writing it has a load of zero.
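For reference, a quick way to compare the two hosts - a minimal sketch, assuming the sibling machine is named test-osuosl-ubuntu1604-ppc64le-1 and both accept root ssh:

```sh
# Hypothetical check: confirm both hosts run the same kernel build
for h in test-osuosl-ubuntu1604-ppc64le-1 test-osuosl-ubuntu1604-ppc64le-2; do
  echo -n "$h: "; ssh "$h" uname -r
done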
We are getting warning messages from Nagios that the machine is sitting with a load of 17.00:
> HOST: test-osuosl-ubuntu1604-ppc64le-2 SERVICE: Current Load STATE: WARNING MESSAGE: WARNING - load average: 17.00, 17.00, 17.00 [See Nagios](https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=test-osuosl-ubuntu1604-ppc64le-2)
Note that while it's currently stuck up there, at 09:04 this morning Nagios declared it good again with load averages of `0.04 0.05 1.02`, but then it went up again. This machine is running Ubuntu 16.04.7 and has been online for over a year:
There are no obvious processes using lots of CPU, although there has been a recent kernel exception:
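One thing worth noting: the Linux load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a flat 17.00 with idle CPUs is consistent with tasks stuck in the kernel rather than anything burning cycles. A minimal sketch to enumerate such tasks and look for related kernel messages (command choice is mine, not from the original logs):

```sh
# List tasks in uninterruptible sleep; these count toward load average
# even though they use no CPU, which would explain a flat 17.00 on an idle box
ps -eo state,pid,user,wchan:30,cmd | awk '$1 ~ /^D/'

# Look for hung-task or oops messages around the recent kernel exception
dmesg -T | grep -iE 'hung task|oops|call trace|bug' | tail -n 20
```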
I've kicked off https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_hs_sanity.openjdk_ppc64le_linux/742/ to see whether the machine is actually running slowly as a result of the high load, but I expect a reboot will be in order.
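Independently of that job, a cruder local check is possible: time a short CPU-bound loop on the box itself. If the wall time is normal despite the 17.00 load, the load figure is coming from blocked tasks rather than CPU contention. A sketch (my own, not something run on the machine):

```sh
# Rough local check: time a short CPU-bound task; if wall time is normal
# despite the 17.00 load, the load is from blocked tasks, not CPU use
time sh -c 'i=0; while [ $i -lt 1000000 ]; do i=$((i+1)); done'
```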