adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

System unavailable: test-osuosl-aix71-ppc64-1 #1779

Closed. andrew-m-leonard closed this issue 3 years ago

andrew-m-leonard commented 3 years ago

test-osuosl-aix71-ppc64-1 out of disk space

andrew-m-leonard commented 3 years ago

6.3G /home/jenkins/workspace/Test_openjdk11_j9_extended.functional_ppc64_aix
11G  /home/jenkins/workspace/Test_openjdk8_j9_sanity.functional_ppc64_aix

andrew-m-leonard commented 3 years ago

Culprit: /home/jenkins/workspace/Test_openjdk8_j9_sanity.functional_ppc64_aix/openjdk-tests/TKG/test_output_16086634784597/cmdLineTester_callsitedbgddrext_openj9_0

-rw------- 1 jenkins staff 9592217600 Dec 22 19:46 j9core.dmp
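
A minimal sketch of how an oversized file like this could be located in one pass, rather than by descending directory by directory. It is written in Python purely for illustration: the function name `largest_files` is hypothetical, and it reports apparent file size in bytes (as `ls -l` does), not allocated blocks as `du` does.

```python
import heapq
import os


def largest_files(root, count=5):
    """Return the `count` largest regular files under `root` as (size_in_bytes, path).

    Sizes are apparent sizes (what `ls -l` shows), not allocated disk blocks.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are measured themselves, not followed
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    # nlargest sorts by the first tuple element (size), largest first
    return heapq.nlargest(count, sizes)
```

Calling `largest_files("/home/jenkins/workspace")` on the affected machine would be expected to put a multi-gigabyte `.dmp` file such as the one above at the top of the list.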

sxa commented 3 years ago

Potential duplicate of #1772 but we can leave open until we confirm

sxa commented 3 years ago

No, this is separate. It is down to a relatively small filesystem (~17 GB) and multiple core files being produced, and is not related to the AWT library.

adamfarley commented 3 years ago

Late last week, the test role was added back to this machine. See slack thread for details.

Over the weekend, we saw a slew of space-related failures occurring across all test types on this machine.

Example: https://trss.adoptopenjdk.net/output/build?id=603a43ce5730424dbc92c820

Caused by: hudson.plugins.git.GitException: Command "git init /home/jenkins/workspace/Test_openjdk8_j9_sanity.functional_ppc64_aix/openjdk-tests" returned status code 128:
stdout: 
stderr: error: copy-fd: write returned: No space left on device
fatal: cannot copy '/opt/freeware/share/git-core/templates/description' to '/home/jenkins/workspace/Test_openjdk8_j9_sanity.functional_ppc64_aix/openjdk-tests/.git/description': No space left on device

Also, we see this earlier on in the test. One theory is that one test tried to clone openjdk-tests and the clone failed midway through, after eating up what little remaining free space there was.

 > git rev-parse --is-inside-work-tree # timeout=10
ERROR: Workspace has a .git repository, but it appears to be corrupt.
hudson.plugins.git.GitException: Command "git rev-parse --is-inside-work-tree" returned status code 128:
stdout: 
stderr: fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

aixtools commented 3 years ago

root@p9-aix1-ojdk05:[/root]df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/hd4        4.0G  187M  3.9G   5% /
/dev/hd2        6.0G  4.6G  1.5G  77% /usr
/dev/hd9var     6.0G  1.6G  4.5G  27% /var
/dev/hd3        4.0G  4.0G     0 100% /tmp
/dev/hd1         24G   24G     0 100% /home
/dev/hd11admin  128M  380K  128M   1% /admin
/proc              -     -     0    - /proc
/dev/hd10opt    8.0G  2.0G  6.1G  25% /opt
/dev/livedump   256M  368K  256M   1% /var/adm/ras/livedump
/dev/lvBESC     2.0G  299M  1.8G  15% /var/opt/BESClient
/dev/fslv00     128M  128M     0 100% /audit

There are 32bit and 64bit binary versions available for bash

In this release, process substitution is not completely working. The output of a command might not be redirected correctly when using <(cmd) or >(cmd).

root@p9-aix1-ojdk05:[/tmp]ls -lt sh-np | head
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.BEGaaa
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.BEGaab
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.CRmMaa
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.CRmMab
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.DVmMaa
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.DVmMab
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.Ffr7aa
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.Ffr7ab
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.0uI7aa
prw------- 1 jenkins staff 0 Feb 19 13:23 sh-np.0uI7ab
root@p9-aix1-ojdk05:[/tmp]ls -ltr sh-np | head
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.fvlqab
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.fvlqaa
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.yiAaab
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.yiAaaa
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.yfaMab
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.yfaMaa
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.yeDaab
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.yeDaaa
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.ybl7ab
prw------- 1 jenkins staff 0 Feb 19 13:01 sh-np.ybl7aa
root@p9-aix1-ojdk05:[/tmp]

And I see that, before I could research the rest, someone else made space on the system.

root@p9-aix1-ojdk05:[/tmp]df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/hd4        4.0G  187M  3.9G   5% /
/dev/hd2        6.0G  4.6G  1.5G  77% /usr
/dev/hd9var     6.0G  1.6G  4.5G  27% /var
/dev/hd3        4.0G  123M  3.9G   3% /tmp
/dev/hd1         24G   24G     0 100% /home
/dev/hd11admin  128M  380K  128M   1% /admin
/proc              -     -     0    - /proc
/dev/hd10opt    8.0G  2.0G  6.1G  25% /opt
/dev/livedump   256M  368K  256M   1% /var/adm/ras/livedump
/dev/lvBESC     2.0G  299M  1.8G  15% /var/opt/BESClient
/dev/fslv00     128M  128M     0 100% /audit
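
For anyone wanting to spot the full filesystems in output like this automatically, here is a rough sketch in Python. The function name `full_filesystems` and the 100% default threshold are illustrative assumptions; the sample text is the first `df -h` listing from this comment, and the parser assumes that six-column layout with no spaces in mount points.

```python
# First `df -h` listing from this comment, used as sample input
DF_OUTPUT = """
Filesystem      Size  Used Avail Use% Mounted on
/dev/hd4        4.0G  187M  3.9G   5% /
/dev/hd2        6.0G  4.6G  1.5G  77% /usr
/dev/hd9var     6.0G  1.6G  4.5G  27% /var
/dev/hd3        4.0G  4.0G     0 100% /tmp
/dev/hd1         24G   24G     0 100% /home
/dev/hd11admin  128M  380K  128M   1% /admin
/proc              -     -     0    - /proc
/dev/hd10opt    8.0G  2.0G  6.1G  25% /opt
/dev/livedump   256M  368K  256M   1% /var/adm/ras/livedump
/dev/lvBESC     2.0G  299M  1.8G  15% /var/opt/BESClient
/dev/fslv00     128M  128M     0 100% /audit
"""


def full_filesystems(df_output, threshold=100):
    """Return (mount_point, use_percent) for rows at or above `threshold` percent."""
    full = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Expect: device, size, used, avail, use%, mount point
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # e.g. /proc, which reports "-" instead of a percentage
        percent = fields[4].rstrip("%")
        if percent.isdigit() and int(percent) >= threshold:
            full.append((fields[5], int(percent)))
    return full


print(full_filesystems(DF_OUTPUT))
# → [('/tmp', 100), ('/home', 100), ('/audit', 100)]
```

Run against the second listing instead, the same function would no longer report /tmp, since it had dropped to 3% by then.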

sxa commented 3 years ago

Machine running out of disk space due to multiple cores being generated during execution of Test_openjdk17_j9_extended.functional_ppc64_aix

FYI @Haroon-Khel @aixtools: if this isn't occurring on other machines, we need to find out what the issue is on this machine. Marking it offline for now.

aixtools commented 3 years ago

OK - looking in /home/jenkins/workspace - lots of directories with 0 MB, and then:

0       build-scripts
1       Grinder@tmp
1       Test_openjdk16_j9_extended.functional_ppc64_aix@tmp
1       Test_openjdk17_j9_extended.functional_ppc64_aix@tmp
1       workspaces.txt
24184   Test_openjdk17_j9_extended.functional_ppc64_aix

root@p9-aix1-ojdk05:[/home/jenkins/workspace]cd Test_openjdk17_j9_extended.functional_ppc64_aix
root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix]du -sm * | sort -n
3       functional_test_output.tar.gz
428     jvmtest
1233    openjdkbinary
22522   openjdk-tests

root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix/openjdk-tests]du -sm * | sort -n
1       LICENSE
1       NOTICE
1       README.md
1       SECURITY.md
1       TestConfig
1       Utils
1       autoGen.mk
1       buildenv
1       external
1       get.sh
1       jck
1       openjdk
1       system
2       doc
26      perf
137     functional
22292   TKG

root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix/openjdk-tests/TKG]du -sm * | sort -n
1       LICENSE
1       README.md
1       SECURITY.md
1       SHA.txt
1       autoGenEnv.mk
1       bin
1       clean.mk
1       compile.mk
1       envSettings.mk
1       featureSettings.mk
1       makeGen.mk
1       makefile
1       moveDmp.mk
1       openj9Settings.mk
1       playlist.xsd
1       resources
1       runtest.mk
1       scripts
1       settings.mk
1       src
1       testEnv.mk
1       utils.mk
5       lib
22287   output_16143343964904

root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix/openjdk-tests/TKG/output_16143343964904]du -sm * | sort -n
2       TestTargetResult
3330    threadMXBeanTimedParkTest_2
9475    threadMXBeanTestSuite2_2
9482    threadMXBeanTestSuite1_6
root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix/openjdk-tests/TKG/output_16143343964904]ls -l
total 1696
-rw------- 1 jenkins staff 1736704 Feb 26 11:37 TestTargetResult
drwx------ 2 jenkins staff     256 Feb 26 11:17 threadMXBeanTestSuite1_6
drwx------ 2 jenkins staff     256 Feb 26 11:30 threadMXBeanTestSuite2_2
drwx------ 2 jenkins staff     256 Feb 26 11:37 threadMXBeanTimedParkTest_2

root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix/openjdk-tests/TKG/output_16143343964904]cd threadMXBeanTimedParkTest_2
root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix/openjdk-tests/TKG/output_16143343964904/threadMXBeanTimedParkTest_2]du -sm * | sort -n
3330    core.20210226.113712.28639248.0001.dmp

root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix/openjdk-tests/TKG/output_16143343964904/threadMXBeanTimedParkTest_2]cd ..
root@p9-aix1-ojdk05:[/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64_aix/openjdk-tests/TKG/output_16143343964904]ls -lR
.:
total 1696
-rw------- 1 jenkins staff 1736704 Feb 26 11:37 TestTargetResult
drwx------ 2 jenkins staff     256 Feb 26 11:17 threadMXBeanTestSuite1_6
drwx------ 2 jenkins staff     256 Feb 26 11:30 threadMXBeanTestSuite2_2
drwx------ 2 jenkins staff     256 Feb 26 11:37 threadMXBeanTimedParkTest_2

./threadMXBeanTestSuite1_6:
total 9708632
-rw------- 1 jenkins staff     438972 Feb 26 11:17 Snap.20210226.111640.31391816.0003.trc
-rw------- 1 jenkins staff 9934845663 Feb 26 11:17 core.20210226.111640.31391816.0001.dmp
-rw------- 1 jenkins staff    1115725 Feb 26 11:17 javacore.20210226.111640.31391816.0002.txt
-rw------- 1 jenkins staff    8580476 Feb 26 11:17 jitdump.20210226.111640.31391816.0005.dmp

./threadMXBeanTestSuite2_2:
total 9701616
-rw------- 1 jenkins staff     446780 Feb 26 11:30 Snap.20210226.112933.28639480.0003.trc
-rw------- 1 jenkins staff 9936288391 Feb 26 11:30 core.20210226.112933.28639480.0001.dmp
-rw------- 1 jenkins staff    1121857 Feb 26 11:30 javacore.20210226.112933.28639480.0002.txt
-rw------- 1 jenkins staff        202 Feb 26 11:30 jitdump.20210226.112933.28639480.0005.dmp

./threadMXBeanTimedParkTest_2:
total 3409180
-rw------- 1 jenkins staff 3494121472 Feb 26 11:37 core.20210226.113712.28639248.0001.dmp

I am copying the TKG directory - before removing it, so someone with understanding can look at the .dmp files
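
To put the three core files in the listing above into perspective, here is the arithmetic as a small Python sketch. The byte counts are copied from the `ls -lR` output; the dictionary name and the bytes-to-GiB conversion are only for illustration.

```python
# Sizes in bytes of the three core files, taken from the `ls -lR` listing
core_sizes = {
    "threadMXBeanTestSuite1_6/core.20210226.111640.31391816.0001.dmp": 9934845663,
    "threadMXBeanTestSuite2_2/core.20210226.112933.28639480.0001.dmp": 9936288391,
    "threadMXBeanTimedParkTest_2/core.20210226.113712.28639248.0001.dmp": 3494121472,
}

GIB = 1024 ** 3  # bytes per GiB

total = sum(core_sizes.values())
print(f"{total} bytes = {total / GIB:.2f} GiB")
# → 23365255526 bytes = 21.76 GiB
```

If the 24G that `df -h` shows for /home is read as GiB (an assumption), these three files alone account for roughly nine tenths of that filesystem.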

sxa commented 3 years ago

There's a certain irony in the fact that the only machine that can generate core files ran out of disk space when it runs the tests that generate them ;-)

aixtools commented 3 years ago

> There's a certain irony in the fact that the only machine that can generate core files ran out of disk space when it runs the tests that generate them ;-)

I think that is also known as Murphy's Law - in some variation or another.

aixtools commented 3 years ago

Just ran a build on this system, as test-ibm-aix71-ppc64-{1,2} are unavailable: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-aix-ppc64-openj9/385/

@andrew-m-leonard - are you OK with closing this one, as it is no longer relevant (as in no longer occurring)?

andrew-m-leonard commented 3 years ago

yes good, thanks