adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

JDK17 Extended Test Failures On test-osuosl-aix72-ppc64-5 due to filling /tmp #3129

Open steelhead31 opened 1 year ago

steelhead31 commented 1 year ago

The JDK17 extended test suites failed when running on test-osuosl-aix72-ppc64-5 because /tmp filled up.

The error can be seen here (as well as in Nagios): https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_ppc64_aix_testList_2/

The test job appeared to create 2 x 2.1 GB tmp files in /tmp filling the entire file system, and causing tests to fail.

aixtools commented 1 year ago

I am looking at all the systems - and as they are all recently cloned it appears there have been (undocumented?) changes to the system configurations.

When there are issues, these should not be hacked at on the fly. At a minimum, what was done needs to be reported in the issue, and perhaps the playbooks need an update.

As an example: the size of 4G for /tmp was chosen because the test used to be smaller, and 4G was sufficient with nearly 2G to spare. If the test is now doing 2x 2+G, obviously 4G is not going to work.

YET: when I look at the systems /tmp has not been increased, but /var has been increased on two systems.

I cannot second guess what needs to be done when changes are made on the fly.

So, no (known) action taken to resolve this issue. And it looks like it is just waiting to happen again - on different systems.

aixtools commented 1 year ago

This seems to be affecting some, but not all, systems (note the 100% used below):

root@osunim:[/root]dsh-adopt "/usr/bin/df -g /tmp"
adopt01:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%       47     1% /tmp
==============
adopt02:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      4.00    1%       44     1% /tmp
==============
adopt03:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      2.97   26%     1665     1% /tmp
==============
adopt04:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.40   16%      252     1% /tmp
==============
adopt05:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%      535     1% /tmp
==============
adopt06:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      0.00  100%      406     7% /tmp
==============
adopt07:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      0.00  100%      469     9% /tmp
==============
adopt08:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%      503     1% /tmp
==============
adopt10:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%      116     1% /tmp
==============
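
A report like the one above is easy to scan by eye for nine hosts, but the check can also be scripted. The following is a minimal sketch, assuming the output keeps the shape shown here (a `host:` line, a header line, one `/dev/...` line, then a separator). The function names `parse_df_report` and `hosts_over` and the 90% threshold are illustrative and are not part of the existing tooling.

```python
import re

# Three entries copied from the report above.
SAMPLE = """\
adopt05:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%      535     1% /tmp
==============
adopt06:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      0.00  100%      406     7% /tmp
==============
adopt07:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      0.00  100%      469     9% /tmp
==============
"""


def parse_df_report(text):
    """Return {host: (size_gb, free_gb, used_pct)} from dsh-style `df -g` output."""
    report = {}
    host = None
    for line in text.splitlines():
        line = line.strip()
        if re.fullmatch(r"[\w.-]+:", line):
            # A bare "name:" line starts the section for one host.
            host = line[:-1]
        elif host and line.startswith("/dev/"):
            fields = line.split()
            report[host] = (
                float(fields[1]),             # GB blocks
                float(fields[2]),             # Free
                int(fields[3].rstrip("%")),   # %Used
            )
    return report


def hosts_over(report, threshold_pct=90):
    """Return the sorted host names whose %Used is at or above the threshold."""
    return sorted(h for h, (_, _, used) in report.items() if used >= threshold_pct)


if __name__ == "__main__":
    print(hosts_over(parse_df_report(SAMPLE)))
```

On the three sample entries this flags adopt06 and adopt07, matching the 100% lines in the report.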
aixtools commented 1 year ago

Looks like there may still be an artifact:

adopt07:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      0.99   81%      632     1% /tmp


aixtools commented 1 year ago

Just wondering if this is a problem with the test.

      4 -rw-r--r-- 1 jenkins staff          40 Aug 27 16:22 blah4255219114647392657.tmp
      4 -rw-r--r-- 1 jenkins staff         151 Aug 26 17:33 unsigned.jar1450541346237646654jar
      4 -rw-r--r-- 1 jenkins staff         305 Aug 26 15:27 test1723908656910621468.test
      4 -rw-r--r-- 1 jenkins staff         383 Aug 27 17:39 test10517561616964218431.test
      4 -rw-r--r-- 1 jenkins staff         403 Aug 27 17:39 test15750855502312122436.test
      4 -rw-r--r-- 1 jenkins staff         403 Aug 27 17:39 test16807323090330678638.test
      4 -rw-r--r-- 1 jenkins staff        1862 Aug 26 17:33 signed.jar8571074910892324627jar
      4 -rw-r--r-- 1 jenkins staff        1974 Aug 26 17:33 signed2.jar1180279166009667648jar
      4 -rw-r--r-- 1 jenkins staff       32007 Aug 27 16:22 source245824410068849651.tmp
      4 -rw-r--r-- 1 jenkins staff  6442450960 Aug 27 16:23 source1323321727058409565.tmp
      4 -rw-r--r-- 1 root    system          6 Jun 20 10:57 rc.net.out
      4 -rw-r--r-- 1 root    system         24 Jun 20 11:08 NIM_instp_updt_list
      4 -rw-r--r-- 1 root    system         77 Jun 20 10:57 KrsctPHA.saved
      4 -rw-r--r-- 1 root    system       2124 Jun 20 10:57 ctrmc_MDdr.dbg
      4 -rw-rw-r-- 1 root    system         53 Jun 20 10:54 uncfgct.dbg
      4 -rw-rw-r-- 1 root    system        676 Jun 20 10:55 rsct_cfgct_history.log
      8 -rw------- 1 jenkins staff  2147484671 Aug 27 16:22 src6628553441366702378.dat
    136 -rw-r--r-- 1 jenkins staff      138481 Aug 27 13:50 hs_err_pid19726826.log
    200 -rw-rw-r-- 1 root    system     204800 Aug 29 15:00 lvmt.log
   1024 -rw-r--r-- 1 jenkins staff     1048576 Aug 27 16:22 blah17126132827369752914.tmp
2097156 -rw------- 1 jenkins staff  2147484671 Aug 27 16:22 dst1075097809748986483.dat
2097160 -rw-r--r-- 1 jenkins staff  2147484671 Aug 27 16:23 dst2207767882312241704.dat
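
One detail in the listing above is worth checking with arithmetic: the first column (allocated blocks) and the size column do not always agree. The sketch below is a rough reading of the listing, not a confirmed diagnosis. It assumes the block column is in 1 KiB units, which is inferred from the listing itself (the 1048576-byte file shows 1024 blocks). The helper names are illustrative.

```python
# Five lines copied from the listing above.
LISTING = """\
      4 -rw-r--r-- 1 jenkins staff  6442450960 Aug 27 16:23 source1323321727058409565.tmp
      8 -rw------- 1 jenkins staff  2147484671 Aug 27 16:22 src6628553441366702378.dat
   1024 -rw-r--r-- 1 jenkins staff     1048576 Aug 27 16:22 blah17126132827369752914.tmp
2097156 -rw------- 1 jenkins staff  2147484671 Aug 27 16:22 dst1075097809748986483.dat
2097160 -rw-r--r-- 1 jenkins staff  2147484671 Aug 27 16:23 dst2207767882312241704.dat
"""

KIB = 1024          # assumed unit of the first column (inferred, see above)
GIB = 1024 ** 3


def parse_ls_s(text):
    """Return (name, allocated_bytes, apparent_bytes) for each `ls -ls` line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip anything that is not a full listing line
        rows.append((fields[9], int(fields[0]) * KIB, int(fields[5])))
    return rows


def allocated_total(rows):
    """Bytes actually occupied on disk by the listed files."""
    return sum(allocated for _, allocated, _ in rows)


def sparse_files(rows):
    """Names of files whose allocated space is smaller than their apparent size."""
    return [name for name, allocated, apparent in rows if allocated < apparent]


if __name__ == "__main__":
    rows = parse_ls_s(LISTING)
    print(f"allocated: {allocated_total(rows) / GIB:.2f} GiB")
    print("sparse:", sparse_files(rows))
```

Under that assumption, the `source*.tmp` and `src*.dat` files look sparse and occupy almost nothing, while the two `dst*.dat` files account for nearly all of the roughly 4 GiB in use.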

sxa commented 3 days ago

This needs to be examined further in a clean environment, ideally to narrow down which tests in the external suites are causing the problem.

OpenJDK have discussed test cases not always cleaning up after themselves.
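
One way to narrow down which tests leave files behind is to snapshot /tmp before and after each test target and compare the two. The following is a minimal sketch under that assumption. The `snapshot` and `leftovers` helpers are hypothetical and are not part of the existing test framework, and the demo uses a scratch directory in place of /tmp.

```python
import os
import tempfile


def snapshot(directory):
    """Map each regular file in `directory` (non-recursive) to its apparent size in bytes."""
    sizes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                if entry.is_file(follow_symlinks=False):
                    sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue  # the file vanished between listing and stat
    return sizes


def leftovers(before, after):
    """Return (name, size) for files present only in `after`, largest first."""
    new = [(name, size) for name, size in after.items() if name not in before]
    return sorted(new, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Demo in a scratch directory standing in for /tmp.
    with tempfile.TemporaryDirectory() as scratch:
        before = snapshot(scratch)
        with open(os.path.join(scratch, "dst-example.dat"), "wb") as handle:
            handle.write(b"x" * 4096)  # stand-in for a file a test forgot to delete
        after = snapshot(scratch)
        for name, size in leftovers(before, after):
            print(f"{size:>12} {name}")
```

The sizes reported are apparent sizes, so a sparse file would look larger here than the space it really occupies.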