adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

test-azure-win2012r2-x64-1 jobs failing with no space left on device errors #2868

Open smlambert opened 1 year ago

smlambert commented 1 year ago
15:28:51  Caused by: hudson.plugins.git.GitException: Command "git init D:\jenkins\workspace\Test_openjdk11_hs_sanity.openjdk_x86-32_windows\aqa-tests" returned status code 128:
15:28:51  stdout: 
15:28:51  stderr: error: copy-fd: write returned: No space left on device
15:28:51  fatal: cannot copy '/usr/share/git-core/templates/hooks/commit-msg.sample' to '/cygdrive/d/jenkins/workspace/Test_openjdk11_hs_sanity.openjdk_x86-32_windows/aqa-tests/.git/hooks/commit-msg.sample': No space left on device
15:28:51  

Other details: I will mark the machine offline so that no other tests are sent to it and fail.

steelhead31 commented 1 year ago

This looks like the D: drive needs to be enlarged, rather than additional housekeeping: a single test run is using > 30GB. In the interim, I'll add a check in Nagios for this Windows machine to monitor free space on D:
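As a rough illustration of the kind of free-space check described above, here is a minimal Python sketch written in the style of a Nagios plugin. It is not the check that was actually deployed: the function names, the default `D:\` path, and the 20%/10% thresholds are all assumptions made for the example. The only convention relied on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_bytes, total_bytes, warn_pct=20.0, crit_pct=10.0):
    """Map free/total byte counts to (exit_code, message).

    The thresholds are illustrative assumptions, not values from the issue.
    """
    if total_bytes <= 0:
        return UNKNOWN, "UNKNOWN - total size reported as zero"
    free_pct = 100.0 * free_bytes / total_bytes
    free_gib = free_bytes / 1024 ** 3
    detail = f"{free_pct:.1f}% free ({free_gib:.1f} GiB)"
    if free_pct < crit_pct:
        return CRITICAL, f"CRITICAL - {detail}"
    if free_pct < warn_pct:
        return WARNING, f"WARNING - {detail}"
    return OK, f"OK - {detail}"


def check_path(path, warn_pct=20.0, crit_pct=10.0):
    """Check the filesystem that contains `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot read disk usage for {path}: {exc}"
    return classify_free_space(usage.free, usage.total, warn_pct, crit_pct)


def main(argv):
    """Entry point: argv[1] is the path to check (hypothetical default D:\\)."""
    path = argv[1] if len(argv) > 1 else "D:\\"
    code, message = check_path(path)
    print(message)
    return code
```

To use it as a plugin, one would finish the script with `import sys` and `sys.exit(main(sys.argv))`, so that the monitoring system sees the exit code. Percentage thresholds are used here only for simplicity; with a single run consuming more than 30GB, an absolute free-space threshold may be the better fit.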

sxa commented 1 year ago

@smlambert Is this size of the working set expected for the new dev suites? We should check that we're not just leaving lots of core files around or something like that (I'm not sure how much testing has been done with these suites on OpenJ9, so I can't tell whether this is a symptom of a real underlying problem)
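One way to answer that question on a live machine is to list the largest files under the job workspace and see whether dumps or other unexpected output dominate. The sketch below is a hypothetical helper, not something from the infrastructure repo; the function name and the idea of pointing it at the Jenkins workspace directory are assumptions for illustration.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return (total_bytes, [(size, path), ...]) for files under `root`.

    The list holds the `top_n` largest files, biggest first.
    """
    total = 0
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total += size
            sizes.append((size, full))
    return total, heapq.nlargest(top_n, sizes)
```

Calling `largest_files(r"D:\jenkins\workspace")` on the affected machine would give the total bytes in use and the biggest offenders, which is enough to tell a few huge dump files apart from a generally large working set.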

smlambert commented 1 year ago

New dev.system tests are loads of jcstress tests (and require some additional space during runs). dev.openjdk tests are container tests and would also have requirements, but do not run on Windows currently.

If core files were being produced, they should get packaged into the test artifact on the job, and I do not see any bloat there (12-20KB on average for system_test_output.tar.gz artifacts).

sxa commented 1 week ago

Machine decommissioned, and the jcstress tests have been disabled because they are resource hogs (a run can take up to 3 days) and are not particularly effective at identifying problems. It may still be useful to ensure that we can run them on our systems for developers who want to verify that GC changes don't cause any new problems.