adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0
85 stars 101 forks source link

Track corrupt workspaces on Windows Machines #1396

Closed Willsparker closed 3 years ago

Willsparker commented 4 years ago

Ref: https://github.com/AdoptOpenJDK/openjdk-build/issues/1855

These have been regularly cropping up in 2 forms:

This issue is going to act as a way of tracking those sorts of issues to determine if there's a pattern / cause and a way of fixing it so a cleanup isn't required.

Willsparker commented 4 years ago

@karianna was attempting to run a build on build-softlayer-win2012r2-x64-1 in which it was failing with the following error:

Suppressed: java.nio.file.AccessDeniedException: C:\workspace\openjdk-build\makejdk-any-platform.sh

Both Admin and Jenkins didn't have permissions to delete or access the workspace as it apparently had Unknown+User as an owner:

Administrator@build-softlayer-win2012r2-x64-1 /cygdrive/c/workspace/openjdk-build
$ ls -la
total 16
drwxrwx---+ 1 jenkins      None             0 Jun 16 04:43 .
drwxrwx---+ 1 jenkins      None             0 Jun 10 07:02 ..
-rw-r-----  1 Unknown+User Unknown+Group 2137 Jun 10 07:02 makejdk-any-platform.                                                             sh
drwxrwxr-x+ 1 jenkins      None             0 Jun 16 04:43 sbin
drwxr-x---  1 Unknown+User Unknown+Group    0 Jun 16 04:44 workspace

The following processes were using those filesystems:

Screenshot 2020-06-16 at 11 20 09

Once those processes were stopped, makejdk-any-platform.sh and workspace disappeared on their own, and the sbin folder was able to be deleted.

Willsparker commented 4 years ago

Ref: https://github.com/AdoptOpenJDK/openjdk-build/pull/1892 If the above PR works, it should stop the ..the_ file issues.

Willsparker commented 4 years ago

https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1410#issuecomment-648698797

^ This comment is a bit of a more in-depth look at what's stopping Jenkins from deleting a workspace. It appears to be due to a java process that is writing to a log file after a test that isn't being stopped properly by Jenkins. @M-Davies , I know you have a fair amount of experience with tests & Jenkins, any ideas? (sorry to put you on the spot) :-)

M-Davies commented 4 years ago

At first I thought this may have been due to an aborted run on the affected machine (which usually causes problems like this) but this didn't appear the case as no grinders and only two unsuccessful test runs have been executed on the machine recently.

The troublesome java process that wasn't cleaned up was created in the first run as the file path in the test_ouput matches the one reported by @Willsparker in https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1410#issuecomment-648698797 and I also see the exact same final words in the javacore file of the first run as I do in #1410 .

The only thing that confuses me is why didn't the second run fail with the error reported in #1410 ? Why did it take another build before we saw this error?

A temp solution for this could be to ensure we always force cleanup the workspaces on the test machines PRIOR to running any tests but if the file cannot be deleted by Jenkins or Admin then I'm not sure how we would go about doing that.

Hope this helps! :)

M-Davies commented 4 years ago

A temp solution for this could be to ensure we always force cleanup the workspaces on the test machines PRIOR to running any tests but if the file cannot be deleted by Jenkins or Admin then I'm not sure how we would go about doing that.

So we do actually clean up prior to running tests but it's not forced and I don't think there is a way for cleanws() to be forced

Willsparker commented 4 years ago

I think the way this needs to be resolved is to ensure all the Java processes have stopped after any given test run - no process, no locked file and cleanWS() should be able to do it's job. My concern is that Jenkins really should be doing it by itself due to ProcessTreeKiller. This is linked to https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/770 - The outcome of which has led to this Jenkins Job : https://ci.adoptopenjdk.net/job/SXA-processCheck/

The problem is that this job only runs once every 7 days, which is evidently not enough to catch the rogue processes. I suppose we could just cause every test job to trigger a processCheck run once it's completed, whether it's finished or not. ping @sxa

Willsparker commented 4 years ago

Ref: #1415 The referenced issue will be a good way of keeping a track of the files that aren't able to be cleaned-up automatically by Jenkins

Willsparker commented 3 years ago

Ref: https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk/job/jdk-windows-x64-openj9/69/console (from the issue @M-Davies opened above)

Affected build-softlayer-win2012r2-x64-1, directory C:/workspace/openjdk-build. I was unable to delete it manually with the Administrator user, but was able to with the Jenkins user, weirdly. Haven't seen that one before, as usually the permissions one is due to a leftover process still using the folder, which has to be stopped and then deleted.

M-Davies commented 3 years ago

A file is locked by a process, and often both Administrator and Jenkins can't delete them due to not having permissions. To rectify this, the file path can be searched in the Associated Handles section, under the CPU tab, in the resource monitor. Once the processes that are using those files are found, they can be right clicked and forced exited. The file can then be deleted by the Administrator user.

@Willsparker https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_j9_sanity.openjdk_x86-32_windows/277/console sees test-softlayer-win2012r2-x64-2 reproduce this error

Willsparker commented 3 years ago

@M-Davies Right you are- java.exe process has been stopped & the folder has been manually deleted- thanks :-)

@sxa Can we make it so your SXA-processCheck job kicks off for that machine after every test run?

M-Davies commented 3 years ago

@Willsparker test-azure-win2012r2-x64-1 has the same problem as java.exe not being stopped https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl/256/console

Willsparker commented 3 years ago

This is a new one - there's 10 java.exe processes running... Also I like this output:

Also:   java.nio.file.AccessDeniedException: **C:\Users\jenkins\workspace\Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl\openjdkbinary\j2sdk-image\bin\java.exe**

Highly useful 😁 I stopped a handful and then the rest stopped, and deleted C:\Users\jenkins\workspace\Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl\openjdkbinary\ manually 👍

Willsparker commented 3 years ago

ref: https://github.com/AdoptOpenJDK/TKG/issues/45

Willsparker commented 3 years ago

Ref: https://github.com/AdoptOpenJDK/openjdk-build/pull/2076#issuecomment-691131667 Machine was: build-azure-win2012r2-x64-2

Initial thoughts is that the run was trying to delete a directory that was in use in a different run- however there's only one executor on the machine so that couldn't happen unless someone was running a build locally on the machine. When I got to it this morning, I was able to delete the folder without any issues, so whichever process was using those folders stopped over the weekend.

andrew-m-leonard commented 3 years ago

https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-windows-x64-openj9/895/console

17:02:45  Compiling 2 files for BUILD_DEMO_APPLET_GraphicsTest
17:02:45  Compiling 1 files for BUILD_DEMO_APPLET_NervousText
17:02:45  Compiling 1 files for BUILD_DEMO_APPLET_SimpleGraph
17:02:45  Compiling 5 files for BUILD_DEMO_APPLET_SortDemo
17:02:45  Could not create temporary file name for at file!
17:02:45  make[2]: *** [CompileDemos.gmk:75: /cygdrive/e/jenkins/tmp/workspace/build/src/build/windows-x86_64-normal-server-release/jdk/demo/applets/SortDemo/_the.BUILD_DEMO_APPLET_SortDemo_batch] Error 127
17:02:45  make[2]: *** Waiting for unfinished jobs....
17:02:45  Compiling 1 files for BUILD_DEMO_APPLET_SpreadSheet
17:02:47  make[1]: *** [BuildJdk.gmk:93: demos] Error 2
17:02:47  make: *** [/cygdrive/e/jenkins/tmp/workspace/build/src//make/Main.gmk:133: demos-only] Error 2

Error comes from here: https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/f9b33244a76dfd671907872ad6800f3e26e6a4a7/common/src/fixpath.c#L254

andrew-m-leonard commented 3 years ago

I would suggest putting some extra debug into: https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/f9b33244a76dfd671907872ad6800f3e26e6a4a7/common/src/fixpath.c#L254 in a fork and building to recreate, we could do with the "rc" and the file "name"

sxa commented 3 years ago

We did some experimentation on a call earlier today and the atfile issue seems to have been caused by a lot of leftover atfile_ in the jenkins user's %TEMP% directory - possibly from previous runs with the fixpath debug option enabled. Clearing these out on the machine (ibmcloud build machine 1) appears to have resolved that particular issue.

The changes we are implementing for https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1573 may well mean we can close this soon (Fingers crossed...)

Willsparker commented 3 years ago

Okay- with https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1573 & https://github.com/AdoptOpenJDK/openjdk-tests/pull/2059 , I haven't seen anymore leftover Java processes on Windows Boxes, issues (thank god). Closing :-)