Closed Willsparker closed 3 years ago
@karianna was attempting to run a build on build-softlayer-win2012r2-x64-1
in which it was failing with the following error:
Suppressed: java.nio.file.AccessDeniedException: C:\workspace\openjdk-build\makejdk-any-platform.sh
Both Admin and Jenkins didn't have permissions to delete or access the workspace as it apparently had Unknown+User
as an owner:
Administrator@build-softlayer-win2012r2-x64-1 /cygdrive/c/workspace/openjdk-build
$ ls -la
total 16
drwxrwx---+ 1 jenkins None 0 Jun 16 04:43 .
drwxrwx---+ 1 jenkins None 0 Jun 10 07:02 ..
-rw-r----- 1 Unknown+User Unknown+Group 2137 Jun 10 07:02 makejdk-any-platform. sh
drwxrwxr-x+ 1 jenkins None 0 Jun 16 04:43 sbin
drwxr-x--- 1 Unknown+User Unknown+Group 0 Jun 16 04:44 workspace
The following processes were using those filesystems:
Once those processes were stopped, makejdk-any-platform.sh
and workspace
disappeared on their own, and the sbin
folder was able to be deleted.
Ref: https://github.com/AdoptOpenJDK/openjdk-build/pull/1892
If the above PR works, it should stop the ..the_
file issues.
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1410#issuecomment-648698797
^ This comment is a bit of a more in-depth look at what's stopping Jenkins from deleting a workspace. It appears to be due to a java process that is writing to a log file after a test that isn't being stopped properly by Jenkins. @M-Davies , I know you have a fair amount of experience with tests & Jenkins, any ideas? (sorry to put you on the spot) :-)
At first I thought this may have been due to an aborted run on the affected machine (which usually causes problems like this) but this didn't appear the case as no grinders and only two unsuccessful test runs have been executed on the machine recently.
20-Jun-2020
had one test failure that had some trouble cleaning itself up at various stages during the process:
00:26:18 Attempting to destroy all caches in cacheDir C:\Users\jenkins.TEST-WIN2012-2\AppData\Local\javasharedresources\
00:26:18
00:26:18 JVMSHRC241E Error: unable to delete shared class cache file
00:26:18 JVMSHRC336E Port layer error code = -100
00:26:18 JVMSHRC337E Platform error message: (32) The process cannot access the file because it is being used by another process.
00:26:18 JVMSHRC430I Failed to remove current generation of shared class cache "sharedcc_jenkins"
00:26:18
00:26:18 Attempting to destroy all caches in cacheDir C:\Users\jenkins.TEST-WIN2012-2\AppData\Local\javasharedresources\
00:26:18
00:26:18 JVMSHRC241E Error: unable to delete shared class cache file
00:26:18 JVMSHRC336E Port layer error code = -100
00:26:18 JVMSHRC337E Platform error message: (32) The process cannot access the file because it is being used by another process.
00:26:18 JVMSHRC430I Failed to remove current generation of shared class cache "sharedcc_jenkins"
00:26:18 cache cleanup done
00:26:18 jdk_lang_j9_0 Finish Time: Sun Jun 21 00:26:17 2020 Epoch Time (ms): 1592717177854
00:26:18
00:26:18 ===============================================
00:26:18 Running test jdk_math_j9_0 ...
00:26:18 ===============================================
00:26:18 jdk_math_j9_0 Start Time: Sun Jun 21 00:26:18 2020 Epoch Time (ms): 1592717178048
00:26:18 "C:/Users/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_x86-64_windows/openjdkbinary/j2sdk-image\\bin\\java" -Xshareclasses:destroyAll; "C:/Users/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_x86-64_windows/openjdkbinary/j2sdk-image\\bin\\java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
00:26:18
00:26:18 Attempting to destroy all caches in cacheDir C:\Users\jenkins.TEST-WIN2012-2\AppData\Local\javasharedresources\
00:26:18
00:26:18 JVMSHRC241E Error: unable to delete shared class cache file
00:26:18 JVMSHRC336E Port layer error code = -100
00:26:18 JVMSHRC337E Platform error message: (32) The process cannot access the file because it is being used by another process.
00:26:18 JVMSHRC430I Failed to remove current generation of shared class cache "sharedcc_jenkins"
00:26:18
00:26:18 Attempting to destroy all caches in cacheDir C:\Users\jenkins.TEST-WIN2012-2\AppData\Local\javasharedresources\
00:26:18
00:26:18 JVMSHRC241E Error: unable to delete shared class cache file
00:26:18 JVMSHRC336E Port layer error code = -100
00:26:18 JVMSHRC337E Platform error message: (32) The process cannot access the file because it is being used by another process.
00:26:18 JVMSHRC430I Failed to remove current generation of shared class cache "sharedcc_jenkins"
See the console log for more instances of this occurring and also at the end of the build, it fails to clean up (probably due to the above failures):
[WS-CLEANUP] Deleting project workspace...
[WS-CLEANUP] Deferred wipeout is disabled by the job configuration...
Cannot delete workspace: null
Option not to fail the build is turned on, so let's continue
22-Jun-2020
did not see this error and failed after pumping it's own javacore out. In addition, the cleanws step passed this time which implies that it had been cleaned up previously so it's possible I have missed something here. The troublesome java process that wasn't cleaned up was created in the first run as the file path in the test_ouput matches the one reported by @Willsparker in https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1410#issuecomment-648698797 and I also see the exact same final words in the javacore file of the first run as I do in #1410 .
The only thing that confuses me is why didn't the second run fail with the error reported in #1410 ? Why did it take another build before we saw this error?
A temp solution for this could be to ensure we always force cleanup the workspaces on the test machines PRIOR to running any tests but if the file cannot be deleted by Jenkins or Admin then I'm not sure how we would go about doing that.
Hope this helps! :)
A temp solution for this could be to ensure we always force cleanup the workspaces on the test machines PRIOR to running any tests but if the file cannot be deleted by Jenkins or Admin then I'm not sure how we would go about doing that.
So we do actually clean up prior to running tests but it's not forced and I don't think there is a way for cleanws()
to be forced
I think the way this needs to be resolved is to ensure all the Java processes have stopped after any given test run - no process, no locked file and cleanWS()
should be able to do it's job. My concern is that Jenkins really should be doing it by itself due to ProcessTreeKiller.
This is linked to https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/770 - The outcome of which has led to this Jenkins Job : https://ci.adoptopenjdk.net/job/SXA-processCheck/
The problem is that this job only runs once every 7 days, which is evidently not enough to catch the rogue processes. I suppose we could just cause every test job to trigger a processCheck
run once it's completed, whether it's finished or not.
ping @sxa
Ref: #1415 The referenced issue will be a good way of keeping a track of the files that aren't able to be cleaned-up automatically by Jenkins
Ref: https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk/job/jdk-windows-x64-openj9/69/console (from the issue @M-Davies opened above)
Affected build-softlayer-win2012r2-x64-1
, directory C:/workspace/openjdk-build
. I was unable to delete it manually with the Administrator
user, but was able to with the Jenkins
user, weirdly. Haven't seen that one before, as usually the permissions one is due to a leftover process still using the folder, which has to be stopped and then deleted.
A file is locked by a process, and often both Administrator and Jenkins can't delete them due to not having permissions. To rectify this, the file path can be searched in the Associated Handles section, under the CPU tab, in the resource monitor. Once the processes that are using those files are found, they can be right clicked and forced exited. The file can then be deleted by the Administrator user.
@Willsparker https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_j9_sanity.openjdk_x86-32_windows/277/console sees test-softlayer-win2012r2-x64-2
reproduce this error
@M-Davies Right you are- java.exe
process has been stopped & the folder has been manually deleted- thanks :-)
@sxa Can we make it so your SXA-processCheck
job kicks off for that machine after every test run?
@Willsparker test-azure-win2012r2-x64-1
has the same problem as java.exe not being stopped https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl/256/console
This is a new one - there's 10 java.exe
processes running... Also I like this output:
Also: java.nio.file.AccessDeniedException: **C:\Users\jenkins\workspace\Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl\openjdkbinary\j2sdk-image\bin\java.exe**
Highly useful 😁
I stopped a handful and then the rest stopped, and deleted C:\Users\jenkins\workspace\Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl\openjdkbinary\
manually 👍
Ref: https://github.com/AdoptOpenJDK/openjdk-build/pull/2076#issuecomment-691131667
Machine was: build-azure-win2012r2-x64-2
Initial thoughts is that the run was trying to delete a directory that was in use in a different run- however there's only one executor on the machine so that couldn't happen unless someone was running a build locally on the machine. When I got to it this morning, I was able to delete the folder without any issues, so whichever process was using those folders stopped over the weekend.
17:02:45 Compiling 2 files for BUILD_DEMO_APPLET_GraphicsTest
17:02:45 Compiling 1 files for BUILD_DEMO_APPLET_NervousText
17:02:45 Compiling 1 files for BUILD_DEMO_APPLET_SimpleGraph
17:02:45 Compiling 5 files for BUILD_DEMO_APPLET_SortDemo
17:02:45 Could not create temporary file name for at file!
17:02:45 make[2]: *** [CompileDemos.gmk:75: /cygdrive/e/jenkins/tmp/workspace/build/src/build/windows-x86_64-normal-server-release/jdk/demo/applets/SortDemo/_the.BUILD_DEMO_APPLET_SortDemo_batch] Error 127
17:02:45 make[2]: *** Waiting for unfinished jobs....
17:02:45 Compiling 1 files for BUILD_DEMO_APPLET_SpreadSheet
17:02:47 make[1]: *** [BuildJdk.gmk:93: demos] Error 2
17:02:47 make: *** [/cygdrive/e/jenkins/tmp/workspace/build/src//make/Main.gmk:133: demos-only] Error 2
Error comes from here: https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/f9b33244a76dfd671907872ad6800f3e26e6a4a7/common/src/fixpath.c#L254
I would suggest putting some extra debug into: https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/f9b33244a76dfd671907872ad6800f3e26e6a4a7/common/src/fixpath.c#L254 in a fork and building to recreate, we could do with the "rc" and the file "name"
We did some experimentation on a call earlier today and the atfile issue seems to have been caused by a lot of leftover atfile_
in the jenkins user's %TEMP%
directory - possibly from previous runs with the fixpath debug option enabled. Clearing these out on the machine (ibmcloud build machine 1) appears to have resolved that particular issue.
The changes we are implementing for https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1573 may well mean we can close this soon (Fingers crossed...)
Okay- with https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1573 & https://github.com/AdoptOpenJDK/openjdk-tests/pull/2059 , I haven't seen anymore leftover Java processes on Windows Boxes, issues (thank god). Closing :-)
Ref: https://github.com/AdoptOpenJDK/openjdk-build/issues/1855
These have been regularly cropping up in 2 forms:
A file such as the one referred to in the issue referenced above (a file called
..the_
) is unable to be deleted. When doing this manually through File Explorer, it errors saying the file can't be found and to verify it's location. It can be cleared off by removing it using the Cygwin terminal.A file is locked by a process, and often both
Administrator
andJenkins
can't delete them due to not having permissions. To rectify this, the file path can be searched in theAssociated Handles
section, under theCPU
tab, in theresource monitor
. Once the processes that are using those files are found, they can be right clicked and forced exited. The file can then be deleted by theAdministrator
user.This issue is going to act as a way of tracking those sorts of issues to determine if there's a pattern / cause and a way of fixing it so a cleanup isn't required.