adoptium / STF

The System Test Framework for executing https://github.com/adoptium/aqa-systemtest
Apache License 2.0
8 stars 35 forks source link

Tests fail with Could not terminate process PID=nnnn: Access is denied #93

Closed lumpfish closed 3 years ago

lumpfish commented 3 years ago

Tests fail with Could not terminate process PID=nnnn: Access is denied

https://github.com/eclipse/openj9/issues/11021 describes tests which fail because tests fail from time to time with Could not terminate process PID=nnnn: Access is denied..

At some point code was added to https://github.com/AdoptOpenJDK/stf/blob/master/stf.core/scripts/stf/Commands.pm to kill Windows processes using pskill. Later the code was modified to use Stop-Process rather than pskill if it is available (https://github.com/AdoptOpenJDK/stf/pull/11). However, in investigating https://github.com/eclipse/openj9/issues/11021 it looks like neither StopProcess nor pskill are necessary, and using them leads to the error reported here.

The stf::ProcessMgnt perl library contains functions to start, stop, kill and poll processes for both Windows and unix based operating systems. https://github.com/AdoptOpenJDK/stf/blob/master/stf.core/scripts/stf/ProcessMgmt/Windows.pm contains a terminate() function which uses the Win32 API directly to terminate a process. In the event that the process killed by the StopProcess code does not terminate within two seconds of the StopProcess command being issued, the terminate() function is called (via the code path which is common for both Windows and unix). It is the Win32::TerminateProcess function which outputs the Could not terminate process PID=nnnn: Access is denied message, not the Stop-Process function, presumably because the process is already being terminated and is no longer in a state which can be termianted.

This is a failing test with some additional debugging code added:

Case 1:

Case 2:

Case 3:

One way to fix this would be to wait longer for the process to be termianted as a result of the Stop-Process command. But if the Stop-Process / pskill code is removed so that the terminate() function in the ProcessMgmt module is used instead (which is after all the point of the ProcessMgmt library) then the kill_process function works as expected anyway. So that (removing the StopProcess / pskill code) would be the simpler and more elegant solution.

Mesbah-Alam commented 3 years ago

At some point code was added to https://github.com/AdoptOpenJDK/stf/blob/master/stf.core/scripts/stf/Commands.pm to kill Windows processes using pskill. Later the code was modified to use Stop-Process rather than pskill if it is available (#11).

What was the original reasons for adding pskill and Stop-Process? Did we not always have the terminate() function which uses the Win32 API directly in our disposal?

Just trying to understand if going back to using terminate() function would have any side effects.

lumpfish commented 3 years ago

Since we know why Stop-Process was added the question is really 'why was pskill added?'. I don't know the answer to that, except to surmise that maybe the writer of Commands.pm did not realise that the terminate() function was also implemented for Windows (Commands.pm was added after the ProcessMgmt library). I have created https://github.com/AdoptOpenJDK/stf/pull/94 with Stop-Process / pskill removed and I'm currently running sanity.system across all releases to test. @Mesbah-Alam - please feel free to run any local testing too. Remember the change only affects Windows.

lumpfish commented 3 years ago

Just trying to understand if going back to using terminate() function would have any side effects.

That's the motivation for running the sanity.system tests across all the releases. We do know that using Stop-Process has the side effect reported by https://github.com/eclipse/openj9/issues/11021.

lumpfish commented 3 years ago

Test runs of sanity,system using https://github.com/AdoptOpenJDK/stf/pull/94 (all processes killed successfully where necessary): sanity.system

jdk8 hotspot: win_x86-64: https://ci.adoptopenjdk.net/job/Grinder/5057/console win_x86-32: https://ci.adoptopenjdk.net/job/Grinder/5056/console openj9: win_x86-64: https://ci.adoptopenjdk.net/job/Grinder/5052/console win_x86-64_xl: https://ci.adoptopenjdk.net/job/Grinder/5053/console win_x86-32: https://ci.adoptopenjdk.net/job/Grinder/5054/console

jdk11 hotspot: win_x86-64: https://ci.adoptopenjdk.net/job/Grinder/5059/console win_x86-32: https://ci.adoptopenjdk.net/job/Grinder/5060/console openj9: win_x86-64: https://ci.adoptopenjdk.net/job/Grinder/5062/console win_x86-64_xl: https://ci.adoptopenjdk.net/job/Grinder/5063/console

jdk15 hotspot: win_x86-64: https://ci.adoptopenjdk.net/job/Grinder/5072/console win_x86-32: https://ci.adoptopenjdk.net/job/Grinder/5073/console openj9: win_x86-64: https://ci.adoptopenjdk.net/job/Grinder/5069/console win_x86-64_xl: https://ci.adoptopenjdk.net/job/Grinder/5070/console

lumpfish commented 3 years ago

This occurred on an AdoptOpenJDK build yesterday: https://ci.adoptopenjdk.net/job/Test_openjdk11_j9_sanity.system_x86-64_windows_xl/408/consoleFull, so it's affecting tests at AdoptOpenJDK as well as Eclipse/openj9:

14:59:04  STF 14:59:02.697 - Killing processes: LT1
14:59:05  STF 14:59:04.459 - Using powershell to kill LT1
14:59:05  STF 14:59:04.459 - Running command: cmd /c echo | powershell Stop-Process 40052
14:59:05  STF 14:59:04.459 - Redirecting stderr to C:/Users/jenkins/workspace/Test_openjdk11_j9_sanity.system_x86-64_windows_xl/openjdk-tests/TKG/test_output_1607350846523/TestJlmRemoteMemoryNoAuth_0/20201207-145214-TestJlmRemoteMemoryNoAuth/results\kill_powershell.stderr
14:59:05  STF 14:59:04.460 - Redirecting stdout to C:/Users/jenkins/workspace/Test_openjdk11_j9_sanity.system_x86-64_windows_xl/openjdk-tests/TKG/test_output_1607350846523/TestJlmRemoteMemoryNoAuth_0/20201207-145214-TestJlmRemoteMemoryNoAuth/results\kill_powershell.stdout
14:59:07  Could not terminate process PID=40052: Access is denied.

14:59:07  **FAILED** at step 4 (Stop LT1 process). Expected return value=0 Actual=1 at C:/Users/jenkins/workspace/Test_openjdk11_j9_sanity.system_x86-64_windows_xl/openjdk-tests/\TKG\test_output_1607350846523\TestJlmRemoteMemoryNoAuth_0/20201207-145214-TestJlmRemoteMemoryNoAuth/execute.pl line 177.
14:59:07  STF 14:59:06.945 - **FAILED** execute script failed. Expected return value=0 Actual=1

@Mesbah-Alam - are you OK for me to go ahead and merge the PR?

Mesbah-Alam commented 3 years ago

@lumpfish The PR was merged. Thank you!