When a job is submitted that defines session actions using the NOTIFY_THEN_TERMINATE cancelation mode (OpenJD ref docs), canceling the job on a Windows worker fails to send the notify (graceful cancelation) OS signal (CTRL_BREAK_EVENT on Windows).
Expected Behaviour
The worker agent successfully sends the notify cancelation signal to session action subprocesses, which can then perform graceful cancelation of the running workload.
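For context, a workload that relies on the notify signal usually installs a handler for it. The following is a minimal sketch, not code from the affected job: the names `cancel_requested`, `install_cancel_handlers`, and `run_workload` are hypothetical, and it assumes the workload is a Python script. On Windows, Python surfaces CTRL_BREAK_EVENT to the script as `signal.SIGBREAK`; the sketch also registers `SIGTERM` so it behaves the same on POSIX workers.

```python
import signal
import threading

# Set once a cancelation signal arrives; the work loop polls it.
cancel_requested = threading.Event()


def _on_cancel(signum, frame):
    # Keep the handler tiny: only record that cancelation was requested.
    cancel_requested.set()


def install_cancel_handlers():
    """Register the handler for each notify-style signal this platform has."""
    installed = []
    # SIGBREAK exists only on Windows; SIGTERM covers POSIX workers.
    for name in ("SIGBREAK", "SIGTERM"):
        sig = getattr(signal, name, None)
        if sig is not None:
            signal.signal(sig, _on_cancel)
            installed.append(name)
    return installed


def run_workload(max_steps=100, step_seconds=0.1):
    """Do work in small steps and stop early once cancelation is requested."""
    steps_done = 0
    while steps_done < max_steps and not cancel_requested.wait(step_seconds):
        steps_done += 1  # placeholder for one unit of real work
    # Cleanup (temp files, licenses, connections) would go here.
    return steps_done
```

With this bug, the handler above would never run on a Windows worker, because the notify signal does not reach the subprocess.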
Current Behaviour
The following two possible errors can be observed in the session log:
Handle invalid
2024/11/19 17:11:08-06:00 INTERRUPT: Sending CTRL_BREAK_EVENT to 4244
2024/11/19 17:11:08-06:00 Failed to send signal 'CTRL_BREAK_EVENT' to subprocess 4244: Traceback (most recent call last):
  File "C:\Program Files\Python311\Lib\site-packages\openjd\sessions\_scripts\_windows\_signal_win_subprocess.py", line 65, in <module>
    signal_process(int(sys.argv[1]))
  File "C:\Program Files\Python311\Lib\site-packages\openjd\sessions\_scripts\_windows\_signal_win_subprocess.py", line 46, in signal_process
    raise ctypes.WinError()
OSError: [WinError 6] The handle is invalid.
Access denied
2024/11/20 13:16:20-06:00 Canceling subprocess 1388 via notify then terminate method at 2024-11-20T19:16:20Z.
2024/11/20 13:16:20-06:00 Grace period ends at 2024-11-20T19:16:42Z
2024/11/20 13:16:20-06:00 INTERRUPT: Sending CTRL_BREAK_EVENT to 1388
2024/11/20 13:16:20-06:00 Failed to send signal 'CTRL_BREAK_EVENT' to subprocess 1388: Traceback (most recent call last):
  File "C:\Program Files\Python311\Lib\site-packages\openjd\sessions\_scripts\_windows\_signal_win_subprocess.py", line 65, in <module>
    signal_process(int(sys.argv[1]))
  File "C:\Program Files\Python311\Lib\site-packages\openjd\sessions\_scripts\_windows\_signal_win_subprocess.py", line 46, in signal_process
    raise ctypes.WinError()
PermissionError: [WinError 5] Access is denied.
In both cases, once the notify cancelation timeout is reached, the forceful cancelation (terminate) happens successfully:
2024/11/20 13:16:42-06:00 Notify period ended. Terminate at 2024-11-20T19:16:42Z
2024/11/20 13:16:42-06:00 INTERRUPT: Start killing the process tree with the root pid: 1388
2024/11/20 13:16:42-06:00 Killing process with id 992.
2024/11/20 13:16:42-06:00 Killing process with id 1388.
2024/11/20 13:16:42-06:00 Process pid 1388 exited with code: 15 (unsigned) / 0xf (hex)
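The sequence visible in these logs (notify, wait out the grace period, then terminate) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the worker agent's or openjd-sessions' implementation: `spawn` and `notify_then_terminate` are hypothetical helpers, and the Windows branch assumes the child was started with `CREATE_NEW_PROCESS_GROUP` and shares a console with the sender, which is what `Popen.send_signal(signal.CTRL_BREAK_EVENT)` relies on.

```python
import signal
import subprocess
import sys


def spawn(cmd):
    """Start a child so that it can later receive the notify signal."""
    flags = 0
    if sys.platform == "win32":
        # Needed for CTRL_BREAK_EVENT to be deliverable to the child.
        flags = subprocess.CREATE_NEW_PROCESS_GROUP
    return subprocess.Popen(cmd, creationflags=flags)


def notify_then_terminate(proc, grace_seconds):
    """Send the graceful signal, then force-kill if the grace period expires."""
    notify = signal.CTRL_BREAK_EVENT if sys.platform == "win32" else signal.SIGTERM
    notified = True
    try:
        proc.send_signal(notify)
    except OSError as exc:
        # Mirrors the failure mode in this report: the notify step fails,
        # but the forceful step below still runs.
        notified = False
        print(f"notify failed: {exc}", file=sys.stderr)
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return notified, proc.returncode
```

When the notify step fails, as in the logs above, only the forceful path runs, so the child never gets a chance to clean up.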
The impact is that if the job was written with graceful cancelation handling, that handling will not happen on Windows. This can cause undesirable resource leaks or leftover side-effects.
Reproduction Steps
Create a Windows fleet and associate it with a queue
Set up a Windows worker using the latest worker agent version
win_long_sleep_paramspace_cancel.json
Possible Solution
Unknown
Package Version
0.27.3
Language Version
3.11.10
Dependencies
No response
Operating System
Windows
Other information
No response