aws-deadline / deadline-cloud-worker-agent

The AWS Deadline Cloud worker agent can be used to run a worker in an AWS Deadline Cloud fleet.
Apache License 2.0

Bug: NOTIFY_THEN_TERMINATE cancelation mode on Windows fails to send CTRL_BREAK_EVENT #490

Open jusiskin opened 1 day ago

jusiskin commented 1 day ago

Describe Behaviour

When a job is submitted that defines session actions using the NOTIFY_THEN_TERMINATE cancelation mode (OpenJD ref docs), canceling the job on a Windows worker fails to send the notify (graceful cancelation) OS signal (CTRL_BREAK_EVENT on Windows).

Expected Behaviour

The worker agent successfully sends the notify cancelation signal to session action subprocesses which can then perform graceful cancelation of the running workload.

Current Behaviour

One of the following two errors can be observed in the session log:

Handle invalid

2024/11/19 17:11:08-06:00 INTERRUPT: Sending CTRL_BREAK_EVENT to 4244
2024/11/19 17:11:08-06:00 Failed to send signal 'CTRL_BREAK_EVENT' to subprocess 4244: Traceback (most recent call last):
  File "C:\Program Files\Python311\Lib\site-packages\openjd\sessions\_scripts\_windows\_signal_win_subprocess.py", line 65, in <module>
    signal_process(int(sys.argv[1]))
  File "C:\Program Files\Python311\Lib\site-packages\openjd\sessions\_scripts\_windows\_signal_win_subprocess.py", line 46, in signal_process
    raise ctypes.WinError()
OSError: [WinError 6] The handle is invalid.

Access denied

2024/11/20 13:16:20-06:00 Canceling subprocess 1388 via notify then terminate method at 2024-11-20T19:16:20Z.
2024/11/20 13:16:20-06:00 Grace period ends at 2024-11-20T19:16:42Z
2024/11/20 13:16:20-06:00 INTERRUPT: Sending CTRL_BREAK_EVENT to 1388
2024/11/20 13:16:20-06:00 Failed to send signal 'CTRL_BREAK_EVENT' to subprocess 1388: Traceback (most recent call last):
  File "C:\Program Files\Python311\Lib\site-packages\openjd\sessions\_scripts\_windows\_signal_win_subprocess.py", line 65, in <module>
    signal_process(int(sys.argv[1]))
  File "C:\Program Files\Python311\Lib\site-packages\openjd\sessions\_scripts\_windows\_signal_win_subprocess.py", line 46, in signal_process
    raise ctypes.WinError()
PermissionError: [WinError 5] Access is denied.

In both cases, once the notify cancelation timeout is reached, the forceful cancelation (terminate) happens successfully:

2024/11/20 13:16:42-06:00 Notify period ended. Terminate at 2024-11-20T19:16:42Z
2024/11/20 13:16:42-06:00 INTERRUPT: Start killing the process tree with the root pid: 1388
2024/11/20 13:16:42-06:00 Killing process with id 992.
2024/11/20 13:16:42-06:00 Killing process with id 1388.
2024/11/20 13:16:42-06:00 Process pid 1388 exited with code: 15 (unsigned) / 0xf (hex)
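The notify-then-terminate sequence seen in these logs can be sketched roughly as follows. This is an illustration using the standard `subprocess` module, not the worker agent's or the OpenJD sessions library's implementation; `start_child` and `notify_then_terminate` are hypothetical names. One assumption worth stating: on Windows, CTRL_BREAK_EVENT can only be sent to a process group, so the sketch starts the child with `CREATE_NEW_PROCESS_GROUP`.

```python
import signal
import subprocess
import sys

IS_WINDOWS = sys.platform == "win32"


def start_child(argv):
    """Start a subprocess that can later receive the notify signal."""
    # CREATE_NEW_PROCESS_GROUP exists only on Windows; 0 is accepted elsewhere.
    flags = subprocess.CREATE_NEW_PROCESS_GROUP if IS_WINDOWS else 0
    return subprocess.Popen(argv, creationflags=flags)


def notify_then_terminate(proc, notify_period_seconds):
    """Send the notify signal, wait for the grace period, then kill.

    Returns "exited_within_notify_period" or "terminated".
    """
    notify = signal.CTRL_BREAK_EVENT if IS_WINDOWS else signal.SIGTERM
    try:
        proc.send_signal(notify)
    except OSError as exc:
        # Same shape as the failure in this report: the notify step fails,
        # and only the forceful step below takes effect.
        print(f"Failed to send notify signal to {proc.pid}: {exc}")

    try:
        proc.wait(timeout=notify_period_seconds)
        return "exited_within_notify_period"
    except subprocess.TimeoutExpired:
        # Simplification: this kills only the root process, whereas the
        # log above shows the whole process tree being killed.
        proc.kill()
        proc.wait()
        return "terminated"
```

In the logs above, the first step fails with WinError 5 or 6, so every cancelation ends in the forceful branch after the full notify period has elapsed.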

The impact is that if the job was written with graceful cancelation handling, that handling will not happen on Windows. This can cause undesirable resource leaks or leftover side-effects.

Reproduction Steps

  1. Create a Windows fleet and associate it with a queue
  2. Set up a Windows worker using the latest worker agent version
  3. Submit a job using the attached job template (win_long_sleep_paramspace_cancel.json)
  4. Wait for the job to be running on the worker
  5. Cancel the job
  6. Observe the session log for one of the above errors

Possible Solution

Unknown

Package Version

0.27.3

Language Version

3.11.10

Dependencies

No response

Operating System

Windows

Other information

No response