egnor commented 1 year ago

Expectation / Proposal

The expectation is that shell_run_command should be able to run any reasonable process.

Traceback / Example

To reproduce, use shell_run_command to invoke a simple script like this

#!/usr/bin/env python3
import sys
for i in range(10000):
    print("*" * 40, i, file=sys.stderr)

At least in our experience, the command (and thus the shell_run_command function) blocks forever, which is unfortunate and undesired. I believe the reason is this logic in shell_run_command: https://github.com/PrefectHQ/prefect-shell/blob/1d89814710c35cf5f12832d9188f1d8d7c59b048/prefect_shell/commands.py#L97

In summary, the logic is

run the command in a subprocess (using anyio.open_process)
read the process's stdout output (and log it) until EOF
wait for the process to terminate
if the process had an error return, read the process's stderr for the error message

Since stdout and stderr flow through pipes of finite size, that's a real problem. In that main read loop, nothing is reading the process's stderr pipe, which means after a little bit of stderr writing the process will block waiting for the pipe. Then shell_run_command will block waiting for the process to close its stdout, which will never happen, and the whole thing deadlocks.

(One thing I don't understand: anyio.open_process is called without stdout or stderr specified, so I'd expect them not to be redirected at all, in which case .stdout and .stderr should be None and the whole thing should always fail, but clearly it doesn't, so I guess somehow they are set up to be pipes?? Certainly when running on the system they seem to be pipes.)

A workaround is to use ShellOperation instead, which reads and logs both stdout and stderr concurrently.

[x] I would like to help contribute a pull request to resolve this (but I am completely unfamiliar with this codebase, so probably wouldn't be very good at it)

ahuang11 commented 1 year ago

Thanks for debugging and reporting this! I wanted to confirm whether ShellOperation works for you?

Also, I don't think I was able to reproduce using:

other.py

from prefect_shell import shell_run_command
from prefect import flow

@flow
def test_flow():
    shell_run_command("python test.py")

test_flow()

test.py

#!/usr/bin/env python3
import sys
for i in range(10000):
    print("*" * 40, i, file=sys.stderr)

python other.py

egnor commented 1 year ago

ShellOperation definitely works and so we moved to using it.

I'll see if I can make a contained repro case for you...! Might be some other key ingredient. (@sckott FYI)

ahuang11 commented 1 year ago

Thanks! I think we may be able to refactor the internals of shell_run_command to use ShellOperation block eventually.

sckott commented 1 year ago

Might be some other key ingredient.

Nothing I can think of. Thx for opening the issue @egnor

I think we may be able to refactor the internals of shell_run_command to use ShellOperation block eventually

This seems like a good solution

kl-thamm commented 1 year ago

I have had the same issue - however I am not sure if this has actually something to do with printing to stderr or if it is just because of printing or if it is a result of just a long running process itself.

I tried building a minimal example that also includes (the probably weird) way I set up the agent: minimal-example-prefect.zip

As in the previous responses ShellOperation seems to run better.

egnor commented 1 year ago

Thanks for debugging and reporting this! I wanted to confirm whether ShellOperation works for you? Also, I don't think I was able to reproduce using: ...

When I try your example @ahuang11, exactly as you write it, it does hang here:

% python3 other.py
21:56:18.489 | INFO    | prefect.engine - Created flow run 'valiant-cormorant' for flow 'test-flow'
21:56:18.493 | INFO    | Flow run 'valiant-cormorant' - View at https://app.prefect.cloud/account/33d360a7-efaa-46ce-98ac-22df4011d411/workspace/2edd22be-f53a-4d0e-93dc-66e772f636a1/flow-runs/flow-run/a51901a4-55a0-475d-ba79-3a41af44a22c
21:56:19.512 | INFO    | Flow run 'valiant-cormorant' - Created task run 'shell_run_command-0' for task 'shell_run_command'
21:56:19.515 | INFO    | Flow run 'valiant-cormorant' - Executing 'shell_run_command-0' immediately...

I'm not sure why it's not hanging for you @ahuang11 -- perhaps on your system, shell pipes have unbounded length? I'm running on linux (Ubuntu 22.04, Linux 5.19.0-43-generic).

PrefectHQ / prefect

Process under `shell_run_command` hangs if it generates much stderr #13081

Expectation / Proposal

Traceback / Example