PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

Worker dies because timeout is not respected #13060

Open MattDelac opened 1 year ago

MattDelac commented 1 year ago

Expectation / Proposal

Original conversation: The worker dies because some tasks run longer than the configured timeout.

Traceback / Example

RuntimeError: Timed out after 602.8259875774384s while waiting for Cloud Run Job execution to complete. Your job may still be running on GCP.
An error occured while monitoring flow run 'cdbc0be6-c964-45b5-ba1c-fce2d4e36f17'. The flow run will not be marked as failed, but an issue may have occurred.

This is a separate issue; please open a question in the prefect-gcp repository if you want to discuss it further. It looks like your flow is running longer than the default timeout. See that piece of code.
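
For reference, the ~600 seconds in the traceback lines up with the default `timeout` field on the `CloudRunJob` infrastructure block in prefect-gcp. A minimal sketch of raising it, assuming a block saved under the placeholder name `my-cloud-run-job`:

```python
# Sketch: raise the Cloud Run execution timeout on an existing block.
# "my-cloud-run-job" is a placeholder block name; substitute the real one.
from prefect_gcp.cloud_run import CloudRunJob

job = CloudRunJob.load("my-cloud-run-job")
job.timeout = 3600  # seconds to wait for the Cloud Run Job execution (default is 600)
job.save("my-cloud-run-job", overwrite=True)
```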

desertaxle commented 1 year ago

Thanks for submitting an issue @MattDelac! Do you have an example setup that we can use to reproduce this issue? In particular, sharing how your work pool is configured and the command that you use to start your worker would be helpful.

MattDelac commented 1 year ago

The work pool is just a Prefect agent

[screenshot: work pool configuration]

And this is my startup script used on a Compute Engine VM:

```bash
# Install Python and pip
apt-get update -qy
apt-get install -y python3 python3-pip

# Install the Google Cloud Ops Agent for VM monitoring and logging
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install

# Install Prefect and the GCP collection
python3 -m pip install --upgrade pip wheel
pip install "prefect==2.10.*" "prefect-gcp"

# Log in to Prefect Cloud and start the agent
prefect cloud login --key ${prefect_auth_key} --workspace mdelacourmedelysfr/medelys
PREFECT_API_ENABLE_HTTP2=false PREFECT_LOGGING_LEVEL=DEBUG prefect agent start --pool default-agent-pool --work-queue medelys-default
```

zanieb commented 1 year ago

The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

MattDelac commented 1 year ago

> The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

As I posted here https://github.com/PrefectHQ/prefect/issues/7442#issuecomment-1533578629, the worker is "waiting for Cloud", but Cloud says that the worker is unhealthy 🤷

MattDelac commented 1 year ago

> The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

And you're right, the worker might not die per se, but Cloud thinks it became unhealthy for reasons I cannot figure out.

desertaxle commented 1 year ago

Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

MattDelac commented 1 year ago

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

[screenshot: CloudRun infrastructure block configuration]

Flows are piling up and getting marked as "late".

MattDelac commented 1 year ago

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

[screenshot: CloudRun infrastructure block configuration, continued]

MattDelac commented 1 year ago

And looking at the logs, the agent just waits

[screenshot: agent logs]

And neither restarting nor recreating the VM fixes it (or it fixes it maybe once every 10 times 🤷).

MattDelac commented 1 year ago

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

@desertaxle Is there a way to share a JSON config or something nicer and more verbose?
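
One possibility might be to load the block in Python and print its JSON representation; a rough sketch, assuming the block is saved under the placeholder name `my-cloud-run-job` and leaving out credentials:

```python
# Sketch: dump an existing CloudRunJob block as JSON so it can be shared.
# "my-cloud-run-job" is a placeholder block name.
from prefect_gcp.cloud_run import CloudRunJob

block = CloudRunJob.load("my-cloud-run-job")
print(block.json(exclude={"credentials"}, indent=2))  # exclude secrets before sharing
```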

MattDelac commented 1 year ago

OK @desertaxle, the problem is not that the agent dies, but here is the behavior I see:

So yeah, the real fix here is to ensure that the timeout is respected, and maybe to have Prefect Cloud check once an hour, for example, whether the jobs are still running. That might help Prefect Cloud clean up its internal state of "running jobs".
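
One partial workaround might be to put an explicit timeout on the flow itself so Prefect fails runs that exceed it instead of leaving them in a running state; a sketch, with an arbitrary 2-hour value:

```python
# Sketch: enforce a flow-level timeout so Prefect fails runs that exceed it.
# The 2-hour value is only an example.
from prefect import flow

@flow(timeout_seconds=2 * 60 * 60)
def my_flow():
    ...  # flow logic goes here
```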

MattDelac commented 1 year ago

Also, Prefect Cloud cannot properly keep track of the jobs... This is really weird.

[screenshots: Prefect Cloud showing flow runs still in a Running state]

I don't have any jobs running when I check in GCP; only Prefect Cloud thinks that the jobs are still running.
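
A possible way to clean those up programmatically (a sketch, not something confirmed in this thread; the flow run ID is the one from the traceback above, used only as an example) is to force the run out of the Running state with the Prefect client:

```python
# Sketch: force a stuck flow run out of the Running state via the Prefect client.
# The UUID is the flow run from the traceback above; replace it with the stuck run's ID.
import asyncio
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.states import Crashed

async def mark_crashed(flow_run_id: UUID) -> None:
    async with get_client() as client:
        await client.set_flow_run_state(
            flow_run_id=flow_run_id,
            state=Crashed(message="Cloud Run execution no longer exists"),
            force=True,
        )

asyncio.run(mark_crashed(UUID("cdbc0be6-c964-45b5-ba1c-fce2d4e36f17")))
```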