MattDelac opened this issue 1 year ago
Thanks for submitting an issue @MattDelac! Do you have an example setup that we can use to reproduce this issue? In particular, sharing how your work pool is configured and the command that you use to start your worker would be helpful.
The work pool is just a Prefect agent
And this is the startup script I use on a Compute Engine VM:
apt-get update -qy
apt-get install -y python3 python3-pip
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install
python3 -m pip install --upgrade pip wheel
pip install "prefect==2.10.*" "prefect-gcp"
prefect cloud login --key ${prefect_auth_key} --workspace mdelacourmedelysfr/medelys
PREFECT_API_ENABLE_HTTP2=false PREFECT_LOGGING_LEVEL=DEBUG prefect agent start --pool default-agent-pool --work-queue medelys-default
The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?
As I posted here https://github.com/PrefectHQ/prefect/issues/7442#issuecomment-1533578629, the worker is just waiting on Cloud, but Cloud says that the worker is unhealthy 🤷
And you're right, the worker might not die per se, but Cloud thinks it became unhealthy for reasons I cannot figure out.
Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?
Flows are piling up and getting marked as "late".
And looking at the logs, the agent just waits
And neither restarting nor recreating the VM fixes it (or it fixes it maybe once every 10 times 🤷).
@desertaxle Is there a way to share a JSON config or something nicer and more verbose?
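For illustration, one possible way to share the block configuration is to dump it as JSON with the prefect-gcp Python API. This is only a sketch: the block name "my-cloud-run-block" is a placeholder, and you may want to exclude additional fields before posting.
# Sketch: load the saved CloudRunJob infrastructure block and print it as JSON
# so the configuration can be shared in the issue. The block name is a placeholder.
from prefect_gcp.cloud_run import CloudRunJob

block = CloudRunJob.load("my-cloud-run-block")
# Exclude the credentials field so no service account key ends up in the issue.
print(block.json(indent=2, exclude={"credentials"}))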
OK @desertaxle, the problem is not that the agent dies, but here is the behavior I see:
So yeah, the real fix here is to ensure that the timeout is respected, and maybe to have Prefect Cloud check whether the jobs are still running once an hour, for example. That would help Prefect Cloud clean up its internal state of "running jobs".
Also, Prefect Cloud cannot keep proper track of the jobs... This is really weird.
I don't have any jobs running when I check in GCP; only Prefect Cloud thinks that the jobs are still running.
Expectation / Proposal
Original conversation
The worker dies because some tasks run longer than the configured timeout.
Traceback / Example
This is a separate issue; please open a question in the prefect-gcp repository if you want to discuss it further. It looks like your flow is running longer than the default timeout. See that piece of code.
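If the default timeout is indeed the culprit, a minimal sketch of raising it on the block might look like the following, assuming a prefect-gcp CloudRunJob block named "my-cloud-run-block" (a placeholder) and a timeout expressed in seconds.
# Sketch: raise the Cloud Run job timeout so flows that run longer than the
# default are not killed mid-run. The block name is a placeholder.
from prefect_gcp.cloud_run import CloudRunJob

job = CloudRunJob.load("my-cloud-run-block")
job.timeout = 3600  # Cloud Run caps this value at 3600 seconds
job.save("my-cloud-run-block", overwrite=True)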