PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

Agent failed to start during "Checking for cancelled flow runs" #8960

Closed krasoffski closed 1 year ago

krasoffski commented 1 year ago

Bug summary

Occasionally, the Prefect 2 agent fails to start at this step:

11:19:31.410 | DEBUG   | prefect.client - Connecting to API at http://localhost:4200/api/

  ___ ___ ___ ___ ___ ___ _____     _   ___ ___ _  _ _____
 | _ \ _ \ __| __| __/ __|_   _|   /_\ / __| __| \| |_   _|
 |  _/   / _|| _|| _| (__  | |    / _ \ (_ | _|| .` | | |
 |_| |_|_\___|_| |___\___| |_|   /_/ \_\___|___|_|\_| |_|

Agent started! Looking for work from queue(s): debug...
11:19:31.412 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
11:19:31.412 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
Traceback (most recent call last):
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/cli/_utilities.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 230, in coroutine_wrapper
    return run_async_in_new_loop(async_fn, *args, **kwargs)
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 181, in run_async_in_new_loop
    return anyio.run(partial(__fn, *args, **kwargs))
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/anyio/_core/_eventloop.py", line 70, in run
    return asynclib.run(func, *args, **backend_options)
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 292, in run
    return native_run(wrapper(), debug=debug)
  File "/home/yk/.pyenv/versions/3.10.8/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/yk/.pyenv/versions/3.10.8/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 287, in wrapper
    return await func(*args)
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/cli/agent.py", line 189, in start
    async with anyio.create_task_group() as tg:
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/utilities/services.py", line 46, in critical_service_loop
    await workload()
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/agent.py", line 276, in check_for_cancelled_flow_runs
    typed_cancelling_flow_runs = await self.client.read_flow_runs(
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/client/orchestration.py", line 1689, in read_flow_runs
    response = await self._client.post(f"/flow_runs/filter", json=body)
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/httpx/_client.py", line 1848, in post
    return await self.request(
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/httpx/_client.py", line 1533, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/client/base.py", line 253, in send
    response.raise_for_status()
  File "/home/yk/.pyenv/versions/3.10.8/envs/coordinator/lib/python3.10/site-packages/prefect/client/base.py", line 130, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Client error '422 Unprocessable Entity' for url 'http://localhost:4200/api/flow_runs/filter'
Response: {'exception_message': 'Invalid request received.', 'exception_detail': [{'loc': ['body', 'flow_runs', 'state', 'type', 'any_', 0], 'msg': "value is not a valid enumeration member; permitted: 'SCHEDULED', 'PENDING', 'RUNNING', 'COMPLETED', 'FAILED', 'CANCELLED', 'CRASHED', 'PAUSED'", 'type': 'type_error.enum', 'ctx': {'enum_values': ['SCHEDULED', 'PENDING', 'RUNNING', 'COMPLETED', 'FAILED', 'CANCELLED', 'CRASHED', 'PAUSED']}}], 'request_body': {'flows': None, 'flow_runs': {'operator': 'and_', 'id': {'any_': None, 'not_any_': []}, 'name': None, 'tags': None, 'deployment_id': None, 'work_queue_name': {'operator': 'and_', 'any_': [], 'is_null_': None}, 'state': {'operator': 'and_', 'type': {'any_': ['CANCELLING']}, 'name': None}, 'flow_version': None, 'start_time': None, 'expected_start_time': None, 'next_scheduled_start_time': None, 'parent_task_run_id': None}, 'task_runs': None, 'deployments': None, 'work_pools': None, 'work_pool_queues': None, 'sort': None, 'limit': None, 'offset': 0}}
For more information check: https://httpstatuses.com/422
An exception occurred.

Reproduction

The failure occurs entirely in agent-to-server communication; no user interaction is involved.

Error

The server rejects the request sent by the Prefect agent:

{'exception_message': 'Invalid request received.',
 'exception_detail': [{'loc': ['body',
    'flow_runs',
    'state',
    'type',
    'any_',
    0],
   'msg': "value is not a valid enumeration member; permitted: 'SCHEDULED', 'PENDING', 'RUNNING', 'COMPLETED', 'FAILED', 'CANCELLED', 'CRASHED', 'PAUSED'",
   'type': 'type_error.enum',
   'ctx': {'enum_values': ['SCHEDULED',
     'PENDING',
     'RUNNING',
     'COMPLETED',
     'FAILED',
     'CANCELLED',
     'CRASHED',
     'PAUSED']}}],
 'request_body': {'flows': None,
  'flow_runs': {'operator': 'and_',
   'id': {'any_': None, 'not_any_': []},
   'name': None,
   'tags': None,
   'deployment_id': None,
   'work_queue_name': {'operator': 'and_', 'any_': [], 'is_null_': None},
   'state': {'operator': 'and_',
    'type': {'any_': ['CANCELLING']},
    'name': None},
   'flow_version': None,
   'start_time': None,
   'expected_start_time': None,
   'next_scheduled_start_time': None,
   'parent_task_run_id': None},
  'task_runs': None,
  'deployments': None,
  'work_pools': None,
  'work_pool_queues': None,
  'sort': None,
  'limit': None,
  'offset': 0}}
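
To read a response like this more quickly, one can follow each `loc` path into the echoed `request_body` and print the value the server rejected. The snippet below is only a sketch: `rejected_values` is a hypothetical helper (not part of Prefect), and it assumes the response has the shape shown above, with the first `loc` element naming the request body. The sample dictionary is trimmed from the error above.

```python
from typing import Any


def rejected_values(error_body: dict[str, Any]) -> list[tuple[str, Any]]:
    """Return (dotted path, offending value) for each validation error.

    Hypothetical helper: assumes the 422 payload has 'exception_detail'
    entries with a 'loc' path and an echoed 'request_body', as shown above.
    """
    request_body = error_body.get("request_body", {})
    found = []
    for detail in error_body.get("exception_detail", []):
        loc = detail.get("loc", [])
        # The first element ("body") names the request body itself,
        # so the walk starts from the second element.
        node: Any = request_body
        try:
            for key in loc[1:]:
                node = node[key]
        except (KeyError, IndexError, TypeError):
            node = None  # the path could not be resolved in the echoed body
        found.append((".".join(str(k) for k in loc[1:]), node))
    return found


# Trimmed copy of the response from this report.
error_body = {
    "exception_message": "Invalid request received.",
    "exception_detail": [
        {
            "loc": ["body", "flow_runs", "state", "type", "any_", 0],
            "type": "type_error.enum",
        }
    ],
    "request_body": {
        "flow_runs": {"state": {"type": {"any_": ["CANCELLING"]}, "name": None}}
    },
}

for path, value in rejected_values(error_body):
    print(f"{path}: {value!r}")
```

For this report, the walk ends at the `'CANCELLING'` entry in the state type filter, which is the value the server's enum does not list.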

Versions

Server version

Version:             2.7.9
API version:         0.8.4
Python version:      3.10.9
Git commit:          42b80f18
Built:               Thu, Jan 19, 2023 4:59 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.34.1

Agent version

Version:             2.8.5
API version:         0.8.4
Python version:      3.10.10
Git commit:          81a67202
Built:               Thu, Mar 9, 2023 4:27 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         server

Additional context

Although the server and agent report the same API version, the underlying layers are incompatible.

zanieb commented 1 year ago

Hey @krasoffski, the agent is asking for "CANCELLING" flow runs, which is a new state type that the server does not know about and does not support. We do not recommend running clients at a newer version than the server.

See https://docs.prefect.io/contributing/versioning/#client-compatibility-with-prefect for more details.

We should have bumped the internal "API" version when the change was made, but we're not using that in a meaningful way right now.
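
Since the reported API version is identical on both sides, a deployment-side guard would have to compare the package versions instead. The following is a minimal sketch, assuming plain numeric release versions such as the `2.8.5` and `2.7.9` from this report; `parse_version` and `client_is_newer` are hypothetical helpers, not Prefect functions, and pre-release or local version suffixes are not handled.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain release version like '2.8.5' into (2, 8, 5).

    Assumption: only dot-separated integers; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.strip().split("."))


def client_is_newer(client_version: str, server_version: str) -> bool:
    """True when the client (agent) version is newer than the server version."""
    return parse_version(client_version) > parse_version(server_version)


# Versions taken from the "Versions" section of this report.
agent_version = "2.8.5"
server_version = "2.7.9"

if client_is_newer(agent_version, server_version):
    print(
        f"Agent {agent_version} is newer than server {server_version}; "
        "the agent may send requests the server does not understand."
    )
```

Comparing integer tuples rather than strings keeps versions such as `2.10.0` and `2.9.0` in the right order.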

krasoffski commented 1 year ago

Hi @madkinsz, thank you for the reply.

I also checked the source code and came to the same conclusion.

Generally speaking, we don't run different versions of the agent and server, for obvious reasons. But we accidentally did not pin the exact Prefect version in our requirements, only the release version, and as a result we got this behavior after a redeploy.
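
A small pre-deploy check can catch a requirement that is not pinned to one exact version. This is only a sketch: `is_exactly_pinned` is a hypothetical helper, and it assumes simple `name==X.Y.Z` requirement lines, so environment markers, URLs, and hashes are out of scope.

```python
import re

# Matches "name==1.2.3" (optionally with extras like "name[extra]==1.2.3").
# Wildcards such as "==2.*" and range operators such as ">=" or "~=" do not match.
_EXACT_PIN = re.compile(
    r"^\s*[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[0-9]+(\.[0-9]+)*\s*$"
)


def is_exactly_pinned(requirement: str) -> bool:
    """Return True if a requirements line pins one exact numeric version."""
    line = requirement.split("#", 1)[0]  # drop any trailing comment
    return bool(_EXACT_PIN.match(line))


for line in ["prefect==2.7.9", "prefect>=2.7", "prefect==2.*"]:
    status = "pinned" if is_exactly_pinned(line) else "NOT pinned"
    print(f"{line}: {status}")
```

With the agent and server built from the same exactly pinned line, a redeploy cannot pull in a newer client on its own.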

My only piece of advice would be to hide the API version from the output, since it doesn't behave as expected and might confuse someone else.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it.