paulinjo opened this issue 1 year ago
@paulinjo we need an MRE in order to investigate this
@madkinsz This is happening in our production environment, in previously well-behaved flows, beginning on roughly 04/26.
Coming up with an MRE is going to be a big challenge given the number of moving pieces, but support told us to open a GitHub issue for further investigation.
@paulinjo Did you change Prefect versions? Can you share the full output of `prefect version` from a production environment?
Agent: 2.10.6
Flow pod: 2.10.6
Server is Prefect cloud
@paulinjo did you change Prefect versions when this started occurring?
When this started occurring we were using 2.9.x for the agent and flows. We have since updated to 2.10.6 to see if that was the cause, but there was no change.
I'm also experiencing this error after upgrading from Prefect 2.10.5 to 2.10.8, with Python 3.8 and Agents deploying Kubernetes jobs. It happens infrequently, but whenever it does, the job appears to be scheduled and run twice (all logs appear twice, including `Downloading flow code storage at ''`), with one copy completing successfully and the other failing to start a subflow run with the above stack trace. All tasks and subflows are otherwise marked as successful in the UI and complete successfully.
Here's a general approximation of what my flows/subflows look like, but I unfortunately can't reproduce the issue on my local machine.
```python
import time

from prefect import flow, task, get_run_logger


@task
def task0(inpt):
    for inp in inpt:
        get_run_logger().info(inp)
        time.sleep(1)
    return inpt


@task
def task1(inp):
    get_run_logger().info(inp)
    time.sleep(1)
    return True


@flow
def subflow1(inpt):
    future = task0.submit(inpt)
    future.wait()
    future = task2.submit(future.result())
    return future.result()


@task
def task2(val):
    get_run_logger().info(val)
    return 0


@flow
def subflow2(val):
    future = task2.submit(val)
    future.wait()
    return future.result()


@flow
def main():
    get_run_logger().info("Hello World")
    future = task0.submit(list(range(0, 10)))
    future.wait()
    subflow1(future)
    subflow2(0)
    return future.result()


if __name__ == "__main__":
    main()
```
It's possible this is related to a Cloud bug where the flow run was placed in a late state after it started running. We've released a fix for that now. If you can share some affected flow run and workspace ids, we can check if that was the cause for you.
Workspace: e5803b62-2267-4084-bd2a-ca0213518464
Flow Runs:
7e2d0415-52b6-4844-a323-8a496c9635a5
a7ad7d43-7edb-42e5-95c4-eb40ede24748
451fac39-8291-4488-b1fa-631a0c2492a4
@madkinsz My jobs are running on self-hosted infrastructure. Will the fix also be applied to non-Cloud setups?
@majikman111 We have not seen cases of the described bug in the OSS server, but yes, we are porting the fix there anyway.
@paulinjo I've confirmed that you are not affected by the described bug; this seems like something else. It looks like the parent is failing because it tries to retrieve the result of a subflow run that previously completed, but the result was not persisted.
@madkinsz Is the implication that there's something we need to update to fix this? The flows and underlying infrastructure have not changed in months, aside from Prefect and other dependency updates, and several other flows are running without issue.
After some digging, I found that this was caused by a bad configuration on our side. Both our Docker container entrypoint and our Kubernetes job block were making a call to `python -m prefect.engine`, causing every flow to be executed twice in a container.
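For anyone hitting the same double-execution symptom, here is a minimal sketch of how such a misconfiguration can play out (a hypothetical entrypoint script, not the actual files from this setup): an entrypoint that starts the engine itself and then also execs the command supplied by the Kubernetes job block will run the flow twice.

```sh
#!/bin/sh
# Hypothetical container entrypoint illustrating the double run.
# The script starts the engine directly...
python -m prefect.engine

# ...and then also execs the command passed in by the Kubernetes job
# block, which is again "python -m prefect.engine":
exec "$@"

# Fix: invoke the engine in exactly one place - either here or in the
# job block's command, never both.
```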
This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.
Hi @paulinjo thank you for the update - is it safe to close this issue now?
We are experiencing the same issue with this (old) version in production:
```
$ prefect version
Version:             2.9.0
API version:         0.8.4
Python version:      3.10.8
Git commit:          69f57bd5
Built:               Thu, Mar 30, 2023 1:08 PM
OS/Arch:             linux/x86_64
Profile:             objectStorage
Server type:         server
```
We are still running 2.9.x, and we regularly see this kind of stack trace after clicking "Retry" in the UI. As a result, we cannot continue our flow and are completely stuck.
Full trace from the logs in the UI:
```
Encountered exception during execution:
Traceback (most recent call last):
  File "/prefect/lib/python3.10/site-packages/prefect/engine.py", line 674, in orchestrate_flow_run
    result = await flow_call.aresult()
  File "/prefect/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 181, in aresult
    return await asyncio.wrap_future(self.future)
  File "/prefect/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 194, in _run_sync
    result = self.fn(*self.args, **self.kwargs)
  File "/tmp/tmp4bdjm1a2prefect/reboot.py", line 34, in reboot_flow
    set_maintenance(server_id)
  File "/prefect/lib/python3.10/site-packages/prefect/tasks.py", line 485, in __call__
    return enter_task_run_engine(
  File "/prefect/lib/python3.10/site-packages/prefect/engine.py", line 977, in enter_task_run_engine
    return from_sync.wait_for_call_in_loop_thread(begin_run)
  File "/prefect/lib/python3.10/site-packages/prefect/_internal/concurrency/api.py", line 137, in wait_for_call_in_loop_thread
    return call.result()
  File "/prefect/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 173, in result
    return self.future.result(timeout=timeout)
  File "/opt/python-3.10.8-ovh161/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/opt/python-3.10.8-ovh161/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/prefect/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 218, in _run_async
    result = await coro
  File "/prefect/lib/python3.10/site-packages/prefect/engine.py", line 1132, in get_task_call_return_value
    return await future._result()
  File "/prefect/lib/python3.10/site-packages/prefect/futures.py", line 240, in _result
    return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
  File "/prefect/lib/python3.10/site-packages/prefect/states.py", line 98, in _get_state_result
    result = await state.data.get()
  File "/prefect/lib/python3.10/site-packages/prefect/results.py", line 394, in get
    raise MissingResult("The result was not persisted and is no longer available.")
prefect.exceptions.MissingResult: The result was not persisted and is no longer available.
```
Our tasks are supposed to persist their results via `@task(persist_result=True)`, but it does not seem to work as intended.
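When `persist_result=True` appears to be ignored, one sanity check is to force persistence globally and pin where results are written via Prefect settings (a hedged sketch; the setting names are real in Prefect 2.x, but the path below is an assumed example). Note that a retried run can only resume if it can reach the same storage the original run wrote to.

```
# Force result persistence for all tasks/flows and pin the local
# storage path (adjust the path to your objectStorage setup):
prefect config set PREFECT_RESULTS_PERSIST_BY_DEFAULT=true
prefect config set PREFECT_LOCAL_STORAGE_PATH=/data/prefect-results
```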
Bug summary
A subset of our flows which make use of subflows are incorrectly reporting their terminal state as Failed, even when all subflows and tasks have completed.
These flows are triggered by a separate flow using the orchestrator pattern, and that orchestrator flow behaves as though the terminal state is Completed.
Logs from the agent running on EKS show the state initially reported as Success before switching to Failed.