PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

"No such file or directory" - orion.db-wal #6333

Open drfraser opened 2 years ago

drfraser commented 2 years ago

Bug summary

I am not sure what causes this bug. I can guess why orion.db-wal might be generated, but I have no idea what is going on internally.

My flows are fairly simple and run fine; this error has occurred only once. Could my flows be overlapping in time so that something gets scrambled? I need more background knowledge before I can help debug this.

The log message below was the only one associated with this run, as if the flow threw an error at the very start.

Reproduction

The flow has been running every hour for the past day without any issues, except for this one time.

Error

11:05:01.304 | ERROR   | Flow run 'quaint-mayfly' - Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "/home/railml/venv/lib/python3.8/site-packages/prefect/engine.py", line 247, in retrieve_flow_then_begin_flow_run
    flow = await load_flow_from_flow_run(flow_run, client=client)
  File "/home/railml/venv/lib/python3.8/site-packages/prefect/client.py", line 104, in with_injected_client
    return await fn(*args, **kwargs)
  File "/home/railml/venv/lib/python3.8/site-packages/prefect/deployments.py", line 47, in load_flow_from_flow_run
    await storage_block.get_directory(from_path=None, local_path=".")
  File "/home/railml/venv/lib/python3.8/site-packages/prefect/filesystems.py", line 98, in get_directory
    shutil.copytree(from_path, local_path, dirs_exist_ok=True)
  File "/usr/lib/python3.8/shutil.py", line 557, in copytree
    return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
  File "/usr/lib/python3.8/shutil.py", line 513, in _copytree
    raise Error(errors)
shutil.Error: [('/home/railml/etl/orion.db-wal', './orion.db-wal', "[Errno 2] No such file or directory: '/home/railml/etl/orion.db-wal'")]

Versions

Version:             2.0.3
API version:         0.8.0
Python version:      3.8.10
Git commit:          2f1cf4ac
Built:               Fri, Aug 5, 2022 3:57 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         hosted

Additional context

The flow before this one was in a bad state: a bug in my code caused a subtask to enter a Pending state while the main flow process ended. I am not sure how Prefect should handle that scenario, but from what I have read, SQLite deletes db-wal files once all database connections have been closed.

What if the first flow ended while the second flow had already started, then the stuck subtask ends and ultimately the db-wal file gets deleted when it shouldn't have been (due to the first flow breaking)? Thus the second flow breaks with this error?
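The SQLite behaviour referred to here can be observed in isolation. The following is a minimal sketch using only the standard library; the database name `example.db` and the function name `wal_lifecycle` are made up for illustration, and it assumes a stock SQLite build (some builds can be configured to keep the WAL file after close). It does not involve Prefect at all; it only shows that the `-wal` sidecar file exists while a WAL-mode connection is open and is normally removed when the last connection closes.

```python
import os
import sqlite3
import tempfile


def wal_lifecycle():
    """Return (wal_exists_while_open, wal_exists_after_close)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "example.db")  # hypothetical database name
        wal_path = db_path + "-wal"

        conn = sqlite3.connect(db_path)
        # Switch the database to write-ahead logging.
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        # The sidecar file is present while the connection is open.
        while_open = os.path.exists(wal_path)

        # Closing the last connection checkpoints the WAL; a stock build
        # then removes the sidecar file.
        conn.close()
        after_close = os.path.exists(wal_path)
        return while_open, after_close


if __name__ == "__main__":
    print(wal_lifecycle())
```

Anything that lists the directory while the connection is open and copies files after it closes can therefore see a name that no longer exists.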

drfraser commented 2 years ago

I think something like what I described happened, and since the root cause is the bug in my code, I feel this issue ought to be closed. I have tried to reproduce the problem and can't, so the only way to trigger it might be to delete the db-wal file manually, to set up a subtask/flow to do so, or to have a subtask explicitly go into Pending while the main flow ends.
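One way to trigger the same `shutil.Error` deterministically, without Prefect or SQLite, is to make a file disappear between the directory listing and the copy. The sketch below is an assumption about the mechanism, not a confirmed diagnosis: the function names and file names are hypothetical, and the custom `copy_function` simply deletes any `-wal` file just before copying it to simulate the file vanishing.

```python
import os
import shutil
import tempfile


def copy_with_vanishing_wal():
    """Copy a directory while a '-wal' file vanishes; return copytree's errors."""
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        # Placeholder files standing in for flow code and SQLite files.
        for name in ("flow.py", "orion.db", "orion.db-wal"):
            with open(os.path.join(src, name), "w") as fh:
                fh.write("placeholder")

        def vanishing_copy(source, target):
            # Simulate another process removing the WAL file after
            # copytree has listed the directory but before it copies.
            if str(source).endswith("-wal"):
                os.remove(source)
            return shutil.copy2(source, target)

        try:
            shutil.copytree(
                src, dst, copy_function=vanishing_copy, dirs_exist_ok=True
            )
        except shutil.Error as exc:
            # exc.args[0] is a list of (src, dst, message) tuples.
            return exc.args[0]
        return []


if __name__ == "__main__":
    for failed_src, failed_dst, message in copy_with_vanishing_wal():
        print(failed_src, "->", failed_dst, ":", message)
```

The returned list has the same shape as the one in the traceback above: one tuple per file that could not be copied.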

peytonrunyan commented 2 years ago

Thanks a ton for the thorough issue and for trying to reproduce this! If you bump up against this again, please reopen this issue and we'll take a look. I'll also reopen if another user encounters this.

drfraser commented 2 years ago

It has occurred again. The flow being run was the only live one at the time, and according to the UI it did not even start properly (the subsequent runs of this flow over the next few days have all been fine).

Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "/home/railml/venv/lib/python3.8/site-packages/prefect/engine.py", line 247, in retrieve_flow_then_begin_flow_run
    flow = await load_flow_from_flow_run(flow_run, client=client)
  File "/home/railml/venv/lib/python3.8/site-packages/prefect/client.py", line 104, in with_injected_client
    return await fn(*args, **kwargs)
  File "/home/railml/venv/lib/python3.8/site-packages/prefect/deployments.py", line 46, in load_flow_from_flow_run
    await storage_block.get_directory(from_path=None, local_path=".")
  File "/home/railml/venv/lib/python3.8/site-packages/prefect/filesystems.py", line 98, in get_directory
    shutil.copytree(from_path, local_path, dirs_exist_ok=True)
  File "/usr/lib/python3.8/shutil.py", line 557, in copytree
    return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
  File "/usr/lib/python3.8/shutil.py", line 513, in _copytree
    raise Error(errors)
shutil.Error: [('/home/railml/etl/orion.db-wal', './orion.db-wal', "[Errno 2] No such file or directory: '/home/railml/etl/orion.db-wal'")]
01:16:05 PM

peytonrunyan commented 2 years ago

Thanks for the heads up! I've reopened the issue and I'll give this a more thorough look when I've got a moment.

drfraser commented 2 years ago

This happened again this morning, so if you can tell me where to put extra logging statements, I probably can give you more context in a day or two.

zanieb commented 2 years ago

This looks like a weird issue related to the temporary creation of a WAL file in the middle of the copytree operation. cc @cicdw
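If the cause is indeed a transient WAL file appearing and disappearing during the copy, one possible mitigation is to keep SQLite sidecar files out of the copy altogether. The sketch below is not Prefect's implementation or its fix; `copy_without_sqlite_sidecars` is a hypothetical helper that calls `shutil.copytree` with the standard `ignore` argument, and the pattern list is an assumption about which files should be skipped.

```python
import os
import shutil
import tempfile


def copy_without_sqlite_sidecars(from_path, local_path):
    """Copy a directory tree, skipping transient SQLite sidecar files."""
    # Assumed pattern list: WAL, shared-memory, and rollback-journal files.
    ignore = shutil.ignore_patterns("*.db-wal", "*.db-shm", "*.db-journal")
    shutil.copytree(from_path, local_path, dirs_exist_ok=True, ignore=ignore)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        for name in ("flow.py", "orion.db-wal"):
            with open(os.path.join(src, name), "w") as fh:
                fh.write("placeholder")
        copy_without_sqlite_sidecars(src, dst)
        print(sorted(os.listdir(dst)))
```

Because ignored names are filtered out before any copy is attempted, a sidecar file that vanishes mid-copy no longer produces an error. Keeping the database outside the directory that holds the flow code would avoid the problem at its source.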

peytonrunyan commented 2 years ago

Is this one still in progress or resolved?

azizrh commented 1 year ago

This has also happened to me. Can you help me with this? It happens while my flow is already in production.

Downloading flow code from storage at 'D:\\PREFECT \\HOME\\DIRECTORY'
06:00:00 AM

Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "C:\PYTHON_VIRTUAL_ENVIRONMENT\DIRECTORY\Documents\prefect_ENV_PRD\env\Lib\site-packages\prefect\engine.py", line 318, in retrieve_flow_then_begin_flow_run
    flow = await load_flow_from_flow_run(flow_run, client=client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\PYTHON_VIRTUAL_ENVIRONMENT\DIRECTORY\Documents\prefect_ENV_PRD\env\Lib\site-packages\prefect\client\utilities.py", line 40, in with_injected_client
    return await fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\PYTHON_VIRTUAL_ENVIRONMENT\DIRECTORY\Documents\prefect_ENV_PRD\env\Lib\site-packages\prefect\deployments.py", line 197, in load_flow_from_flow_run
    await storage_block.get_directory(from_path=deployment.path, local_path=".")
  File "C:\PYTHON_VIRTUAL_ENVIRONMENT\DIRECTORY\Documents\prefect_ENV_PRD\env\Lib\site-packages\prefect\filesystems.py", line 147, in get_directory
    copytree(from_path, local_path, dirs_exist_ok=True)
  File "C:\Program Files\Python311\Lib\shutil.py", line 561, in copytree
    return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\shutil.py", line 515, in _copytree
    raise Error(errors)
shutil.Error: [('D:\\PREFECT \\HOME\\DIRECTORY\\home\\prefect.db-shm', 'C:\\PYTHON_VIRTUAL_ENVIRONMENT\\DIRECTORY\\AppData\\Local\\Temp\\tmpymczar64prefect\\home\\prefect.db-shm', "[Errno 2] No such file or directory: 'D:\\\\PREFECT \\\\HOME\\\\DIRECTORY\\\\home\\\\prefect.db-shm'"), ('D:\\PREFECT \\HOME\\DIRECTORY\\home\\prefect.db-wal', 'C:\\PYTHON_VIRTUAL_ENVIRONMENT\\DIRECTORY\\AppData\\Local\\Temp\\tmpymczar64prefect\\home\\prefect.db-wal', "[Errno 2] No such file or directory: 'D:\\\\PREFECT \\\\HOME\\\\DIRECTORY\\\\home\\\\prefect.db-wal'")]
06:00:08 AM