tekumara opened 2 years ago
Hi thanks for sharing this!
I was wondering if you could test this with the newest Prefect version:

```shell
pip install -U "prefect>=2.0"
```
Also, I suspect you need to use remote storage for Ray if a remote address is provided, see: https://orion-docs.prefect.io/concepts/storage/
This still occurs in prefect 2.3.2.
I'm trying to run directly against the cluster without a Deployment, and so there is no explicit storage involved.
I would have thought it possible to run the flow locally, but submit tasks to the remote ray cluster.
From the stack trace above it looks like the error is occurring in the filesystems module.
Digging into this further, it looks like what is happening is that Prefect is persisting task run results to `PREFECT_LOCAL_STORAGE_PATH`. This path defaults to `${PREFECT_HOME}/storage`. The problem seems to be that Prefect resolves `${PREFECT_HOME}` in the process running the flow (eg my laptop, where it is `/Users/tekumara/.prefect`) and not in the process running the task (eg on the Ray cluster, where it would be `/home/ray/.prefect`).
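The laptop-vs-worker resolution can be illustrated with a small standalone sketch. This is a toy model, not Prefect's actual code: `default_storage_path` is a hypothetical stand-in for however the setting's default is computed.

```python
import os
from pathlib import Path

# Toy model (not Prefect's actual implementation) of a storage path default
# that is resolved in whichever process evaluates it.
def default_storage_path() -> Path:
    prefect_home = Path(os.environ.get("PREFECT_HOME", str(Path.home() / ".prefect")))
    return prefect_home / "storage"

# On the laptop this resolves against the laptop's home directory; if the
# resulting absolute path is then shipped to the Ray worker, the worker tries
# to write somewhere that may not exist there.
os.environ["PREFECT_HOME"] = "/Users/tekumara/.prefect"
print(default_storage_path())  # /Users/tekumara/.prefect/storage
```

The fix has to ensure the default is evaluated (or overridden) in the process that actually writes the result.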
My current workaround is, prior to running the flow, to set `PREFECT_LOCAL_STORAGE_PATH` to a writable path inside the container running on the Ray cluster, eg:

```shell
export PREFECT_LOCAL_STORAGE_PATH=/tmp/prefect/storage
```
Thanks so much for helping debug this!
I think we could fix this by wrapping `call.func` in `task_runner.submit` with `temporary_settings`:
https://github.com/PrefectHQ/prefect/blob/b65d1366eeb89fb3593a546459859d825af8f37d/src/prefect/settings.py#L806
e.g.
```python
async def submit(
    self,
    key: UUID,
    call: Callable[..., Awaitable[State[R]]],
) -> None:
    def _submit_call_func_wrapper(*args, **kwargs):
        # Apply the settings override inside the worker process, not the driver
        with temporary_settings(
            updates={PREFECT_LOCAL_STORAGE_PATH: "/tmp/prefect/storage"}
        ):
            return call.func(*args, **kwargs)

    if not self._started:
        raise RuntimeError(
            "The task runner must be started before submitting work."
        )

    call_kwargs = self._optimize_futures(call.keywords)

    # Ray does not support the submission of async functions and we must create a
    # sync entrypoint
    self._ray_refs[key] = ray.remote(
        sync_compatible(_submit_call_func_wrapper)
    ).remote(**call_kwargs)
```
And maybe could resolve https://github.com/PrefectHQ/prefect-ray/issues/37 with Michael's suggestion too.
Would you be interested in contributing a PR?
We also ran into this. The workaround we used is setting `export PREFECT_HOME="/tmp/prefect"` on the laptop, since `/tmp/prefect` exists both on the laptop and on the cluster. We should try to fix this upstream, or otherwise every single user of `prefect-ray` will run into this problem :)
I'll attempt to address this as a part of https://github.com/PrefectHQ/prefect/pull/6908
Thanks, that's awesome! I'm happy to try out a PR once it is ready, I have a setup that reproduces the problem. Also happy to try out if https://github.com/PrefectHQ/prefect/issues/13015 fixes the problem in case that helps you @madkinsz :)
I think we'll want to consider more fundamentally the "local persistence for tasks on remote workers" story. I added a couple tickets to the tracking pull request:
- Ensure task run results persisted to local file systems on remote workers respect relative paths
- Investigate storing results after return from the remote worker for task runs with local file systems
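A hedged sketch of what the first ticket could mean in practice: persist a machine-independent relative key, and resolve it against whichever storage root is configured on the machine doing the reading or writing (the helper and its name are hypothetical, not Prefect's API):

```python
from pathlib import Path

# Hypothetical helper: resolve a machine-independent result key against the
# storage root configured on the current machine.
def resolve_result_path(storage_root: str, relative_key: str) -> Path:
    return Path(storage_root) / relative_key

# The same key resolves correctly on each machine, using that machine's root:
print(resolve_result_path("/Users/tekumara/.prefect/storage", "abc123"))
print(resolve_result_path("/home/ray/.prefect/storage", "abc123"))
```

With this scheme, only the relative key needs to travel between the flow process and the Ray worker.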
Btw, unfortunately the workaround from https://github.com/PrefectHQ/prefect/issues/13015 is not working for me, even after some obvious modifications to fix its obvious problems (like shuffling `sync_compatible` into the function). Maybe `temporary_settings` is not enough (I have a feeling these settings are pickled with cloudpickle, and `temporary_settings` alone can't override that, but I didn't dig deep enough to really understand what is going on).
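That suspicion can be illustrated with a toy model (this is not Prefect's actual settings implementation): an override held in the driver's context does not automatically follow the function into a fresh execution context.

```python
import contextvars

# Toy model: a setting held in a context variable, as context-manager style
# overrides typically are.
storage_path = contextvars.ContextVar("storage_path", default="/root/.prefect/storage")

def run_task():
    # Reads the setting in whatever context (or process) executes the task.
    return storage_path.get()

token = storage_path.set("/tmp/prefect/storage")
assert run_task() == "/tmp/prefect/storage"  # the driver sees the override

# ...but a fresh, empty context (standing in for a new Ray worker process)
# sees only the default, unless the override travels with the pickled task.
fresh = contextvars.Context()
assert fresh.run(run_task) == "/root/.prefect/storage"
storage_path.reset(token)
```

If Prefect's settings behave anything like this, the override has to be applied inside the worker, which is what the monkeypatch below does by brute force.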
What did work for me is the following atrocious hack:
```diff
diff --git a/prefect_ray/task_runners.py b/prefect_ray/task_runners.py
index deb7a8d..fcafb74 100644
--- a/prefect_ray/task_runners.py
+++ b/prefect_ray/task_runners.py
@@ -79,6 +79,11 @@ import anyio
 import ray
 from prefect.futures import PrefectFuture
 from prefect.orion.schemas.states import State
+from prefect.settings import (PREFECT_HOME,
+                              PREFECT_PROFILES_PATH,
+                              PREFECT_LOCAL_STORAGE_PATH,
+                              PREFECT_LOGGING_SETTINGS_PATH,
+                              PREFECT_ORION_DATABASE_CONNECTION_URL)
 from prefect.states import exception_to_crashed_state
 from prefect.task_runners import BaseTaskRunner, R, TaskConcurrencyType
 from prefect.utilities.asyncutils import sync_compatible
@@ -116,6 +121,13 @@ class RayTaskRunner(BaseTaskRunner):
         address: str = None,
         init_kwargs: dict = None,
     ):
+        import pathlib
+        PREFECT_HOME.value = lambda: pathlib.Path("/tmp/prefect")
+        PREFECT_PROFILES_PATH.value = lambda: pathlib.Path("/tmp/prefect/profiles.toml")
+        PREFECT_LOCAL_STORAGE_PATH.value = lambda: pathlib.Path("/tmp/prefect/storage")
+        PREFECT_LOGGING_SETTINGS_PATH.value = lambda: pathlib.Path("/tmp/prefect/logging.yml")
+        # Note: the connection URL is a string, not a filesystem path
+        PREFECT_ORION_DATABASE_CONNECTION_URL.value = lambda: "sqlite+aiosqlite:////tmp/prefect/orion.db"
+
         # Store settings
         self.address = address
         self.init_kwargs = init_kwargs.copy() if init_kwargs else {}
```
Any updates on this?
As a workaround for this issue, we are adding instructions to the `prefect-ray` documentation in PrefectHQ/prefect-ray#47 recommending that the `PREFECT_LOCAL_STORAGE_PATH` setting be updated to a path available both on the Ray worker and in the flow execution environment. This is not a perfect solution, but it should unblock current use cases. We will continue to work on improving results management when local storage is used in conjunction with remote workers.
When running a flow from my laptop against a remote Ray cluster, Prefect tries to reference directories that only exist on my laptop (eg `/Users/tekumara/.prefect/storage`):
_flows/rayflow.py:
prefect 2.0b12