PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

Allow globally disabling task run result persistence #5888

Closed marvin-robot closed 2 years ago

marvin-robot commented 2 years ago

Opened from the Prefect Public Slack Community

tim.enders: What is the equivalent to this in 2.0, if there is one currently? @task(checkpoint=False)

kevin701: None yet. It’s tied to results and configurability of results is not out yet

tim.enders: OK, cool. Gonna have to put 2.0 down then it seems. When I parallelize the operations it seems to want to spam getting a token from each Dask run. I know that checkpoint was the solution in 1.0

anna: @Marvin open "Allow globally disabling task run results as it's a blocker for 2.0 adoption"

Original thread can be found here.

zanieb commented 2 years ago

This sounds like disabling checkpointing resolved another issue and that persisting results isn't actually the core issue here? We can take the feature request to disable persistence, but it sounds like resolving this Dask issue would be more meaningful for this user.

anna-geller commented 2 years ago

Thanks @madkinsz. I agree this issue may be better resolved in some other way, but this is only one example request; @kvnkho saw more users with issues that occur due to checkpointing results (e.g. not enough memory to pass large dataframes between tasks that don't really need Results checkpointing)

it would be great if there was a way to disable it - it could potentially solve even issues such as this one: https://github.com/PrefectHQ/prefect/issues/5866

kvnkho commented 2 years ago

Yes the current checkpointing (if that is the one being inserted into the database) is causing a lot of HTTPX timeouts when people move from local Orion to Cloud 2.0. That makes Cloud 2.0 seem unstable, but I think it's really a timeout due to a large payload.

zanieb commented 2 years ago

> Yes the current checkpointing (if that is the one being inserted into the database) is causing a lot of HTTPX timeouts when

This is a separate bug and will be fixed in the next release.

> (e.g. not enough memory to pass large dataframes between tasks that don't really need Results checkpointing)

Stashing the value in a file and using a reference would only help with memory constraints here? Data needs to pass between tasks regardless of any checkpointing settings.
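To make the "file plus reference" idea above concrete, here is a minimal standard-library sketch. The helper names `stash` and `load` are hypothetical and are not part of Prefect; the sketch only shows that a small path can stand in for a large value between two steps, while the consumer still has to load the full value to use it.

```python
import pickle
import tempfile
from pathlib import Path


def stash(value, directory):
    """Pickle `value` into `directory` and return the file path as a reference."""
    with tempfile.NamedTemporaryFile(dir=directory, suffix=".pkl", delete=False) as handle:
        pickle.dump(value, handle)
    return Path(handle.name)


def load(reference):
    """Read the pickled value back from the referenced file."""
    with reference.open("rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    big_value = list(range(100_000))
    reference = stash(big_value, workdir)  # only this small Path is handed on
    del big_value                          # the producer can release its copy
    restored = load(reference)             # the consumer still loads everything
    round_trip_ok = restored == list(range(100_000))
```

The reference reduces how long the large object has to stay in memory on the producing side, but the data is still written once and read once, which matches the point that data has to move between tasks either way.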

> it would be great if there was a way to disable it - it could potentially solve even issues such as this one: https://github.com/PrefectHQ/prefect/issues/5866

Similarly, this is an issue with unpicklable data. Disabling checkpointing can help in some cases, but data still needs to be pickled for transport across tasks for several task runner types.
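As an illustration of the unpicklable-data point, the following sketch uses only the Python standard library, not Prefect. The helper `can_pickle` is a hypothetical name; it shows that plain data pickles fine while objects such as locks and database connections do not, regardless of whether results are persisted.

```python
import pickle
import sqlite3
import threading


def can_pickle(value):
    """Return True if `value` survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(value)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


connection = sqlite3.connect(":memory:")

plain_data_ok = can_pickle({"rows": [1, 2, 3]})  # ordinary containers pickle
lock_ok = can_pickle(threading.Lock())           # locks cannot be pickled
connection_ok = can_pickle(connection)           # neither can DB connections

connection.close()
```

Any runner that ships task inputs or outputs to another process would hit the same failure for the last two values, which is why turning off checkpointing alone may not resolve such cases.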

anna-geller commented 2 years ago

@madkinsz thanks for all the explanation, this helps a lot

> data still needs to be pickled for transport across tasks for several task runner types

for Dask and Ray, correct? Concurrent and Sequential should work?

zanieb commented 2 years ago

> for Dask and Ray, correct? Concurrent and Sequential should work?

In theory, we shouldn't need to pickle things for those, yeah. Although things like database connections still might not share well across threads.
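The caveat about sharing database connections across threads can be reproduced with the standard library alone. This sketch assumes SQLite as a stand-in for "a database connection"; other drivers may behave differently. By default, `sqlite3` refuses to use a connection from a thread other than the one that created it.

```python
import sqlite3
import threading

# check_same_thread defaults to True, so the connection is tied to this thread.
connection = sqlite3.connect(":memory:")
errors = []


def query_from_worker():
    """Try to use the main thread's connection from a worker thread."""
    try:
        connection.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        errors.append(exc)


worker = threading.Thread(target=query_from_worker)
worker.start()
worker.join()
connection.close()
```

After the worker finishes, `errors` holds one `sqlite3.ProgrammingError`, even though nothing was pickled along the way.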

reynoldsm88 commented 2 years ago

@anna-geller I believe that @kvnkho and @madkinsz discussed this on Slack, but I would like to add my +1 to this issue. It represents a significant challenge to our ability to adopt Prefect, in particular all the functionality that comes with Orion.

tibuch commented 2 years ago

Hi all, I got redirected from Slack and was asked to provide a few more details about my use-case ✌

My use-case involves handling of multi-dimensional image data from 2D up to 5D (3D volumes over time with multiple channels). The smaller images usually start with a size around 350MB but larger volumes can easily reach 10-20GB with whole datasets reaching multiple TB.
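For a rough sense of scale, the arithmetic behind such sizes can be sketched as below. The shape and the 16-bit voxel type are assumptions chosen for illustration; they are not taken from the comment above.

```python
def volume_bytes(shape, bytes_per_voxel):
    """Uncompressed size in bytes of a dense array with the given shape."""
    total = bytes_per_voxel
    for extent in shape:
        total *= extent
    return total


# Hypothetical 5D image: (time, channel, z, y, x) stored as 16-bit integers.
shape = (10, 2, 256, 1024, 1024)
size = volume_bytes(shape, bytes_per_voxel=2)
size_gib = size / 2**30
```

With these assumed numbers a single image comes to 10 GiB, so persisting every intermediate copy would multiply storage use by the number of tasks that return an image.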

The current implementation of caching in Prefect 2.0 is rather deadly for such a use-case, because I would have to be extra careful to never pass an image between tasks. Otherwise caching will save (potentially very large) image files to disk, blowing up my file-storage.

What I really like about Prefect 1.0 is the fact that checkpointing is a conscious decision, i.e. I don't duplicate TBs of data by accident. And by passing a custom LocalResult type I am able to define how and where my cached results should be stored. Oftentimes we want to look at the intermediate (cached) results to verify processing steps.

Currently, Prefect 2.0 is not a feasible solution for large image processing pipelines where I only want to cache very few task results in an accessible way. I would also be scared of accidentally filling up our storage system by running.

My dream-world-wish-list for caching in Prefect 2.0 would include:

  1. Disabling the default caching for a whole workspace --> Making caching an active developer choice.
  2. Having access to the LocalResult concept.
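The two wishes above could look roughly like the following standard-library sketch. Everything here is hypothetical: `opt_in_task` and `checkpoint_dir` are illustrative names, not Prefect's API. It only shows the requested behaviour, where nothing is written unless the developer asks for it and names the location.

```python
import functools
import pickle
import tempfile
from pathlib import Path


def opt_in_task(fn=None, *, checkpoint_dir=None):
    """Hypothetical decorator: persist the result only if checkpoint_dir is set."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if checkpoint_dir is not None:
                # Explicit opt-in: write the result where the developer chose.
                target = Path(checkpoint_dir) / f"{func.__name__}.pkl"
                target.write_bytes(pickle.dumps(result))
            return result

        return wrapper

    # Support both @opt_in_task and @opt_in_task(checkpoint_dir=...).
    return decorate(fn) if fn is not None else decorate


with tempfile.TemporaryDirectory() as out:

    @opt_in_task
    def load_image():
        return [0] * 10  # stand-in for a large image; never written to disk

    @opt_in_task(checkpoint_dir=out)
    def segment(image):
        return [value + 1 for value in image]  # the one result worth inspecting

    labels = segment(load_image())
    written = sorted(path.name for path in Path(out).iterdir())
```

Only `segment.pkl` ends up in the chosen directory, so intermediate images passed between tasks are never duplicated on disk by accident.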

Thank you, and I'm happy to jump on a Zoom call to dive deeper into (and better explain) my use-case.