PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

Expose a way for users to clear cache keys #10494

Open EmilRex opened 1 year ago

EmilRex commented 1 year ago

First check

Prefect Version

2.x

Describe the current behavior

It is often the case that when users first use Prefect's task caching feature, they do not set a cache expiration. This means that there may be unwanted cached results for all future runs of a flow. Locally this can be overcome by resetting the database (prefect server database reset), but this also destroys any other metadata and is not an option with Cloud. As far as I can tell, there is not a method for clearing cache keys via the API.

Describe the proposed behavior

It would be super useful to be able to clear cache keys via the API, CLI, UI, or ideally all three. In most practical scenarios, keys need to be cleared on the flow or deployment level, not necessarily the individual key level. With that being the case, ideally cache keys could be cleared based on a flow name or a flow and deployment name combo.

Example Use

As an illustration of the above:

prefect flow clear-cache --name "my-flow"

prefect deployment clear-cache --name "my-flow/my-deployment"

Additional context

No response

OptimeeringBigya commented 1 year ago

As far as I know, cache keys are not bound to flows or deployments.

Additionally, there is no API to check whether a cache key is still valid (i.e., not expired).

ymtricks commented 1 year ago

The problem with the current cache behavior is that cache keys are not unique: they are essentially just tags on task results, so there can be multiple results with the same key and different TTLs, or no TTL at all. (This could be better documented and explained, by the way.) Because of this design, there is no way to evict a cache entry just by running a task, which is why at least some complementary mechanism is required. The way we work around the issue is by appending a "cache version" to each key; whenever we want to evict the old cache, we just bump the version.
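The "cache version" workaround can be sketched as a key function that folds a version string into the input hash, so bumping the version invalidates every previously cached result. This is a minimal standalone sketch; `versioned_cache_key` and `CACHE_VERSION` are illustrative names, not Prefect API:

```python
import hashlib
import json

CACHE_VERSION = "v2"  # bump this to invalidate all previously cached results

def versioned_cache_key(task_name: str, parameters: dict) -> str:
    # Hash the task inputs deterministically (sorted keys so dict
    # ordering doesn't change the key).
    payload = json.dumps(parameters, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    # Old results remain in storage under the old version's keys;
    # they simply stop matching, which acts as an eviction.
    return f"{task_name}-{CACHE_VERSION}-{digest}"
```

Since keys are only tags on results, nothing is deleted by this scheme; the stale entries just become unreachable, which is exactly the trade-off ymtricks describes.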

mgsnuno commented 11 months ago

@ymtricks in a similar way we ended up using this to mitigate our caching issues:

from prefect.context import FlowRunContext
from prefect.tasks import task_input_hash

def _cache_key_fn(context, parameters):
    # Scope the cache key to the current flow, flow version, and
    # deployment so results from other flows/deployments never collide,
    # and bumping the flow version implicitly invalidates old entries.
    flow_run = FlowRunContext.get().flow_run
    cache_key = (
        f"{context.task.name}-{flow_run.flow_id}-{flow_run.flow_version}-"
        f"{flow_run.deployment_id}-{task_input_hash(context, parameters)}"
    )
    return cache_key

j-tr commented 11 months ago

It would be extremely helpful to have some functionality that also clears the remote storage for cleared and expired cache keys. We are piling up significant amounts of cache data in an S3 bucket, and there's no way to delete it without risking running into the issue outlined in https://github.com/PrefectHQ/prefect/issues/8892

limx0 commented 11 months ago

I would like to see this implemented (and preferably a way to clear individual task keys also).

N-Demir commented 8 months ago

+1, cache management is very difficult in Prefect, which makes using caching basically impossible. Worst of all, you don't realize the scale of the limitations until you're heavily using it.

Ben-Epstein commented 8 months ago

Adding to this: something I've noticed that is a bit confusing in the Prefect Cloud case is that the task cache seems to be bifurcated between two places:

  1. the prefect cloud database (which we have no control over)
  2. our specified cache storage location (we control)

One might think that, in order to clear the cache, one could delete the cache data in the cache storage location (say, S3). But if you do that, Prefect will:

  1. check the cache key in the Prefect Cloud DB
  2. see a cache hit, and check S3
  3. find no data in S3
  4. raise an exception

Since we can't control the database, one simple solution would be to change the behavior (or enable an alternative behavior) so that missing data in the specified cache location invalidates the cache entry. This would align more closely with an actual cache: it would simply be a "cache miss". Given this bifurcation, giving the user control over what counts as a cache miss would be helpful.
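The proposed "treat missing data as a cache miss" semantics can be sketched generically. This illustrates the suggested behavior, not Prefect's actual implementation; `load_or_recompute` is a hypothetical name, and JSON files stand in for the real result storage:

```python
import json
from pathlib import Path

def load_or_recompute(cache_dir: Path, key: str, compute):
    """Look up `key` in file-based storage; treat a missing file as a
    cache miss and recompute, rather than raising an exception."""
    path = cache_dir / key
    if path.exists():
        # Cache hit: the metadata said "cached" AND the data is present.
        return json.loads(path.read_text())
    # Cache miss: the data was deleted (or never written), so rerun
    # the computation and repopulate storage.
    result = compute()
    path.write_text(json.dumps(result))
    return result
```

With this behavior, deleting objects from the storage bucket is a safe, user-controlled way to clear the cache, because the orchestrator's metadata can never point at data that no longer exists.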

cicdw commented 1 month ago

Hey everyone - we've heard this feedback, and in 3.0 cache keys are now explicit filename references.

To be more precise, 3.0 works as follows: when a task runs, it first computes its cache key and then does a lookup in the configured result location; if the file exists and is unexpired, the data is loaded and returned. If the file does not exist, the cache is considered invalid and the task is rerun (no error is raised!). This is very similar to @Ben-Epstein's proposal above. It means that "clearing the cache" can be achieved by removing the result file.
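Since keys are filenames under this model, a per-flow "clear-cache" (as requested at the top of the thread) reduces to deleting the matching result files. A minimal sketch, assuming cache keys are prefixed with a flow name as in the cache-key-fn workarounds earlier in this thread; `clear_cache` and the prefix convention are illustrative, not Prefect API:

```python
from pathlib import Path

def clear_cache(result_dir: Path, prefix: str) -> int:
    """Delete every cached result file whose key starts with `prefix`
    (e.g. a flow name), returning the number of entries cleared."""
    removed = 0
    for path in result_dir.glob(f"{prefix}*"):
        path.unlink()
        removed += 1
    return removed
```

Because a missing file is simply a cache miss (no exception), this kind of deletion is safe to run at any time; affected tasks just recompute on their next run.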

The full exposition on 3.0 caching can be found here, and I'd be curious to know if that satisfies the various use cases everyone in this thread has. And if not, I'd love to hear that too.