PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

Support compression in flow storage #6167

Open kaliserichmond opened 2 years ago

kaliserichmond commented 2 years ago

Prefect Version

2.x

Describe the proposed behavior

Before the new deployment.yaml was introduced, users were using the FilePackager with the pickle serializer because they were worried about large S3 costs. Adding support for compressed formats such as .zip archives or tarballs would be a good replacement for that functionality.

Describe the current behavior

Currently, the deployment build CLI command automatically uploads individual files using file system blocks, and submitted runs use the same block to download these raw files. Some deployments need a large number of large files stored alongside the flow, so compressing the full directory before uploading and decompressing it after downloading would save users on storage costs.
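As a rough illustration of the proposed behavior, here is a minimal sketch that uses only the Python standard library. The helper names `pack_directory` and `unpack_directory` are hypothetical and are not part of Prefect; the upload and download steps through a file system block are left out.

```python
import io
import tarfile
from pathlib import Path


def pack_directory(src_dir: str) -> bytes:
    """Return the contents of src_dir as a gzip-compressed tarball (hypothetical helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # arcname="." stores member paths relative to src_dir.
        archive.add(src_dir, arcname=".")
    return buffer.getvalue()


def unpack_directory(payload: bytes, dest_dir: str) -> None:
    """Extract a tarball produced by pack_directory into dest_dir (hypothetical helper)."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as archive:
        archive.extractall(dest_dir)
```

The bytes returned by `pack_directory` would be written to storage as a single object, and a run would fetch that one object and call `unpack_directory` before loading the flow.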

Example Use

No response

Additional context

No response

gaby commented 2 years ago

👍🏻 for adding this and zstd support.

anna-geller commented 2 years ago

I fully understand that a single zip file might be easier to manage in certain scenarios. However, I don't fully understand the argument that this is about storage costs. The project is supposed to contain only flow code and, optionally, custom dependencies. If users have custom files that shouldn't be uploaded, they can add those to .prefectignore starting in the next release.

@kaliserichmond I'm not sure which user problems other than storage costs motivated this feature request, but I believe that .prefectignore already solves it in a flexible and simple way.

cc @cicdw looping you in for feedback

samdyzon commented 2 years ago

Unfortunately, the .prefectignore functionality doesn't really solve the problem for us, and I reckon there are others who have the same problem. I think storage cost isn't the main problem; the real issue is orchestration. If we have many flows in a project that we want to deploy, the current design leaves us with one of two options:

  1. You deploy all flows to a single storage block and overwrite all files in the storage, regardless of which flow you actually made changes to.
  2. You deploy each flow to their own separate storage block.

We have several environments to deploy our flows to, and we want to deploy each flow in isolation, i.e. when we make changes to flow A (and maybe a shared dependency), the code for flow B is unaffected. Ideally we could also roll back to a previous version of flow A (again, without affecting the code for flow B). These requirements eliminate option (1), because a single deployment can introduce changes to the storage that affect several flows.

We have hundreds of files in our flows repository, split roughly evenly between flow definition files and a shared library of tasks/utilities that are used throughout our flow code. With option (2) we must deploy all of those files to the storage block of every flow, so we're storing the same files over and over again. Storage is relatively cheap, but before a flow can execute we have to wait for all of those files to be downloaded from the storage. This introduces some problems:

  1. Start-up latency: the more files there are to download, the longer start-up takes, which in turn raises the risk of:
  2. Start-up failure: more moving parts over the network increase the risk that one or more files fail to download. Does Prefect have a robust way to retry network errors during start-up, or does the flow just fail?

The .prefectignore file could, in theory, ease this problem by allowing us to select only the files each flow needs, but in that scenario we would need a separate .prefectignore file for each deployment. Does Prefect support that?

A single zipped file of the flow and its dependencies alleviates the risks described above, since we only need to download one file from the storage and extract it into the local execution environment. It also allows us to use a single storage block for all flows in our project while still deploying each flow in isolation.
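To make the idea concrete, here is a minimal sketch of building one archive per flow with the standard library. Nothing here is a Prefect API: `build_flow_archive`, the directory layout, and the archive naming are all assumptions for illustration.

```python
import zipfile
from pathlib import Path


def build_flow_archive(
    project_dir: str, flow_file: str, shared_dirs: list[str], out_path: str
) -> list[str]:
    """Zip one flow file plus the shared directories it needs (hypothetical helper).

    Returns the archived names, relative to project_dir.
    """
    root = Path(project_dir)
    members = [root / flow_file]
    for shared in shared_dirs:
        # Collect every file under the shared directory in a stable order.
        members.extend(p for p in sorted((root / shared).rglob("*")) if p.is_file())
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in members:
            archive.write(path, arcname=path.relative_to(root).as_posix())
        return archive.namelist()
```

Writing each archive under a distinct key, for example one that includes the flow name and a version, would let many flows share one storage location while a deployment or roll-back of flow A leaves flow B's archive untouched.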

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

gaby commented 1 year ago

Still valid

zanieb commented 1 year ago

@billpalombi should this still be deferred?

billpalombi commented 1 year ago

I've accepted this, but it's low priority relative to the planned enhancements for projects and deployments.