dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
12.01k stars, 1.5k forks

Using files as inputs and outputs #2187

Open multimeric opened 4 years ago

multimeric commented 4 years ago

I'm writing a pipeline that does a lot of file manipulation: downloading a file, manipulating it, saving it again, and so on. I also want the workflow to run in the cloud if I scale it up. The files in my workflow are not Materializations, because they are the main inputs and outputs of a solid. I feel that the documentation doesn't really explain how to do this.

There are FileHandle, LocalFileHandle, and Path types built into dagster. Do any of them solve my need for platform-agnostic file storage?

Does the following solid for downloading a file make sense?

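import tempfile
import requests
from dagster import LocalFileHandle, solid

@solid()
def download_file(context, url) -> LocalFileHandle:
    req = requests.get(url, allow_redirects=True)
    # NamedTemporaryFile supports delete=False, so the file stays on disk after the block.
    with tempfile.NamedTemporaryFile(delete=False) as temp:
        temp.write(req.content)
    # Return a handle to match the annotation instead of yielding a bare string.
    return LocalFileHandle(temp.name)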

What we've heard:

alangenfeld commented 4 years ago

We are definitely lacking out-of-the-box tools to solve this problem elegantly. The Handle types you've bumped into are a bit stale, and date from when we expected this type of pattern to be common.

My recommendation would be to look at using the Resource and DagsterType systems to abstract how you want to think about these file resources in your workflows. Using these two systems together should allow you to make a single workflow that can run in various permutations of where the files are sourced (local/remote) and where the computation is happening (local/cloud). Defining your own resources and dagster types will allow you to encode the exact expected behavior you want to achieve.

https://dagster.readthedocs.io/en/stable/sections/learn/guides/dagster_types.html
https://dagster.readthedocs.io/en/stable/sections/tutorial/resources.html
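To make the resource idea concrete, here is a minimal sketch of the abstraction in plain Python. It is not Dagster's API: `FileStore`, `LocalFileStore`, `InMemoryFileStore`, and `uppercase_file` are hypothetical names, and the in-memory class only stands in for a remote store. The assumption is that an object like this would be exposed to solids as a resource, so the same step body runs whether the files are local or remote.

```python
import tempfile
from pathlib import Path
from typing import Dict, Protocol


class FileStore(Protocol):
    """Hypothetical interface: where files live is hidden behind two methods."""

    def write(self, key: str, data: bytes) -> None: ...
    def read(self, key: str) -> bytes: ...


class LocalFileStore:
    """Keeps files under a base directory on the local disk."""

    def __init__(self, base_dir: str) -> None:
        self.base = Path(base_dir)
        self.base.mkdir(parents=True, exist_ok=True)

    def write(self, key: str, data: bytes) -> None:
        (self.base / key).write_bytes(data)

    def read(self, key: str) -> bytes:
        return (self.base / key).read_bytes()


class InMemoryFileStore:
    """Stand-in for a remote store (an object-store client would go here)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self.blobs[key] = data

    def read(self, key: str) -> bytes:
        return self.blobs[key]


def uppercase_file(store: FileStore, src_key: str, dst_key: str) -> str:
    """Step body: depends only on the interface, not on where files live."""
    store.write(dst_key, store.read(src_key).upper())
    return dst_key


# The same step runs against either backend.
local = LocalFileStore(tempfile.mkdtemp())
local.write("in.txt", b"hello")
uppercase_file(local, "in.txt", "out.txt")

remote = InMemoryFileStore()
remote.write("in.txt", b"hello")
uppercase_file(remote, "in.txt", "out.txt")
```

Steps pass keys rather than file contents, so only the store implementation changes between a local run and a cloud run.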

mgasner commented 4 years ago

#2282

multimeric commented 3 years ago

#2282

This issue was closed. Has there been any other notable progress on using files?

multimeric commented 3 years ago

I'm guessing the main change is the IO manager, introduced in 0.10.0. Is there any example code that uses files with these managers?
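For illustration only, a file-oriented IO manager might be shaped roughly like the sketch below. It mirrors the two methods an IO manager implements, `handle_output` and `load_input`, but it is a plain class so that it runs without Dagster installed; in real use it would presumably subclass `dagster.IOManager`. `LocalFileIOManager` and the directory layout are hypothetical, and the `SimpleNamespace` objects only stand in for the context objects the framework passes in.

```python
import shutil
import tempfile
from pathlib import Path
from types import SimpleNamespace


class LocalFileIOManager:
    """Shaped like an IO manager: handle_output stores, load_input retrieves.

    Assumption: a real version would subclass dagster.IOManager.
    """

    def __init__(self, base_dir: str) -> None:
        self.base = Path(base_dir)

    def _path_for(self, output_context) -> Path:
        # One managed location per (step, output name) pair.
        return self.base / output_context.step_key / output_context.name

    def handle_output(self, context, obj: str) -> None:
        # obj is the path of a file the step wrote; copy it into managed storage.
        dest = self._path_for(context)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(obj, dest)

    def load_input(self, context) -> str:
        # Downstream steps receive the managed path, not the file contents.
        return str(self._path_for(context.upstream_output))


# Stand-ins for the output and input contexts.
out_ctx = SimpleNamespace(step_key="download_file", name="result")
in_ctx = SimpleNamespace(upstream_output=out_ctx)

manager = LocalFileIOManager(tempfile.mkdtemp())
scratch = Path(tempfile.mkdtemp()) / "payload.bin"
scratch.write_bytes(b"example bytes")

manager.handle_output(out_ctx, str(scratch))
loaded_path = manager.load_input(in_ctx)
```

The step returns a path to a scratch file, and the manager decides where the durable copy lives. Swapping the copy for an upload would move storage to the cloud without touching the step.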

spenczar commented 1 year ago

What is the current status of this? Objects like LocalFileManager are still present in the codebase (https://github.com/dagster-io/dagster/blob/7a8ba5c303b31a6af197177999d49166097711fa/python_modules/dagster/dagster/_core/storage/file_manager.py#L233) but are constructed as resources rather than IOManagers. Still, they seem to do exactly what I need for my use case - some of my ops construct SQLite databases which I want to pass around, and the IOManager framework doesn't seem to match that very well.

mfasanya commented 1 year ago

I'm still looking for a way to do this. We output .geojson files and sometimes folders containing shp file data. We want to transfer these across assets/ops.

j2bbayle commented 1 year ago

Same needs here. We are required to produce intermediary products to be archived in a given file format (HDF5). It would be great to have this feature -- it's a blocker currently!

sryza commented 1 year ago

@j2bbayle could you use a pattern like this? https://docs.dagster.io/guides/dagster/non-argument-deps#assets-without-io
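As a rough sketch of that pattern, in plain Python rather than Dagster code: the upstream step writes the file itself and returns nothing, and the downstream step reads from a location both agree on. `ARCHIVE_PATH`, `build_archive`, and `summarize_archive` are hypothetical names, and the raw bytes stand in for an HDF5 file. In Dagster, the ordering between the two would be declared as a non-argument dependency, as in the linked guide.

```python
import tempfile
from pathlib import Path

# Hypothetical shared location that both steps agree on.
ARCHIVE_PATH = Path(tempfile.mkdtemp()) / "intermediate.bin"


def build_archive() -> None:
    """Upstream step: writes the file itself and returns nothing."""
    ARCHIVE_PATH.write_bytes(b"\x00\x01\x02\x03")


def summarize_archive() -> int:
    """Downstream step: reads the agreed path directly and returns its size."""
    return len(ARCHIVE_PATH.read_bytes())


# Only the execution order matters; no value is passed between the steps.
build_archive()
size = summarize_archive()
```

Because nothing flows through the framework, the file format is entirely up to the steps, which suits archives that must be kept in a specific format.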