dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Add support for subpartitions #14228

Open ghost opened 1 year ago

ghost commented 1 year ago

What's the use case?

Our use case is a bit peculiar since we're not dealing with a machine learning pipeline in which there are many inputs (a huge data series) and few outputs (an inference model). We run a processing pipeline that manipulates GIS data and we want to track every file we work on as it progresses in our pipeline.

That said, we would like to have a better way to partition our data as described here: https://dagster.slack.com/archives/C01U954MEER/p1683807339107629

Not only that, we also want to have a good way to browse these partition hierarchies with filters, search and so on. This is hugely important for us since we're looking for visibility on how our assets progress in the pipeline.

Ideas of implementation

I'm not familiar with the code base, but I would propose something similar to MultiPartitions in which the generated partitions are based on two dependent axes rather than two independent ones. The second axis would depend on the first, meaning the result isn't a cross product but a subpartition.
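To make the distinction concrete, here is a minimal sketch in plain Python. It does not use Dagster's API, and the `PARENT_TO_CHILDREN` mapping, the `|` separator, and both function names are hypothetical, chosen only for illustration. A cross product pairs every value on the first axis with every value on the second, whereas a subpartition only enumerates the children that belong to each parent.

```python
from itertools import product

# Hypothetical example data: each parent partition owns its own set of children.
PARENT_TO_CHILDREN = {
    "region_a": ["tile_001.tif", "tile_002.tif"],
    "region_b": ["tile_101.tif"],
}

# Assumed separator between the parent and child parts of a key.
SEP = "|"


def cross_product_keys(mapping):
    """Keys for two independent axes: every parent paired with every child."""
    parents = list(mapping)
    children = sorted({child for kids in mapping.values() for child in kids})
    return [f"{parent}{SEP}{child}" for parent, child in product(parents, children)]


def subpartition_keys(mapping):
    """Keys for dependent axes: each parent paired only with its own children."""
    return [
        f"{parent}{SEP}{child}"
        for parent, kids in mapping.items()
        for child in kids
    ]


if __name__ == "__main__":
    print(cross_product_keys(PARENT_TO_CHILDREN))
    print(subpartition_keys(PARENT_TO_CHILDREN))
```

With this sample data the cross product contains combinations such as `region_b|tile_001.tif` that never exist on disk, while the subpartition listing contains only the keys that correspond to real files.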

smackesey commented 1 year ago

Seems unlikely to be implemented in the near future but this is a good thing to be tracking support for. cc @sryza @clairelin135

relativistic commented 1 year ago

I think this is similar to my request on slack, which I described as "hierarchical partitions" https://dagster.slack.com/archives/C01U5LFUZJS/p1682694979894159

We have a large number of partitions based upon files. Some way to organize them would be very helpful, so you don't have to scroll through such a huge list.

christeefy commented 7 months ago

I have a similar need for dependent/hierarchical partitions as described above. My need is to have historical lineage for machine learning models for different customers, which are trained at different schedules / ad-hoc.

More info available at https://github.com/dagster-io/dagster/discussions/20264.

clement-uspace commented 7 months ago

We have a similar use case here. Some of our assets would have daily partitions but also two or more parent "variables". We'd have a group of entities "A", a group of entities "B", and an asset corresponding to the interactions between the two each day, so partitions like ["A1/B1", "A1/B2", "A1/B3", "A2/B1", ...] for each day. This isn't limited to two groups; there might be more, so it would be nice to support an arbitrary number of variables.
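One way to picture keys with an arbitrary number of variables is to fold the day and every entity into a single string. The sketch below is plain Python, not Dagster's API; the function names `encode_key`, `decode_key`, and `keys_for_day`, and the choice to put the ISO date first in the key, are assumptions made for illustration.

```python
from datetime import date
from itertools import product

# Same separator as the "A1/B1" keys in the comment above.
SEP = "/"


def encode_key(day, *entities):
    """Join a day and any number of entity names into one partition key."""
    for entity in entities:
        if SEP in entity:
            raise ValueError(f"entity name {entity!r} must not contain {SEP!r}")
    return SEP.join([day.isoformat(), *entities])


def decode_key(key):
    """Split a key back into its day and its tuple of entity names."""
    day_part, *entities = key.split(SEP)
    return date.fromisoformat(day_part), tuple(entities)


def keys_for_day(day, *groups):
    """All keys for one day, given any number of entity groups."""
    return [encode_key(day, *combo) for combo in product(*groups)]


if __name__ == "__main__":
    for key in keys_for_day(date(2024, 1, 1), ["A1", "A2"], ["B1", "B2", "B3"]):
        print(key, decode_key(key))
```

Passing a third group to `keys_for_day` extends the keys to three variables without any other change, which is the property the request asks for. The cost of this encoding is that the structure lives only in the string, so filtering or browsing by one variable requires decoding every key.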

lrevest commented 6 months ago

We also have a very similar use case: an ETL-like scenario where functional workflows are organized at the filesystem level, with each folder (a static partition) holding an arbitrary number of files (dynamic partitions). Subpartitioning would bring clarity as well as greater code maintainability.

emma-campbell commented 6 months ago

Just dropping in to +1 this feature request, with an ETL-like structure similar to the scenarios described above.

the4thamigo-uk commented 6 months ago

+1 here, this would be great for multi-tenancy, having a tenant dimension, an account/customer tier and a time dimension.

tdlangland commented 5 months ago

Another +1 from the IoT space where each date (static) has data from an arbitrary number of sensors (dynamic)

DreamwareDevelopment commented 2 months ago

When this is under consideration, the people familiar with the codebase should also look into this as they are related

ravenac95 commented 1 month ago

This would be useful, if not necessary, to fully support sqlmesh, since each model inside a sqlmesh project can have its own separate partitions.

I'll likely have to come up with some other solution for this in order to enable sqlmesh and dagster right now. This is likely to produce a form of the sqlmesh assets that looks quite a bit different from what currently exists for dbt. However, it will give the most flexibility in how those things can be configured for sqlmesh assets.

For reference, I am developing a dagster-sqlmesh integration here: https://github.com/opensource-observer/dagster-sqlmesh