dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.09k stars 1.39k forks

Automatic pruning for `TimeWindowPartitionsDefinition` #23076

Open NewbiZ opened 1 month ago

NewbiZ commented 1 month ago

What's the use case?

I have an asset that stores daily data for a number of items. I have over a thousand items and multiple years of data. The asset is currently partitioned by day and by item (a `MultiPartitionsDefinition` combining a `DailyPartitionsDefinition` and a `DynamicPartitionsDefinition`), which leads to an enormous number of partitions (e.g. 1,000 items over just one year = 365,000 partitions).
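As a quick illustration of how the key count grows, here is a minimal sketch in plain Python. It does not use Dagster; the function name and the half-open date range are assumptions made for the example.

```python
from datetime import date


def multi_partition_count(start: date, end: date, n_items: int) -> int:
    """Count (day, item) keys for daily partitions over the range [start, end)."""
    n_days = (end - start).days  # one daily partition per calendar day
    return n_days * n_items      # every day is paired with every item


# One non-leap year of daily data for 1,000 items.
print(multi_partition_count(date(2023, 1, 1), date(2024, 1, 1), 1000))  # 365000
```

The count is multiplicative, so each extra year of history adds roughly another 365,000 keys at this item count.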

Clearly, this number of partitions does not work with Dagster: the UI becomes unresponsive as soon as I try to display asset lineage or even just list the existing partitions.

I don't really need all of these partitions to exist all the time, as long as the underlying files are not removed. Daily partitioning is useful because it allows backfilling, which is a valuable support operation, but on a business-as-usual basis I could keep, say, only the last 30 days and live with that.
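To make the "last 30 days" idea concrete, here is a minimal sketch of the set of daily keys such a rolling window would retain. It is plain Python, not a Dagster API: `rolling_daily_keys` is a hypothetical helper, and it assumes keys formatted as `%Y-%m-%d` with the current, incomplete day excluded.

```python
from datetime import date, timedelta


def rolling_daily_keys(today: date, window_days: int = 30) -> list[str]:
    """Return the keys of the last `window_days` completed days, oldest first."""
    if window_days < 0:
        raise ValueError("window_days must be non-negative")
    return [
        (today - timedelta(days=offset)).strftime("%Y-%m-%d")
        for offset in range(window_days, 0, -1)  # offsets 30, 29, ..., 1
    ]


keys = rolling_daily_keys(date(2024, 7, 31))
print(len(keys), keys[0], keys[-1])  # 30 2024-07-01 2024-07-30
```

With 1,000 items, such a window would keep the multi-partition key count at 30,000 instead of growing with the full history.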

This seems like a trivial use case, but I started using Dagster only a week ago, so I may have completely missed an obvious way to solve this.

Here is what I tried:

Ideas of implementation

I would propose either of the following solutions as a possible fix:

NewbiZ commented 1 month ago

Never mind, the `DynamicPartitionsDefinition` replacement turns out not to work at all.

First, because schedules on dynamic partitions cannot really take partition additions or removals into account.

Second, because dynamic partitions have no notion of a current partition, so dependencies from downstream assets are broken as well.

I am confused as to how partitions are supposed to be used.

NewbiZ commented 1 month ago

I'm coming to the conclusion that the easiest way to overcome the limit on the number of time window partitions is to emulate them with a config:

That should fix the issue of scheduling with dynamic partitions.
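One possible reading of the config-based approach, as a framework-independent sketch rather than the code from this thread: the day and item travel as run configuration instead of as a partition key. `DailyItemConfig`, `output_path`, and the file layout are all hypothetical names chosen for illustration. In Dagster, the dataclass would presumably be replaced by a `Config` subclass passed to the asset.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DailyItemConfig:
    """Hypothetical run config standing in for a (day, item) partition key."""

    day: str   # ISO date, e.g. "2024-07-01"
    item: str  # item identifier

    def parsed_day(self) -> date:
        # date.fromisoformat raises ValueError on malformed input,
        # so a bad config fails before any work is done.
        return date.fromisoformat(self.day)


def output_path(config: DailyItemConfig, root: str = "data") -> str:
    """Build the (assumed) file location for one day of one item."""
    day = config.parsed_day()
    return f"{root}/{config.item}/{day:%Y/%m/%d}.parquet"


print(output_path(DailyItemConfig(day="2024-07-01", item="sku-1")))
# data/sku-1/2024/07/01.parquet
```

The trade-off in this sketch is that the orchestrator no longer tracks per-day materialization status, so backfills become a loop over configs rather than a partition selection.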