dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Enable/disable single-run/multiple-run materialisation on the GUI #14999

Open bmarcj opened 1 year ago

bmarcj commented 1 year ago

What's the use case?

Currently, any partitioned asset can be backfilled as a single run or as multiple runs. The latter has one partition per run, while the former combines multiple partitions into a single run. The former is useful when individual partitions are small, or when work is offloaded to something like Snowflake and splitting the computation by partition serves no purpose.

Whether or not this is supported depends on the exact details of the rest of the code: https://docs.dagster.io/concepts/partitions-schedules-sensors/backfills#single-run-backfills

If your code uses any of the partition_key, asset_partition_key, asset_partition_key_for_output, or asset_partition_key_for_input context methods or properties, you'll need to update it to use methods or properties like asset_partitions_time_window, asset_partition_key_range, asset_partition_keys, asset_partitions_time_window_for_output, asset_partition_key_range_for_output, asset_partitions_time_window_for_input, or asset_partition_key_range_for_input instead.

It's not clear what the consequences are if the code does use partition_key, asset_partition_key, asset_partition_key_for_output or asset_partition_key_for_input instead of the window/range equivalents. In the best case, presumably the run fails with an error. In the worst case, the job might silently succeed while producing unintended behaviour.

There is no way to disable this functionality, so users are left relying on what they can guess or remember about how individual assets are computed, loaded and persisted. Meanwhile, the developer has no easy way of communicating the intended or optimal choice.

The best workaround right now is for IO managers and @asset ops to explicitly check that the context only involves a single partition, which prevents single runs across multiple partitions.
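A minimal sketch of such a guard, assuming the context exposes a key range object with `start` and `end` attributes (as `asset_partition_key_range`, named in the docs quote above, is expected to). `KeyRange` and `ensure_single_partition` are hypothetical names used only for illustration and are not part of Dagster:

```python
from collections import namedtuple

# Hypothetical stand-in for the key range object returned by a Dagster context;
# the only assumption is that it has `start` and `end` partition keys.
KeyRange = namedtuple("KeyRange", ["start", "end"])


def ensure_single_partition(key_range):
    """Return the partition key, or raise if the range spans several partitions."""
    if key_range.start != key_range.end:
        raise ValueError(
            "This asset only supports one partition per run, "
            f"got range {key_range.start!r}..{key_range.end!r}"
        )
    return key_range.start


# Inside an @asset body or an IO manager this might be called as
# ensure_single_partition(context.asset_partition_key_range)  (assumption).
partition = ensure_single_partition(KeyRange("2023-06-01", "2023-06-01"))
```

This turns the "silently succeeds" case into an explicit failure, but it only helps at run time: the GUI still offers the single-run option.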

It would be a good idea if developers could include this information so that the GUI (or CLI) can forbid or allow only particular ways of backfilling.

Ideas of implementation

Add an enumeration to the @asset decorator that lets developers explicitly allow or disable single-run or multi-run backfills, which the GUI can then present to users.
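As a rough illustration of the idea, the enumeration could look something like the sketch below. Everything here is hypothetical: `BackfillMode` and `launch_options` are invented names, not an existing Dagster API, and how the value would be attached to @asset is left open.

```python
from enum import Enum


class BackfillMode(Enum):
    """Hypothetical setting a developer could declare for an asset."""

    SINGLE_RUN_ONLY = "single_run_only"
    MULTI_RUN_ONLY = "multi_run_only"
    EITHER = "either"


def launch_options(mode):
    """Return the backfill choices a GUI or CLI would offer for this mode."""
    if mode is BackfillMode.SINGLE_RUN_ONLY:
        return ["single_run"]
    if mode is BackfillMode.MULTI_RUN_ONLY:
        return ["multi_run"]
    # EITHER: the user keeps today's free choice.
    return ["single_run", "multi_run"]
```

With something like this, an asset whose code relies on partition_key could declare `MULTI_RUN_ONLY`, and the launch dialog would simply not offer the single-run option.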

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

smackesey commented 1 year ago

cc @bengotow @clairelin135

sryza commented 1 year ago

There's a proposal here that would address this: https://github.com/dagster-io/dagster/discussions/14829.

I believe @ruizh22 is planning to work on this.

bmarcj commented 1 year ago

That looks like a good proposal. It would be great to have a cleaner separation between partitioning of the data (a data modelling decision) and computation across the data (driven by practical considerations like memory, CPU, cost...). Currently the two are not separated, which is why the single/multi-run choice is an issue at all.