dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

In steps that materialize multiple partitions, allow specifying metadata per partition #18123

Open EtienneT opened 10 months ago

EtienneT commented 10 months ago

What's the use case?

Let's say you have a partitioned asset with BackfillPolicy.single_run(), which means a single run of your asset can materialize multiple partitions at the same time. You return a dataframe, which is then split into its individual partitions, but there is no way to call context.add_output_metadata for a specific partition of your result. So if you need to add metadata per partition to your result, you can't.

I guess you could rely on a separate asset observation, but this just adds unnecessary overhead.
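As a rough illustration of the use case, here is a minimal sketch of one possible stopgap: compute the per-partition values yourself and pack them into a single dict keyed by partition key. This is not a Dagster feature and not a maintainer recommendation. The helper `per_partition_metadata`, the `partition_key` row field, and the `row_count` entry are all hypothetical names chosen for the example, and plain dicts stand in for dataframe rows.

```python
from collections import defaultdict


def per_partition_metadata(rows, key_field="partition_key"):
    """Summarize rows per partition key into one plain, JSON-serializable dict.

    `rows` is any iterable of dicts; `key_field` names the (hypothetical)
    column that holds the partition key of each row.
    """
    summary = defaultdict(lambda: {"row_count": 0})
    for row in rows:
        summary[row[key_field]]["row_count"] += 1
    return dict(summary)


# Stand-in for the rows of the dataframe returned by the asset.
rows = [
    {"partition_key": "2024-02-01", "value": 1},
    {"partition_key": "2024-02-01", "value": 2},
    {"partition_key": "2024-02-02", "value": 3},
]

metadata = {"per_partition": per_partition_metadata(rows)}

# Inside the asset body this dict could then be attached once for the run:
#     context.add_output_metadata(metadata)
# Assumption: the nested dict is accepted as a JSON-style metadata value.
```

The trade-off is that the whole dict is attached to the output rather than to one partition, so each partition's materialization would presumably carry the metadata of every other partition as well.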

Ideas of implementation

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

What we've heard

VirtueMe commented 7 months ago

I'll file this under additional information:

A question related to this is how to avoid repeating the metadata for each partition:

I'm not sure if I'm doing something wrong, but it seems that a single context.add_output_metadata call is repeated 14 * 8 times in the event list. Every event output is the same, only with a different timestamp. See the attached image. I guess it outputs the metadata once for each partition requested.

Partitions requested for materialization:

(13, DateTime(2024, 2, 1, 0, 0, 0, tzinfo=Timezone('UTC')), DateTime(2024, 2, 14, 0, 0, 0, tzinfo=Timezone('UTC')), ['tights.no', 'comfyballs.no', 'comfyballs.se', 'comfyballs.fi', 'comfyballs.com', 'awarenutrition.se', 'awarenutrition.fi', 'soma.no'], PartitionKeyRange(start='2024-02-01|tights.no', end='2024-02-13|soma.no'))
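To get a feel for how many partition keys such a request covers, the following sketch enumerates the keys between the start and end of the printed PartitionKeyRange. It assumes daily date partitions, an end key that is included in the range, and the `date|domain` key format visible in the output. It is plain Python for sanity-checking counts and says nothing about how Dagster itself expands the range.

```python
from datetime import date, timedelta
from itertools import product

# Domain values copied from the output above.
domains = [
    "tights.no", "comfyballs.no", "comfyballs.se", "comfyballs.fi",
    "comfyballs.com", "awarenutrition.se", "awarenutrition.fi", "soma.no",
]

# Start and end dates taken from the PartitionKeyRange; the end is treated
# as inclusive (assumption).
start, end = date(2024, 2, 1), date(2024, 2, 13)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# One key per (date, domain) pair, joined with "|" as in the printed range.
keys = [f"{day.isoformat()}|{domain}" for day, domain in product(days, domains)]

print(len(days), len(domains), len(keys))
print(keys[0], keys[-1])
```

Under these assumptions the range covers 13 dates and 8 domains, so one metadata event per requested partition would already produce on the order of a hundred near-identical events.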

Event output:

image