dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
10.86k stars, 1.36k forks

Filtering (and grouping) assets on op/job tags #12682

Open GerbenvdHuizen opened 1 year ago

GerbenvdHuizen commented 1 year ago

What's the use case?

In our use case, tags categorize the process, method, or framework used by or for ops and jobs. For example, if a task extracts data from Postgres, we would put a "postgres" tag on the op or job responsible for extracting that data. Or, if a job is responsible for both extracting and loading data, we would tag it with "ETL".

We propose the ability to filter and/or group assets based on op/job tags, so we can create a view of assets associated with a certain tag.

Ideas of implementation

The ops/jobs responsible for materialising the asset would carry the tags, but this feature should make it possible to see which tags are related to an asset in Dagit's lineage and key-prefix overviews.
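To make the requested behaviour concrete, here is a minimal sketch in plain Python of filtering and grouping assets by tag. It does not use any Dagster API: the `ASSET_TAGS` mapping, the asset names, and the helpers `filter_assets` and `group_assets` are all hypothetical and only illustrate the kind of view being asked for.

```python
from collections import defaultdict

# Hypothetical data: asset key -> tags of the op/job that materialises it.
ASSET_TAGS = {
    "raw_orders": {"namespace": "postgres-extractor", "process": "ETL"},
    "raw_customers": {"namespace": "postgres-extractor", "process": "ETL"},
    "churn_model": {"namespace": "ml", "process": "model"},
}


def filter_assets(asset_tags, key, value):
    """Return the sorted asset keys whose tags contain key == value."""
    return sorted(
        asset for asset, tags in asset_tags.items() if tags.get(key) == value
    )


def group_assets(asset_tags, key):
    """Group asset keys by the value of one tag key; assets without it are skipped."""
    groups = defaultdict(list)
    for asset, tags in asset_tags.items():
        if key in tags:
            groups[tags[key]].append(asset)
    return {value: sorted(assets) for value, assets in groups.items()}
```

For instance, `filter_assets(ASSET_TAGS, "process", "ETL")` would return the two extracted tables, and `group_assets(ASSET_TAGS, "namespace")` would give one group per namespace.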

Extra nice to have feature:

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

sryza commented 1 year ago

This request makes a lot of sense to me. Forwarding to our UI folks: @bengotow @braunjj, although it possibly requires some backend work as well.

@GerbenvdHuizen if you have a graph-backed asset with multiple ops, might different ops inside it have different values for the same tag key?

GerbenvdHuizen commented 1 year ago

@sryza For our use case, we would tag the ops that materialize an asset with the namespace they run in, e.g. {"namespace": "postgres-extractor"}, or with the process they use, e.g. {"process": "ETL"} or {"process": "model"}. In these cases, a given tag key would always have the same value across all ops of an asset. However, there is probably a use case where it would indeed be nice to use the same tag key with different values for ops associated with the same asset, for example with tag keys like data_source or dependency, although we don't have a use case for that right now.
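As a rough illustration of the multi-value case, the sketch below merges the tags of all ops behind one asset into a set of values per tag key, so a key like data_source can carry several values. This is plain Python, not Dagster code; `merge_op_tags` and the example tag values are hypothetical.

```python
def merge_op_tags(op_tags):
    """Union the tags of all ops behind one asset: tag key -> set of values."""
    merged = {}
    for tags in op_tags:
        for key, value in tags.items():
            merged.setdefault(key, set()).add(value)
    return merged


# Hypothetical tags of two ops inside one graph-backed asset.
ops = [
    {"namespace": "postgres-extractor", "data_source": "orders_db"},
    {"namespace": "postgres-extractor", "data_source": "billing_db"},
]

merged = merge_op_tags(ops)

# Tag keys for which the ops disagree on the value.
multi_valued = sorted(key for key, values in merged.items() if len(values) > 1)
```

With a representation like this, an asset would match a filter on a tag key if any of its ops carries the requested value, which covers both the single-value and the multi-value case.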