dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.69k stars 1.48k forks source link

Adding run tags according to the selected assets #8979

Open jburnich opened 2 years ago

jburnich commented 2 years ago

Hi Dagster ! Is it possible to add tags to a job only if an asset is selected when the job is run?

I have for example a partitioned job with 2 assets (A -> B). The executions of asset A cannot be done in parallel due to requests on an API, while the executions of asset B can be done in parallel.

So I would like to add a tag so that as soon as asset A is selected, the executions run in series. And as soon as only asset B is selected, the executions are done in parallel. If A and B are selected, the executions are done in series. I then use this tag in the QueuedRunCoordinator.

There is the possibility to add the tag manually during the backfill but I would like this step to be done automatically according to the selected assets.

Relevant requests:

sryza commented 2 years ago

We could potentially add a run_tags attribute to AssetsDefinitions. Something we'd need to deal with though: what if two different assets provide different values for the same tag, and those assets are included in the same run?

jburnich commented 2 years ago

Thanks @sryza ! Maybe do a check on the tags when defining the job. So if several assets have identical keys with different values, an error is returned. I don't know if this is feasible in practice.

sryza commented 1 year ago

Feedback from Philippe Laflamme: https://dagster.slack.com/archives/C01U954MEER/p1673529429283949


I have an @asset that queries an endpoint when materializing. This endpoint has a concurrency limit of n, which means I need to only materialize n assets concurrently. I’ve setup the QueuedRunCoordinator with tag_concurrency_limits, say http/limit: n which limits to n. How do I apply this tag to my @asset? I’ve tried using op_tags but that doesn’t seem to apply it when materializing; and for backfilling, I’ve found that:

C0DK commented 1 year ago

I just want to give my 5 cents.

This is also an issue for me, for the same reason as Phillippe points out. Even if there is a better fix than throwing an error when tags collide, one could start by implementing this with the error, and later fixing the edge case.

Tags seem super neat, and it's a shame we cannot utilize them more in software-defined assets! And I was in general expecting jobs simply to inherit all tags from ops/graphs below. Maybe this should, for the time being, be noted in the documentation, that these two types of tags are not strictly related

sryza commented 1 year ago

A data point here from a user we spoke to about this:

johannkm commented 1 year ago

Another user wants to use tags on runs of different assets to send alerts to the responsible team when auto materialize runs fail

pablo-statsig commented 1 year ago
  • If a run includes multiple assets with different resource requirements, they want to use the maximum

One thing that would be problematic with this is that if you have an asset that requires 5G and another that requires 3G choosing 5G wont necessarily be enough. I would love to have a way to provide a function that is able to merge tags or maybe some way of defining a merge strategy per tag (min, max, sum, any)

johannkm commented 1 year ago

I'll call out that if you're looking to use tags specifically for concurrency limits, you can check out op and asset concurrency limits which are available today.

chrishiste commented 5 months ago

+1 I would like to set tags={"dagster/max_retries": 3, "dagster/retry_strategy": "ALL_STEPS"} on an asset_graph.