dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Concurrency Limits as a Resource Property #19173

Open Daniel-Vetter-Coverwhale opened 10 months ago

Daniel-Vetter-Coverwhale commented 10 months ago

What's the use case?

I think it would be really awesome to be able to specify a concurrency limit as a property of a resource.

For example, I could say: this is the resource for my connection to the production database, and at any time I don't want more than, say, 10 ops using it. This can sort of be worked out with the resource lifecycle hooks, but I think that would result in a spurious run failure, so it would be better if this information were available and used further upstream: by the coordinator, the run launcher, the auto-materialize daemon, or some other piece that I'm not thinking of.
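To illustrate the lifecycle-hook workaround and why it tends to surface as a failure rather than a delay, here is a minimal sketch using only the Python standard library. `LimitedResource`, `ResourceBusyError`, and `connect` are hypothetical names, not Dagster APIs, and the semaphore only coordinates callers inside a single process; limiting across separate runs would need shared state that this sketch does not attempt.

```python
import threading
from contextlib import contextmanager


class ResourceBusyError(RuntimeError):
    """Raised when the resource is already at its concurrency limit."""


class LimitedResource:
    """Hypothetical resource wrapper (not a Dagster class)."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def connect(self):
        # Setup hook: try to claim a slot without waiting.
        if not self._slots.acquire(blocking=False):
            # Nothing upstream knows about the limit, so the only
            # option at this point is to fail the caller.
            raise ResourceBusyError("resource is at its concurrency limit")
        try:
            yield self
        finally:
            # Teardown hook: always give the slot back.
            self._slots.release()


prod_db = LimitedResource(max_concurrent=2)

# Two callers hold the resource; a third is rejected, not queued.
with prod_db.connect(), prod_db.connect():
    try:
        with prod_db.connect():
            third_admitted = True
    except ResourceBusyError:
        third_admitted = False

# Once the first two callers exit, a slot is free again.
with prod_db.connect():
    admitted_after_release = True
```

Because the limit is only enforced when the resource is set up, the third caller has already been launched by the time it is rejected. That is the spurious failure described above, and it is the reason for wanting the limit visible to whatever schedules the work.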

To that end, having the information available on the resource itself seems really useful. It also allows for a separation of concerns between authors: whoever is writing the asset doesn't have to worry about overloading the resource, and the author of the resource doesn't have to worry about that resource being overloaded.

I think we would want to expose this information in the UI: both the limits themselves on the resources, and also the utilization and utilization attempts. For scheduled work that uses the resource, Dagster could even proactively show that the resource is likely to be overloaded at a given time, which could lead to staggering the schedules manually, accepting that one of the ops/jobs will be delayed, or scaling the resource and increasing the limits. Reactively, I think it would definitely be important to show when a job was delayed due to resourcing.

I thought of this recently while trying to set concurrency limits via op tags on some generated Sling assets. The op tags don't appear to be populating on the job that auto-materializes the assets, and I'm not sure whether that's because op_tags in general don't get added to job config, because I'm missing where I should be looking for op_tags on assets, or something else.

Ideas of implementation

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

cbini commented 2 months ago

This would be useful at a resource level for sensors/schedules as well. One of our vendors throttles the frequency of connections from a single IP. Their SFTP server has one hostname, but we have multiple instances of the product with their own respective logins, so it's difficult to coordinate which login is allowed to access the server at a given time.

aleexharris commented 2 weeks ago

Just wanted to jump in here and say that I am also tackling similar issues with Dagster.

I have to deal with many external database and API connections. Each database/API has its own set of rate limits, which I can manage within a single run, but across many different ops/assets/runs this becomes incredibly difficult to manage without running a separate middleware server for external API calls.

As an MVP, a great simple addition would be a "one at a time" flag, which could be used to signify when a resource can only be used by one run at a time.

I have been trying out the dagster/concurrency_key op_tag to limit concurrency on assets that use the same external resources. Whilst this is a big step in the right direction, I find that it often gets into "pseudo deadlocks" when you limit concurrency too aggressively. For example, when you set a concurrency key's limit to 1 and then launch 50 runs against that key, you often find that it gets stuck at some point halfway through a run and refuses to continue.

smonv commented 1 week ago

@Daniel-Vetter-Coverwhale thanks for raising this issue. Can you share more about how to limit the number of running ops via resource lifecycle hooks?