Open Daniel-Vetter-Coverwhale opened 10 months ago
This would be useful at a resource level for sensors/schedules as well. One of our vendors throttles the frequency of connections from a single IP. Their SFTP server has one hostname, but we run multiple instances of the product, each with its own login, so it's difficult to coordinate which login is allowed to access the server at a given time.
Just wanted to jump in here and say that I am also tackling similar issues with Dagster.
I have to deal with many external database and API connections, each with its own set of rate limits. Within a single run these are manageable, but across many different ops/assets/runs they become incredibly difficult to coordinate without running a separate middleware server to broker the external API calls.
As an MVP, a simple but valuable addition would be a "one at a time" flag to signify that a resource can only be used by one run at a time.
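In the meantime, the "one at a time" behavior can be approximated inside a single process with a semaphore wrapper around the client. This is only a sketch (the `ConcurrencyLimitedClient` name and `connect` factory are made up), and it coordinates nothing across separate run processes, which is exactly why instance-level support would help:

```python
import threading
from contextlib import contextmanager


class ConcurrencyLimitedClient:
    """Process-local sketch: at most `limit` concurrent users of a shared client.

    Dagster runs usually execute in separate processes, so this only protects
    threads within one process; a real fix needs instance-level bookkeeping.
    """

    def __init__(self, connect, limit=1):
        self._connect = connect  # factory for the underlying connection
        self._sem = threading.BoundedSemaphore(limit)

    @contextmanager
    def acquire(self, timeout=30.0):
        # Block until a slot frees up, or fail loudly instead of deadlocking.
        if not self._sem.acquire(timeout=timeout):
            raise TimeoutError("resource is at its concurrency limit")
        try:
            yield self._connect()
        finally:
            self._sem.release()


# usage: "one at a time" == limit=1
client = ConcurrencyLimitedClient(connect=lambda: "fake-sftp-conn", limit=1)
with client.acquire() as conn:
    print(conn)  # only one caller can be inside this block at a time
```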
I have been trying out the `dagster/concurrency_key` op tag for assets to limit concurrency on assets that share the same external resources. While this is a big step in the right direction, I find that it often gets into "pseudo deadlocks" when you limit concurrency too aggressively. For example, if you set a concurrency key's limit to 1 and then launch 50 runs against that key, it often gets stuck partway through a run and refuses to continue.
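For anyone hitting the same wall, a related knob is run-level tag concurrency on the queued run coordinator, which queues whole runs rather than individual ops. A hedged sketch of the `dagster.yaml` entry (the tag key/value here are illustrative; verify the exact schema against your Dagster version's docs):

```yaml
# dagster.yaml -- illustrative sketch, check your Dagster version's docs
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    tag_concurrency_limits:
      - key: "vendor"          # illustrative tag key
        value: "sftp-vendor"   # illustrative tag value
        limit: 1               # at most one in-flight run carrying this tag
```

This avoids the op-level queueing path entirely, at the cost of serializing whole runs.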
@Daniel-Vetter-Coverwhale thanks for raising this issue. Can you share more about how you would limit the number of ops run via resource lifecycle hooks?
What's the use case?
I think it would be really awesome to be able to specify a concurrency limit as a property of a resource.
For example, to say: this is the resource for my connection to the production database, and at any time I don't want more than, say, 10 ops using it. This can sort of be worked around with the resource lifecycle hooks, but I think that would result in spurious run failures, so it would be better if this information were available and used further upstream, whether by the run coordinator, the run launcher, the auto-materialize daemon, or some other piece I'm not thinking of.
To that end, having the information available on the resource itself seems really useful. It also allows for a separation of concerns between authors: whoever writes the asset doesn't have to worry about overloading the resource, and the author of the resource doesn't have to worry about the resource being overloaded.
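To make the proposal concrete, here is pseudocode for what a resource-level limit might look like. The `concurrency_limit` parameter does not exist in Dagster today; it is purely illustrative:

```python
# Pseudocode -- `concurrency_limit` is a hypothetical parameter, not a real Dagster API
from dagster import ConfigurableResource, asset

class ProdDatabase(ConfigurableResource):
    conn_string: str
    concurrency_limit: int = 10  # hypothetical: enforced upstream by the coordinator/daemon

@asset
def heavy_query(prod_db: ProdDatabase):
    ...  # the asset author never declares limits; the resource carries them
```

The point is that the limit lives on the resource definition, so every asset, schedule, or sensor using it inherits the constraint without repeating a tag.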
I think we would want to expose this information in the UI, both the limits themselves on the resources, and also the utilization and utilization attempts. For things that are scheduled that use the resource, Dagster could even proactively show that the resource is likely to be overloaded at the given time, which could lead to staggering the schedules manually, accepting that one of the ops/jobs will be delayed, or scaling the resource and increasing the limits. Reactively I think it would definitely be important to show when a job was delayed due to resourcing.
I thought of this recently while trying to set concurrency limits via op tags on some generated Sling assets. The op tags don't appear to be populated on the job that auto-materializes the assets, and I'm not sure whether that's because op_tags in general don't get added to job config, or I'm just missing where to look for op_tags on assets, or something else.
Ideas of implementation
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.