yuhan opened this issue 3 years ago
Lighter-weight ops sound appealing from the context of this Slack discussion. I have a batch job that was generating 10-20k dynamic outputs, which are pretty lightweight, and executing ~30 concurrently. Right now I'm executing on the multiprocessing backend, but eventually planning for k8s as a way to scale beyond one machine's capacity. Dagster was getting bogged down with that, but switching to chunking (~30 ops, each of which takes a list of inputs and runs a for loop) is a workaround for now; however, I lose some efficiency due to unpredictable batch execution durations. I'm not exactly sure what a multi-threaded / async executor would mean, but if it allowed lighter-weight op overhead, or some lightweight parallelism (async) within more heavyweight ops (k8s), that could be great!
Having an async and/or threaded executor would be really helpful for some of our use cases. We need to download files from external FTP / HTTP servers, and would like ~100 (or more) concurrent ops executing.
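One workaround for this today is to do the fan-out inside a single async op with `asyncio.gather`, using a semaphore to cap concurrency. A minimal stdlib-only sketch, where `fetch` is a made-up stand-in for a real download call (e.g. an aiohttp or aioftp request):

```python
import asyncio

# Hypothetical stand-in for a real download (replace with an actual client call).
async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    async with sem:                    # cap how many downloads run at once
        await asyncio.sleep(0.01)      # simulate network latency
        return f"contents of {url}"

async def download_all(urls, max_concurrency: int = 100):
    sem = asyncio.Semaphore(max_concurrency)
    # Schedule every download; gather preserves input order.
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

results = asyncio.run(download_all([f"ftp://host/file{i}" for i in range(250)]))
```

This gets ~100-way concurrency inside one op, at the cost of Dagster not seeing the individual downloads as separate steps.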
This is also extremely helpful when you want to share resources with all ops inside a given run. Think of establishing connections with external systems to fetch resources: these can have substantial overhead, so it would be a great help to be able to share those connections or resources across all ops inside a given run.
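To illustrate the overhead argument: with an in-process (or hypothetical multi-threaded) executor, a connection can be created once and reused by every op, whereas per-process execution pays the setup cost in each subprocess. A stdlib-only sketch, where `ExpensiveConnection` is a made-up stand-in for a client with a slow handshake:

```python
import functools

class ExpensiveConnection:
    instances = 0  # count how many times the costly setup actually ran

    def __init__(self):
        ExpensiveConnection.instances += 1  # imagine a slow handshake here

    def fetch(self, key: str) -> str:
        return f"value:{key}"

@functools.lru_cache(maxsize=1)
def get_connection() -> ExpensiveConnection:
    # Shared within the process: every op that calls this gets the same instance.
    return ExpensiveConnection()

# Three "ops" running in the same process reuse a single connection.
results = [get_connection().fetch(k) for k in ("a", "b", "c")]
```

With a multiprocess executor, each op would run in a fresh process and the cached connection would be rebuilt every time.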
We've been using Dagster in production for a while now, and we find the lack of a multi-threaded executor very painful on a regular basis. The multi-process executor can result in really long cold-start delays between steps, as well as large memory consumption when the max concurrency is high. The in-process executor, on the other hand, essentially prevents concurrent orchestration. It is really surprising to me that this has never been higher on the roadmap of the Dagster team.
Just realized I can run async code inside a `multi_asset`. This is pretty useful. Might be worthwhile to add to the docs?
```python
from dagster import (
    AssetExecutionContext,
    AssetOut,
    Definitions,
    asset,
    define_asset_job,
    in_process_executor,
    multi_asset,
)
import asyncio


async def slowreturn(var: str) -> str:
    await asyncio.sleep(5)
    return var


@multi_asset(
    outs={
        "x": AssetOut(group_name="g1"),
        "y": AssetOut(group_name="g1"),
    }
)
async def x_and_y():
    # Both sleeps run concurrently, so this takes ~5s instead of ~10s.
    x1 = asyncio.create_task(slowreturn("something"))
    y1 = asyncio.create_task(slowreturn("else"))
    x2 = await x1
    y2 = await y1
    return x2, y2


@asset
async def x_plus_y(context: AssetExecutionContext, x, y):
    out = x + " " + y
    context.log.info(out)


r = define_asset_job("async_test", "*")

defs = Definitions(
    assets=[x_and_y, x_plus_y],
    jobs=[r],
    executor=in_process_executor,
)
```
Would anyone be interested in collaborating on this?
This issue contains a custom multi-threaded executor that someone created a while ago: https://github.com/dagster-io/dagster/issues/3177
Curious if this has any fundamental issues or could still be utilized?
I actually tried to adapt this briefly. It wasn't super straightforward, for two reasons:
Definitely possible, I would imagine, but beware: it's not plug-and-play.
The multiprocess executor spins off... processes, each of which, AFAICT, always loads the entire project definition, hence the issues mentioned above. It doesn't seem to me that anything based on the multiprocess executor will help.
I just asked about support for a multi-threaded in-process executor in the Dagster Slack. Unfortunately, it isn't something on the Dagster roadmap at the moment.
https://dagster.slack.com/archives/C01U5LFUZJS/p1719904935167549
Also, I just had a brief look through the in-process executor codebase, but it's hard to figure out how difficult this would be to implement, especially identifying all the thread-unsafe parts.
Originated from https://github.com/dagster-io/dagster/issues/2268
Execute multiple ops simultaneously in the same process.
Prerequisites:
`ComputeLogManager` needs to be changed to operate in terms of process scope instead of op/step scope.

Message from the maintainers:
Excited about this feature? Give it a :thumbsup:. We factor engagement into prioritization.
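For anyone exploring what "multiple ops simultaneously in the same process" could look like, the core scheduling problem is a topological walk that submits any step whose upstream steps have finished to a thread pool. A generic, Dagster-free sketch using only `concurrent.futures` (the step names and `deps` mapping are made up for illustration; a real executor would also need the thread-safety work noted above):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_dag(steps, deps, max_concurrent=4):
    """steps: name -> zero-arg callable; deps: name -> set of upstream names."""
    done, running, results = set(), {}, {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        while len(done) < len(steps):
            # Submit every step whose dependencies have all finished.
            for name, fn in steps.items():
                if name not in done and name not in running and deps[name] <= done:
                    running[name] = pool.submit(fn)
            # Block until at least one in-flight step finishes.
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for name in [n for n, f in running.items() if f in finished]:
                results[name] = running.pop(name).result()
                done.add(name)
    return results

# Diamond-shaped toy graph: a -> (b, c) -> d; b and c can run concurrently.
out = run_dag(
    steps={"a": lambda: 1, "b": lambda: 2, "c": lambda: 3, "d": lambda: 4},
    deps={"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}},
)
```

Because everything stays in one process, there is no per-step cold start and resources can be shared, which is exactly the trade-off discussed in this thread.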