TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust
MIT License

how to avoid influence between dataflows #279

Open peakerli opened 5 years ago

peakerli commented 5 years ago

Hi @frankmcsherry, I have set up timely as a long-running service to serve queries. The problem is that in timely all dataflows share the same computation resources, so a simple dataflow can take a long time to execute when a big dataflow is running at the same time. Do you have any ideas for addressing this problem?

frankmcsherry commented 5 years ago

Hello.

This is by design, so that dataflows can share data assets without requiring locks and such. Within a worker thread, operators are scheduled as cooperative fibers.

Differential dataflow has its expensive operators (e.g. join) yield control regularly (after 1M output records, but there could be better rules). In the future, things like async/await could make this easier, but at the moment it is up to you.
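As a rough illustration of the yielding idea, here is a plain-Rust sketch that does not use timely's or differential's actual API. The names `YieldingOperator`, `schedule`, `budget`, and `run_round_robin` are hypothetical, and the doubling step is only placeholder work. Each operator does a bounded slice of work per activation and reports whether it wants to be scheduled again, so a small job is not stuck behind a large one.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an expensive operator: it holds queued input
/// and processes at most `budget` records each time it is scheduled.
struct YieldingOperator {
    pending: VecDeque<u64>,
    output: Vec<u64>,
    budget: usize,
}

impl YieldingOperator {
    fn new(input: impl IntoIterator<Item = u64>, budget: usize) -> Self {
        Self {
            pending: input.into_iter().collect(),
            output: Vec::new(),
            budget,
        }
    }

    /// Do one bounded slice of work, then hand control back to the caller.
    /// Returns `true` if input remains and the operator should run again.
    fn schedule(&mut self) -> bool {
        for _ in 0..self.budget {
            match self.pending.pop_front() {
                Some(x) => self.output.push(x * 2), // placeholder work
                None => break,
            }
        }
        !self.pending.is_empty()
    }
}

/// Cooperative round-robin loop: every operator gets one slice per pass.
/// Returns the number of passes needed to drain all operators.
fn run_round_robin(ops: &mut [YieldingOperator]) -> usize {
    let mut passes = 0;
    loop {
        let mut any_left = false;
        for op in ops.iter_mut() {
            // `|=` does not short-circuit, so every operator is scheduled.
            any_left |= op.schedule();
        }
        passes += 1;
        if !any_left {
            break;
        }
    }
    passes
}
```

With a budget of 3, an operator holding 2 records finishes in the first pass even if its neighbour holds 10. The right budget (record count, elapsed time, or something else) depends on the workload.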

Other options include having compute intensive threads deposit results in a shared cache, so that other decoupled threads can read from it without awaiting the producing thread. Or, if the tasks are unrelated, you can spin up separate timely clusters for each independent computation.
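A minimal sketch of the shared-cache option, using only the standard library. The `SharedCache` type, the `spawn_producer` and `lookup` helpers, and the `expensive` placeholder are all assumptions made for illustration, not part of timely. The point is that a reader gets `None` immediately if the result is not there yet, instead of waiting on the producing thread.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical cache type: query key -> computed result.
type SharedCache = Arc<RwLock<HashMap<String, u64>>>;

/// Placeholder for a compute-intensive task.
fn expensive(n: u64) -> u64 {
    (1..=n).sum()
}

/// Producer: runs the heavy computation on its own thread and deposits
/// the result in the cache when it is done.
fn spawn_producer(cache: SharedCache, key: &str, n: u64) -> thread::JoinHandle<()> {
    let key = key.to_string();
    thread::spawn(move || {
        // Compute first, so the write lock is held only for the insert.
        let value = expensive(n);
        cache.write().unwrap().insert(key, value);
    })
}

/// Reader: returns immediately with `None` if the result is not yet cached.
fn lookup(cache: &SharedCache, key: &str) -> Option<u64> {
    cache.read().unwrap().get(key).copied()
}
```

The write lock is taken only after the computation finishes, so readers are blocked for the duration of a map insert rather than for the duration of the heavy work.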

Hope this helps!

bmmcq commented 5 years ago

In our application we may run thousands of queries at the same time, and each query is compiled into a timely dataflow job. Queries usually take less than 1s, but we cannot know in advance which query will take much longer. So we wonder whether there is a better way to run many timely-dataflow jobs concurrently in one timely environment (e.g. sharing the same network connections). For example, it might help if we could run different timely-dataflow jobs in different worker threads without having to establish TCP connections for each of them.

frankmcsherry commented 5 years ago

A few things come to mind, but each of them has its trade-offs.

  1. You could look into the potentially expensive computations, and ensure that their operators are written to yield control if they take a long time. Timely is meant to support multiple concurrent dataflows, but this only works if operators release control often enough.

  2. You could write a timely dataflow operator which interacts with a task pool. This would allow compute heavy work to happen off of the timely worker threads, but this is only helpful if you can move the necessary state to another thread. It would probably require Arc wrappers around shared state, for example.
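The second option could look roughly like the following plain-Rust sketch. It is not written against timely's operator API: `OffloadingOperator`, `submit`, and `poll` are hypothetical names, a single helper thread stands in for a real task pool, and the multiply-and-sum is placeholder work. The state the helper needs is shared through an `Arc`, and the operator side only ever polls without blocking, so it can return control to the worker promptly.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Arc;
use std::thread;

/// Hypothetical operator that hands heavy work to a helper thread and
/// collects finished results without ever blocking.
struct OffloadingOperator {
    jobs: Sender<u64>,
    results: Receiver<(u64, u64)>,
}

impl OffloadingOperator {
    /// `table` is read-only state shared with the helper thread via `Arc`.
    fn new(table: Arc<Vec<u64>>) -> Self {
        let (jobs, job_rx) = channel::<u64>();
        let (res_tx, results) = channel::<(u64, u64)>();
        thread::spawn(move || {
            // The loop ends when the operator (and its `Sender`) is dropped.
            for key in job_rx {
                let value = table.iter().map(|x| x * key).sum::<u64>(); // placeholder work
                if res_tx.send((key, value)).is_err() {
                    break;
                }
            }
        });
        Self { jobs, results }
    }

    /// Queue a job for the helper thread; returns immediately.
    fn submit(&self, key: u64) {
        self.jobs.send(key).expect("helper thread exited");
    }

    /// Non-blocking poll: drain whatever results are ready right now.
    fn poll(&self) -> Vec<(u64, u64)> {
        let mut done = Vec::new();
        while let Ok(result) = self.results.try_recv() {
            done.push(result);
        }
        done
    }
}
```

In a real operator, `poll` would be called on each activation and any ready results forwarded downstream. How to hold back progress until outstanding jobs finish is not covered by this sketch.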

It's a bit tricky to know without a clearer understanding of the workload. The timely code is pretty close to allowing a pluggable scheduler (it is now hard to mis-schedule operators, although it is not yet easy to schedule them yourself).

If the dataflows are unrelated, and you just want to enact better resource sharing, this is a scheduling question (ensuring operators yield would be an important first step, but then more balanced scheduling would be a further improvement). If one dataflow is taking multiple timestamped requests and you would like to re-order them, this might be harder to do.

What is the system doing when you would prefer that it be working on short jobs?