TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust
MIT License
3.3k stars 272 forks source link

A Question About Asynchrony #100

Open allain opened 7 years ago

allain commented 7 years ago

How are async operations handled?

If a message to a vertex represents a request for an HTTP Request, which causes another messages to be emitted with the returned payload, how would that fit into this model?

I may be misreading things but I don't see how an async operation and the processing of a message that doesn't output in messages can be told apart.

Are async operations exclusively intended to be "outside" the system?

(Admittedly I'm new to rust and may just be misunderstanding things).

frankmcsherry commented 7 years ago

Essentially all of the messaging in timely is async, message send and receive calls are all non-blocking, and only mean that in the future an instance of the downstream operator will receive what you've sent.

This means that all interfaces to timely computations need to be async as well, and in the case of an HTTP request you would be most likely to structure that as a request that you issue with some handle or connection information about where to return the result. As the request moves through the dataflow, it would keep this tag with it. The tail end of the dataflow would see a stream of assembled responses with these tags, and it would need to track down the connections on to which to foist the result. (edit: this is assuming you are writing a web service / server; if you were imagining a dataflow that makes http requests, that would probably look different).

I think the model of "web service as dataflow" has a lot of merit; my recollection is that @antiguru and @utaal may have looked at this, and I think the problem they found was that the Rust web frameworks weren't amenable to the adaptation in part because they like to be sneaky and secretive about their resources (e.g. with shared buffers managed cleverly). At the same time, the Soup folks at MIT are looking into dataflow for web databases, and I think they've had some luck with it.

frankmcsherry commented 7 years ago

I'm not sure I've nailed your question, but if you have an example "application" where you are trying to figure out whether timely would fit in, we could try and talk that through in more detail. I'm not 100% I understand what functionality you are trying to fit in.

allain commented 7 years ago

Thanks for replying.

I suppose the problem I try out new tech with is a website scraper. I have about 400 clients with around 50 pages each form which I've got to scrape content.

If I'm understanding timely correctly, messages are delivered asynchronously (there's no saying exactly when a particular message will be delivered), but that once delivered to a vertex in OnRecv that it's all synchronous until OnRecv is complete.

If that particular OnRecv code examines the message and decides that it can't do anything until it has the content from a URL, I'm not sure how that'd get handled.

frankmcsherry commented 7 years ago

Aha. Ok, so between dataflow stages there are no guarantees about synchrony (but some about order). Within the context of an operator, the execution is single-threaded, and the code that you write from input.next() through output.send() runs uninterrupted. It may be that your operator gets some inputs (e.g. urls) that lead to work (e.g. async calls to servers) that cannot be resolved immediately, which you are welcome to punt on for the moment (e.g. with a Future) and finish processing in some future invocation of the operator.

There are some details here; timely has "capabilities" that allow operators to reserve the right to send output messages in the future, and you'll want to stash them with any Future that hasn't been resolved. For external async io, this is probably the way it works out, and timely doesn't really help you here (in the way that perhaps Tokio would).

So, I think you could write a vertex that "pretends" to be some async service, where you send it messages describing requests, it starts up the work and sends responses when they complete. Timely wouldn't be doing anything especially helpful here, and I guess what would be neat here is that it could interop with the rest of a dataflow computation.

Does this help?

themoritz commented 1 day ago

I think this live coding video of @frankmcsherry also helps explain how something like this could be done.

I'm wondering, why did none of the operators from the video make it into the crate?

frankmcsherry commented 1 day ago

I think in principle they could (and some have been, in the past). The puzzle to solve is whether the timely repo is the right place to land collections of operators that layer on top of timely, or just the place to have the core operators you need to use timely (and leave the extensions for another crate). For example, differential dataflow lives off in its own crate.

The async stuff doesn't currently live here because it ended up being more demo-ware than highly recommended. At Materialize we use different operators, and it felt bad maintaining in the repo implementations that we don't use, nor entirely recommend (e.g. the "event protocol" diverged from what is used elsewhere in the system, and ends up single-threaded vs scale-out-able).

I think there's a legit point that it would be helpful to record these idioms somewhere, so that folks can pick them up if they want, even if we aren't actively using them. But also, there's a tax from each operator in the form of maintenance, and it feels good to prune the ones that aren't in use.

themoritz commented 1 day ago

Yeah, that makes sense. I'm already looking at the timely-util crate.

I've seen other projects have a {project}-extras/utils crate where these non-core extensions live. One thing that's definitely helpful I think would be to make the readme point to the utils crate in the Materialize repo. (And maybe extend that one with a couple of examples because the code is very abstract. 😄)