TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust
MIT License

A Question About Asynchrony #100

Open allain opened 6 years ago

allain commented 6 years ago

How are async operations handled?

If a message to a vertex represents an HTTP request, which causes another message to be emitted with the returned payload, how would that fit into this model?

I may be misreading things, but I don't see how an async operation can be told apart from the processing of a message that simply produces no output messages.

Are async operations exclusively intended to be "outside" the system?

(Admittedly I'm new to Rust and may just be misunderstanding things).

frankmcsherry commented 6 years ago

Essentially all of the messaging in timely is async: message send and receive calls are all non-blocking, and a send only means that at some point in the future an instance of the downstream operator will receive what you've sent.
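As a rough analogy for "non-blocking send and receive", here is a minimal sketch using only the standard library's `std::sync::mpsc` channel. It is not timely's API; the helper `drain_available` is a hypothetical name introduced for illustration. The sender returns immediately, and the receiver takes whatever happens to be available without waiting.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};

/// Hypothetical helper: collect whatever is available right now, never blocking.
fn drain_available(rx: &Receiver<u32>) -> Vec<u32> {
    let mut out = Vec::new();
    loop {
        match rx.try_recv() {
            Ok(value) => out.push(value),
            // Nothing more to read at the moment (or the sender is gone): stop.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    out
}

fn main() {
    let (tx, rx) = channel();

    // Receiving before anything was sent does not block; it just finds nothing.
    let early = drain_available(&rx);

    // Sending does not wait for a receiver to be ready.
    tx.send(1).unwrap();
    tx.send(2).unwrap();

    // A later receive call picks up what was sent in the meantime.
    let later = drain_available(&rx);
    println!("early = {:?}, later = {:?}", early, later);
}
```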

This means that all interfaces to timely computations need to be async as well, and in the case of an HTTP request you would be most likely to structure that as a request that you issue with some handle or connection information about where to return the result. As the request moves through the dataflow, it would keep this tag with it. The tail end of the dataflow would see a stream of assembled responses with these tags, and it would need to track down the connections onto which to foist the result. (edit: this is assuming you are writing a web service / server; if you were imagining a dataflow that makes HTTP requests, that would probably look different).
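The tagging idea can be sketched without timely at all. In the standard-library-only example below, `ConnId`, `stage`, and `deliver` are hypothetical names, and the "connections" are just in-memory outboxes; a real server would hold sockets or response handles instead. The point is only that each record carries its tag through the processing stage, so the tail end can route results regardless of the order in which they arrive.

```rust
use std::collections::HashMap;

/// Hypothetical identifier for the connection a request arrived on.
type ConnId = u32;

/// A processing stage: transforms the payload and carries the tag along untouched.
fn stage(input: Vec<(ConnId, String)>) -> Vec<(ConnId, String)> {
    input
        .into_iter()
        .map(|(conn, body)| (conn, body.to_uppercase()))
        .collect()
}

/// Tail end: look up each tag and hand the response to the matching connection.
/// Responses whose connection is no longer known are dropped.
fn deliver(responses: Vec<(ConnId, String)>, conns: &mut HashMap<ConnId, Vec<String>>) {
    for (conn, body) in responses {
        if let Some(outbox) = conns.get_mut(&conn) {
            outbox.push(body);
        }
    }
}

fn main() {
    // Stand-ins for open connections: one outbox per connection id.
    let mut conns: HashMap<ConnId, Vec<String>> = HashMap::new();
    conns.insert(1, Vec::new());
    conns.insert(2, Vec::new());

    // Requests may be processed in any order; the tag keeps them routable.
    let requests = vec![(2, "world".to_string()), (1, "hello".to_string())];
    deliver(stage(requests), &mut conns);

    println!("conn 1 got {:?}, conn 2 got {:?}", conns[&1], conns[&2]);
}
```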

I think the model of "web service as dataflow" has a lot of merit; my recollection is that @antiguru and @utaal may have looked at this, and I think the problem they found was that the Rust web frameworks weren't amenable to the adaptation in part because they like to be sneaky and secretive about their resources (e.g. with shared buffers managed cleverly). At the same time, the Soup folks at MIT are looking into dataflow for web databases, and I think they've had some luck with it.

frankmcsherry commented 6 years ago

I'm not sure I've nailed your question, but if you have an example "application" where you are trying to figure out whether timely would fit in, we could try and talk that through in more detail. I'm not 100% sure I understand what functionality you are trying to fit in.

allain commented 6 years ago

Thanks for replying.

I suppose the problem I try out new tech with is a website scraper. I have about 400 clients with around 50 pages each, from which I've got to scrape content.

If I'm understanding timely correctly, messages are delivered asynchronously (there's no saying exactly when a particular message will be delivered), but once a message is delivered to a vertex in OnRecv, everything is synchronous until OnRecv completes.

If that particular OnRecv code examines the message and decides that it can't do anything until it has the content from a URL, I'm not sure how that'd get handled.

frankmcsherry commented 6 years ago

Aha. Ok, so between dataflow stages there are no guarantees about synchrony (but some about order). Within the context of an operator, the execution is single-threaded, and the code that you write from input.next() through output.send() runs uninterrupted. It may be that your operator gets some inputs (e.g. urls) that lead to work (e.g. async calls to servers) that cannot be resolved immediately, which you are welcome to punt on for the moment (e.g. with a Future) and finish processing in some future invocation of the operator.

There are some details here; timely has "capabilities" that allow operators to reserve the right to send output messages in the future, and you'll want to stash them with any Future that hasn't been resolved. For external async io, this is probably the way it works out, and timely doesn't really help you here (in the way that perhaps Tokio would).
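To make the "stash it and finish later" pattern concrete, here is a minimal sketch that uses only the standard library, not timely itself. `Token` is a hypothetical stand-in for a capability (the right to emit output at a given time), `Fetcher` and `schedule` are made-up names for the operator and one invocation of it, and the "fetch" is simulated by a thread rather than a real HTTP call. Each invocation accepts new inputs, starts work, and emits only what has already completed; everything else stays stashed alongside its token.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::thread;

/// Hypothetical stand-in for a capability: the right to emit output at `time`.
struct Token {
    time: u64,
}

/// Hypothetical operator state: in-flight fetches, each stashed with its token.
struct Fetcher {
    pending: Vec<(Token, Receiver<String>)>,
}

impl Fetcher {
    fn new() -> Self {
        Fetcher { pending: Vec::new() }
    }

    /// One invocation of the operator. It never blocks: new work is started,
    /// finished work is emitted, and unfinished work waits for a later call.
    fn schedule(&mut self, inputs: Vec<(u64, String)>) -> Vec<(u64, String)> {
        for (time, url) in inputs {
            let (tx, rx) = channel();
            // Simulated fetch; a real scraper would issue an HTTP request here.
            thread::spawn(move || {
                let _ = tx.send(format!("body of {}", url));
            });
            self.pending.push((Token { time }, rx));
        }

        let mut outputs = Vec::new();
        let mut still_pending = Vec::new();
        for (token, rx) in std::mem::take(&mut self.pending) {
            match rx.try_recv() {
                // Work finished: use the token to emit at its time, then drop it.
                Ok(body) => outputs.push((token.time, body)),
                // Not ready yet: keep the token and the handle for next time.
                Err(TryRecvError::Empty) => still_pending.push((token, rx)),
                // The worker went away without a result: give up the token.
                Err(TryRecvError::Disconnected) => {}
            }
        }
        self.pending = still_pending;
        outputs
    }
}

fn main() {
    let mut op = Fetcher::new();
    let mut done = op.schedule(vec![
        (0, "http://a.example".to_string()),
        (1, "http://b.example".to_string()),
    ]);
    // Later invocations carry no new input; they only finish stashed work.
    while !op.pending.is_empty() {
        thread::yield_now();
        done.extend(op.schedule(Vec::new()));
    }
    done.sort();
    println!("{:?}", done);
}
```

In real timely code the token would be an actual capability held by the operator, and holding it is what tells the rest of the dataflow that output for that time may still appear. The sketch only mimics that bookkeeping.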

So, I think you could write a vertex that "pretends" to be some async service, where you send it messages describing requests, it starts up the work and sends responses when they complete. Timely wouldn't be doing anything especially helpful here, and I guess what would be neat here is that it could interop with the rest of a dataflow computation.

Does this help?