Use cases for timely dataflow

opensourcegeek commented 7 years ago

I've been working on some data streaming mechanisms from devices (like IoT) into databases. The on-device data generator is implemented in Rust. I currently use MQTT to transport the data, which is then held in a queue (RabbitMQ or Cloud Pub/Sub).

I'm currently evaluating Apache Beam to push data into Google's BigQuery, which I've already been using and am hugely impressed with. I stumbled across this project as I was looking for a Rust alternative. I haven't fully gone through your documentation or series of blog posts (apologies!), but before that I have a few questions:

Am I correct to assume timely addresses the same use cases as projects like Spark, Storm, or Beam?

Also, could you have computations running on separate nodes?

frankmcsherry commented 7 years ago

> Am I correct to assume timely addresses the same use cases as projects like Spark, Storm, or Beam?

Roughly, yeah. The same computations, and often more general than what they can handle.

Though you could say that the Apache stuff is targeting use cases that are more failure-prone, and need to be .. I don't know, "web scale". The timely stuff has been more aligned with on-prem installations, where if your machines start going down something is wrong in your building and you need to get out. For example, the fault-tolerance story is roughly "timely gives you accurate, conservative progress information, and is fail-stop in the case of errors". This means you often want it to be the compute layer between stable storage systems (e.g., Kafka) rather than a reliable source of truth itself. There is ongoing work here, but it is a performance trade-off, and you gain an order of magnitude or more by not having fault tolerance built in when you don't need it.

> Also, could you have computations running on separate nodes?

The same program runs single-threaded, multi-threaded, or across multiple machines, and it does all of the serialization and communication transparently.
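
To make the "same program, different degrees of parallelism" idea concrete, here is a minimal standard-library sketch. It is not timely's API: `sum_of_squares` is a hypothetical helper, and the round-robin sharding is an assumption chosen for illustration. The point is only that the per-record logic is written once and the worker count is a runtime parameter that does not change the result.

```rust
use std::thread;

/// Runs the same per-record logic on `workers` threads and merges the results.
/// Hypothetical helper for illustration only; not part of timely.
fn sum_of_squares(workers: usize, input: &[u64]) -> u64 {
    assert!(workers > 0, "need at least one worker");
    let handles: Vec<_> = (0..workers)
        .map(|index| {
            // Worker `index` takes every `workers`-th record, starting at `index`.
            let shard: Vec<u64> = input
                .iter()
                .copied()
                .skip(index)
                .step_by(workers)
                .collect();
            // The closure is identical for every worker; only the shard differs.
            thread::spawn(move || shard.iter().map(|x| x * x).sum::<u64>())
        })
        .collect();
    // Merge the partial results from all workers.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker panicked"))
        .sum()
}

fn main() {
    let input: Vec<u64> = (1..=100).collect();
    let single = sum_of_squares(1, &input);
    let multi = sum_of_squares(4, &input);
    // The answer does not depend on how many workers ran the computation.
    assert_eq!(single, multi);
    println!("1 worker: {}, 4 workers: {}", single, multi);
}
```

In timely itself the partitioning, serialization, and communication are handled by the library rather than by hand as in this sketch.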

frankmcsherry commented 7 years ago

If you are using Beam mostly to do scalable ETL just to get your data into BigQuery, or mostly for that, I would recommend sticking with it for now. The timely benefits are mostly in raw performance, rather than fit and finish and integration with existing services. If your experience with Beam was "neat but too slow" then timely could make sense; if you are looking for a Rust version of the same experience, .. well check it out but I'd personally probably stick with what works. :)

frankmcsherry commented 7 years ago

All that being said, if you want to try it out we'll help, and it would be cool to learn about what sucks. :D

opensourcegeek commented 7 years ago

We're still evaluating Cloud Dataflow, so it's not yet in production.

The pricing model for Beam (or mainly Cloud Dataflow) isn't very attractive at the moment. Any time a job in the pipeline runs slowly, which could be because of a programming error in the code, we could end up paying (significantly) more. So at the very least, I'd like to try timely to see whether it could be a good fit. Thanks for the offer to help - I'm sure I'll need some when I'm starting! :)

Could you direct me to the part of the documentation that explains how timely can run in the "multi-machine" setting, please?

frankmcsherry commented 7 years ago

There's not too much to say about distributed execution. The main thing is that you need to specify addresses and ports, and start the processes (timely doesn't speak ZK or anything like that). Again, for on-prem stuff this is usually pretty easy, but it can be painful in VM settings if the infrastructure doesn't want to hand out hostnames.

Details here: https://github.com/frankmcsherry/timely-dataflow#execution

If there is something that sounds like it's missing, let me know. When you own the machines it really is this simple, but there might be lots of missing steps if you have your hands tied by the provider.
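
As a rough sketch of what "specify addresses and ports, and start the processes" involves, the standard-library snippet below builds a hostfile and the per-process argument lists. It assumes the `-w` (worker threads), `-n` (process count), `-p` (process index), and `-h` (hostfile) options described in the linked execution section; check that section for the authoritative details. The host names, ports, file name `hosts.txt`, binary name `my_dataflow`, and both helper functions are hypothetical.

```rust
/// Builds hostfile contents: one `hostname:port` line per process.
/// Hypothetical helper; host names and ports are placeholders.
fn hostfile_contents(hosts: &[(&str, u16)]) -> String {
    hosts
        .iter()
        .map(|(host, port)| format!("{}:{}\n", host, port))
        .collect()
}

/// Builds the arguments for the process at `index` out of `processes`.
/// Flag names are assumed from the README's execution section.
fn process_args(hostfile: &str, processes: usize, index: usize, workers: usize) -> Vec<String> {
    assert!(index < processes, "process index must be below the process count");
    vec![
        "-w".to_string(),
        workers.to_string(),
        "-n".to_string(),
        processes.to_string(),
        "-p".to_string(),
        index.to_string(),
        "-h".to_string(),
        hostfile.to_string(),
    ]
}

fn main() {
    // Placeholder machines; replace with addresses you actually control.
    let hosts = [("host0.example", 2101), ("host1.example", 2101)];
    print!("{}", hostfile_contents(&hosts));

    // Each machine starts the same binary with its own process index.
    for index in 0..hosts.len() {
        let args = process_args("hosts.txt", hosts.len(), index, 2);
        println!("on {}: ./my_dataflow {}", hosts[index].0, args.join(" "));
    }
}
```

Every process is launched with the same hostfile and process count and differs only in its index, which is why this is straightforward when you control the machines and their hostnames.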

Use cases for timely dataflow #101