Open snth opened 13 hours ago
Hi @snth, thanks a lot, happy to see you here, and great question! (I'm not a power user of those libraries so correct me if I'm wrong!)
I want to outline the fundamental design of each library to provide insight into how it influences the way you structure your logic when working with them:
streamz
's Stream
instance is an acyclic tree of operations, root being source and leaves being sinks. Stream.emit
allows to push an element through this tree down to every leaf.RxPY
implements an Observer Pattern: You compose operations to form a subject, onto which you can subscribe callback functions that will be notified/called when an incoming element has been operated and is ready for final consumption.streamable
's Stream
is a decorator for iterables: i.e. a Stream[T]
is initiated from an Iterable[T]
and it is itself an Iterable[T]
. It exposes chainable lazy operations (stream's methods), each returning a new child stream referencing its parent. It is trivial to integrate as you can init a stream from any iterable in your codebase and throw it at any function accepting an iterable. To make the learning curve zero flat, I wanted the interface to feel both pythonic and familiar to how one manipulate collections in functional languages.More comparison material: if you don't come from there you should check the reddit post (Especially the Comparison section at the end) and this comment.
where you want streamable to go in future?
I leveraged it at my previous job to implement 30 custom ETL pipelines that are running in production for a year now. So I have the responsability to maintain its quality over years.
I am glad it is now in the feedback phase, gathering some "this would be a cool choice for my use case but it misses that feature" or "how to implement this use case?". I am grateful that other contributors are starting to come into the loop, you are more than welcome! 🫡 . Let's make it a useful, intuitive and robust lib for everybody enjoying top-to-bottom-read iterable operations, out-of-the-box concurrency, and any other PullRequestable feature 🤓.
The only constraint is to keep this library as unopinionated as possible and minimize its responsabilities i.e. this would be against the philosophy of the library:
Stream.from_csv("sales.csv")
.join(db_type="postgres", db_name="main", schema_name="public", table="user", on_keys=("user_id",))
.to_bigquery("enriched_sales", partition=datetime.date.today(), batchsize=1024)
but instead, one can for example:
Iterable[Dict[str, Any]]
of input rows using csv
module.map
using a psycopg2
client.group(size=1024)
.foreach
using a bigquery.Client.insert_rows_json
I hope it makes sense 🙏🏻 !
Hi,
First off, this project looks really cool! I'm sorry to be starting with an issue asking for comparisons with similar projects but while I've long been a fan of stream processing type patters, I also want to limit how many projects I keep track off so it would be great to know how you see streamable viz a viz similar projects and where you want streamable to go in future?
Things in my current repertoire that look similar on the surface are: