51zero / eel-sdk

Big Data Toolkit for the JVM
Apache License 2.0

Set Theory / Relational operators on a Datastream #310

Closed hannesmiller closed 7 years ago

hannesmiller commented 7 years ago

Set Theory / Relational operators on a Datastream:

All very useful for performing transformations.

Hopefully the above is self-explanatory, but here's a link for any help/ideas:

http://training.databricks.com/visualapi.pdf

The thing to think about is that the RHS set should already be materialised and broadcast to all EEL IO workers.
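
To illustrate the idea of a materialised right-hand side, here is a minimal sketch in plain Scala. It does not use EEL's API: `Row`, `BroadcastJoinSketch` and `hashJoin` are hypothetical names, and the "broadcast" is simply an in-memory `Map` that each worker would hold a copy of. The left side is consumed lazily, one row at a time.

```scala
// Hypothetical row type for illustration only; not EEL's Row.
final case class Row(key: Int, value: String)

object BroadcastJoinSketch {
  // The right side is already materialised as a Map (the "broadcast" copy).
  // The left side stays a lazy Iterator, so only one left row is held at a time.
  def hashJoin(left: Iterator[Row], right: Map[Int, String]): Iterator[(Int, String, String)] =
    left.flatMap { row =>
      // Emit a joined tuple only when the key exists on the right (inner join).
      right.get(row.key).map(r => (row.key, row.value, r))
    }
}
```

Because the right side is a plain lookup structure, the join never needs to buffer the left stream, which is presumably why the RHS has to be materialised up front.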

Spark, for example, has a broadcast function which can be used with join on a DataFrame:

df.join(broadcast(otherDf), ...)

sksamuel commented 7 years ago

Yes, good idea. It's along the lines of the join operation that was added in 1.2.

sksamuel commented 7 years ago

We already have union; in fact we have join, concat, and union operations. I've added cartesian, which seems really useful. I've also added subtract and intersection. I'm not sure what the others really mean in the context of eel. For example, what would zip do? And unique would mean having to buffer the entire lot, which goes against how a datastream works. So I'm happy that with join, concat, union, intersection, cartesian, and subtract we've got a good set!
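
As a rough illustration of why subtract and intersection fit a streaming model while unique does not, here is a sketch over plain Scala iterators. This is not EEL's implementation; `SetOpsSketch` and its method names are hypothetical, and the right-hand side is assumed to be an already materialised `Set`.

```scala
import scala.collection.mutable

object SetOpsSketch {
  // Stateless per element: each left value is checked against the materialised right set.
  def subtract[A](left: Iterator[A], right: Set[A]): Iterator[A] =
    left.filterNot(right)

  // Also stateless per element. Note that duplicates on the left are kept in this sketch.
  def intersection[A](left: Iterator[A], right: Set[A]): Iterator[A] =
    left.filter(right)

  // unique has to remember every distinct value seen so far,
  // so its memory use grows with the stream instead of staying constant.
  def unique[A](left: Iterator[A]): Iterator[A] = {
    val seen = mutable.HashSet.empty[A]
    left.filter(a => seen.add(a)) // add returns true only the first time a value is seen
  }
}
```

For example, `SetOpsSketch.subtract(Iterator(1, 2, 3, 4), Set(2, 4)).toList` yields `List(1, 3)` without holding more than one left element at a time, whereas `unique` keeps its `seen` set alive for the whole stream.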

hannesmiller commented 7 years ago

Yeah, you've got the fundamentals there. zip and unique could be used if we had sliding windows, i.e. if we had streaming.

However that's not what we are.

sksamuel commented 7 years ago

We do have streaming. Just really big windows :)