databrickslabs / tempo

API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation
https://pypi.org/project/dbl-tempo
Other
309 stars 53 forks source link

Test out early stop sort merge join to handle AS OF join? #360

Open CTCC1 opened 1 year ago

CTCC1 commented 1 year ago

I ran into some online benchmarks about AS OF join where in certain cases, "early stop sort merge join" can outperform UNION based AS OF join.

https://www.hopsworks.ai/post/a-spark-join-operator-for-point-in-time-correct-joins (fwiw, it mentioned tempo as the inspiration for the UNION based AS OF join)

open sourced implementations https://github.com/Ackuq/spark-pit/blob/main/scala/src/main/scala/execution/Patterns.scala

Would be interested to see what the community / maintainers think.

tnixon commented 1 year ago

Thank you @CTCC1 for bringing this to our attention! We do plan to set up a more formal process for performance testing our functions and As-Of Joins are at the top of that list. We're also doing a big refactoring of the code that will make it easier to compare different implementations head-to-head, so this is very useful information, thanks!

Tom-Newton commented 4 months ago

I never tested this library but its implementation is quite similar to an in house implementation we previously used. Switching to spark PIT gave us a 20X speedup for the asof join stage on one of our workloads.

I will mention though that spark PIT is totally un-maintained and it does have quite a lot of bugs. I ended up creating a fork which fixed all the bugs I found https://github.com/Tom-Newton/spark-pit.