Open CTCC1 opened 1 year ago
Thank you @CTCC1 for bringing this to our attention! We do plan to set up a more formal process for performance testing our functions and As-Of Joins are at the top of that list. We're also doing a big refactoring of the code that will make it easier to compare different implementations head-to-head, so this is very useful information, thanks!
I never tested this library but its implementation is quite similar to an in house implementation we previously used. Switching to spark PIT gave us a 20X speedup for the asof join stage on one of our workloads.
I will mention though that spark PIT is totally un-maintained and it does have quite a lot of bugs. I ended up creating a fork which fixed all the bugs I found https://github.com/Tom-Newton/spark-pit.
I ran into some online benchmarks about AS OF join where in certain cases, "early stop sort merge join" can outperform UNION based AS OF join.
https://www.hopsworks.ai/post/a-spark-join-operator-for-point-in-time-correct-joins (fwiw, it mentioned tempo as the inspiration for the UNION based AS OF join)
open sourced implementations https://github.com/Ackuq/spark-pit/blob/main/scala/src/main/scala/execution/Patterns.scala
Would be interested to see what the community / maintainers think.