Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

Altered PIT statement for performance improvements #224

Open adamgrbac opened 5 months ago

adamgrbac commented 5 months ago

Rewriting the PIT Macro to increase performance in Databricks

Background

At our company we recently started adding PIT tables to our BDV layer, and found that some of our larger, core concepts were suffering severe performance issues.

Our Customer concept builds a PIT over 7 satellites and was taking about 3.5 hours to write the table in Databricks (~200M rows); another concept, which builds a PIT over 18 satellites, ran for 16 hours before we had to kill it (never completed - ~84M rows).

Changes

The PIT macro is currently set up to join the hub + as-of date spine to each satellite on several rows, then aggregate over those rows to determine whether a key + date exists in that range. For large tables this causes memory issues, and in our particular Databricks setup it causes data to spill to disk. The proposed change creates a CTE for each satellite that adds a "next" LDTS field (using a LEAD function), so that the join from the hub + date spine to the satellites produces the exact row (if any) matching the criteria key = key and as_of_date between sat.ldts and sat.next_ldts. Each spine row then matches at most one satellite row, so the many-row join and the aggregation step disappear entirely.
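As a rough illustration of the two join shapes, here is a minimal sketch. Table and column names are hypothetical (the real macro generates this SQL per satellite from its configuration), and the interval join is written half-open here to keep each match unique:

```sql
-- Current shape (simplified): every as-of date joins to every earlier
-- satellite row for the same key, then an aggregate picks the latest.
SELECT
    spine.customer_hk,
    spine.as_of_date,
    MAX(sat.load_datetime) AS sat_load_datetime
FROM as_of_date_spine AS spine
LEFT JOIN customer_sat AS sat
    ON  sat.customer_hk  = spine.customer_hk
    AND sat.load_datetime <= spine.as_of_date
GROUP BY spine.customer_hk, spine.as_of_date;

-- Proposed shape (simplified): a CTE gives each satellite row its
-- successor's load date via LEAD, so each (key, as_of_date) pair joins
-- to at most one satellite row and no aggregation is needed.
WITH sat_with_next AS (
    SELECT
        customer_hk,
        load_datetime,
        -- open-ended rows get a far-future sentinel as their "next" LDTS
        LEAD(load_datetime, 1, TIMESTAMP '9999-12-31 23:59:59')
            OVER (PARTITION BY customer_hk ORDER BY load_datetime) AS next_load_datetime
    FROM customer_sat
)
SELECT
    spine.customer_hk,
    spine.as_of_date,
    sat.load_datetime AS sat_load_datetime
FROM as_of_date_spine AS spine
LEFT JOIN sat_with_next AS sat
    ON  sat.customer_hk = spine.customer_hk
    AND spine.as_of_date >= sat.load_datetime
    AND spine.as_of_date <  sat.next_load_datetime;
```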

Results

In practice this change led to a dramatic performance improvement. The PIT for our Customer concept, which previously took ~3.5 hours, now completes in just over 5 minutes (~40x speedup), and the PIT for our other concept, which never finished (but ran for at least 16 hours), now finishes in ~35 minutes.

N.B. I haven't rewritten this for other platforms, but hopefully a similar structure is just as useful for them. Please reach out if you want to discuss!