databrickslabs / tempo

API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation
https://pypi.org/project/dbl-tempo
Other
306 stars 52 forks source link

Constructing AsOfJoin gets very slow as number of columns increases #362

Closed tnixon closed 1 year ago

tnixon commented 1 year ago

Note this issues arose as a consequence of Databricks issue ES-855980

The construction of a joined DataFrame using the TSDF.asofJoin method gets very slow as the number of columns gets large. It appears (based on preliminary testing) to be growing with the square of the number of columns (see attached plot). This does not involve actually executing the as-of join, just calling the asofJoin function, which just constructs a new TSDF object by chaining operations onto the DAG for the underlying Spark DataFrame. It really should not take this long. newplot

It appears (again, based on preliminary investigation) that the cause of this comes from the use of Python's reduce(...) function to chain a sequence of .withColumn(...) calls onto the DataFrame. This occurs in 4 places in the __getLastRightRow(...) helper function:

It seems like this pattern does not scale well with the number of columns. We should try some alternatives to speed up the execution of the asofJoin(...) function in constructing these queries. A couple of initial thoughts:

tnixon commented 1 year ago

Refactoring the reduce(...) to a for-loop did cut the time a bit, but not enough. Doing this and moving from many .withColumn(...) calls to a single large .select(...) brings the time to run the function down by at least an order of magnitude. See the following plot: newplot (1) note that the blue line in this plot does not represent the same quantity as in the previous plot

Execution time is still growing with the # of columns (as would be expected: we need to operate on each of these with custom expressions), but it is considerably less steep than before. It is a bit troubling that the curve appears to still be quadratic, and not linear with the number of columns. Perhaps there are other places where the function could be improved.

tnixon commented 1 year ago

Ok, after running it through the debugger again, I've found that with this fix the __getLastRightRow(...) is very fast, but there is still considerable time being spent in 2 other helper functions:

tnixon commented 1 year ago

Ok, with those 2 functions refactored, the performance now appears to be linear wrt. the number of columns: newplot (2)

... and significantly faster than the previous implementation: newplot (3)

tnixon commented 1 year ago

Closed with merge of #363