Closed tnixon closed 1 year ago
Refactoring the reduce(...)
to a for-loop did cut the time a bit, but not enough. Doing this and moving from many .withColumn(...)
calls to a single large .select(...)
brings the time to run the function down by at least an order of magnitude. See the following plot:
note that the blue line in this plot does not represent the same quantity as in the previous plot
Execution time is still growing with the # of columns (as would be expected: we need to operate on each of these with custom expressions), but it is considerably less steep than before. It is a bit troubling that the curve appears to still be quadratic, and not linear with the number of columns. Perhaps there are other places where the function could be improved.
Ok, after running it through the debugger again, I've found that with this fix the __getLastRightRow(...)
is very fast, but there is still considerable time being spent in 2 other helper functions:
__addPrefixToColumns
__addColumnsFromOtherDF
both of which also use the reduce( withColumn( ... ))
pattern. So I'm going to refactor those as well.Ok, with those 2 functions refactored, the performance now appears to be linear wrt. the number of columns:
... and significantly faster than the previous implementation:
Closed with merge of #363
Note this issues arose as a consequence of Databricks issue ES-855980
The construction of a joined DataFrame using the
TSDF.asofJoin
method gets very slow as the number of columns gets large. It appears (based on preliminary testing) to be growing with the square of the number of columns (see attached plot). This does not involve actually executing the as-of join, just calling the asofJoin function, which just constructs a new TSDF object by chaining operations onto the DAG for the underlying Spark DataFrame. It really should not take this long.It appears (again, based on preliminary investigation) that the cause of this comes from the use of Python's
reduce(...)
function to chain a sequence of.withColumn(...)
calls onto the DataFrame. This occurs in 4 places in the__getLastRightRow(...)
helper function:It seems like this pattern does not scale well with the number of columns. We should try some alternatives to speed up the execution of the
asofJoin(...)
function in constructing these queries. A couple of initial thoughts:reduce(...)
to a simple for-loop over relevant columns.withColumn(...)
calls to a single big.select(...)
over a list of relevant expresssions