Python and guvectorize query language performance can be slow

chipkent commented 1 year ago

A user asked me to examine the performance of the grouping_agg function in the code below. He believed that "slow" performance he was seeing was the result of group_by followed by ungroup. I created the following benchmarks to examine performance. These benchmarks indicate the group_by followed by ungroup is efficient, but how logic is represented as functions can make a 7x performance difference.

The numbers below were generated on a v0.29 release candidate.

# Example function

from typing import Union, Sequence, Tuple
from deephaven.table import Table
from deephaven.updateby import rolling_group_tick
from math import sqrt

def grouping_agg(t: Table, by: Sequence[str], formulas=Sequence[str]) -> Table:
    rst_cols = []
    rst_cols.extend(by)
    rst_cols.extend([f.split("=")[0].strip() for f in formulas])

    rst = t.group_by(by).update(formulas).view(rst_cols)
    return rst

# Example query

from typing import List
from deephaven.column import string_col, long_col, double_col
from deephaven import empty_table, new_table
from deephaven import agg
from datetime import datetime
import numba as nb
import numpy as np

def custom_func_python(x, y) -> float:
    return sum(x) + sqrt(len(y))

def custom_func_numpy(x, y) -> float:
    return np.sum(x) + sqrt(len(y))

@nb.guvectorize([(nb.float64[:],nb.float64[:],nb.float64[:])],"(m),(m)->(m)",nopython=True)
def custom_func_numba(x, y, rst):
    rst[:] = sum(x) + sqrt(len(y))

@nb.guvectorize([(nb.float64[:],nb.float64[:],nb.float64[:])],"(m),(m)->(m)",nopython=True)
def custom_func_numbanumpy(x, y, rst):
    rst[:] = np.sum(x) + sqrt(len(y))

def run_it(label: str, formulas: List[str], n_row: int, n_group: int, n_repeat: int):
    t = empty_table(n_row).update(["Id=ii%n_group", "Offset=ii%5", "Value1=random()", "Value2=random()", "OtherCol=1"])

    start = datetime.now()
    for i in range(n_repeat):
        t1 = grouping_agg(t, by=["Id"], formulas=formulas)
    stop = datetime.now()
    dt = stop - start
    sec_per_eval = dt.total_seconds() / n_repeat
    ns_per_row = sec_per_eval / n_row * 1e9
    print(f"TIME: {n_row} {n_group} {label}:\t{ns_per_row:.2f} ns/row")
    return (label, n_row, n_group, ns_per_row)

n_rows = [100_000, 1_000_000, 10_000_000,]
n_groups = [2, 20, 200, 2000]
n_repeat = 10

data = []

for n_group in n_groups:
    for n_row in n_rows:
        data.append(run_it("Java+BuiltIn", ["F=sum(Value1)+sqrt(Value2.size())"], n_row=n_row, n_group=n_group, n_repeat=n_repeat))
        data.append(run_it("Custom+Py+Cast", ["F = (double) custom_func_python(Value1,Value2)"], n_row=n_row, n_group=n_group, n_repeat=n_repeat))
        data.append(run_it("Custom+Py", ["F = custom_func_python(Value1,Value2)"], n_row=n_row, n_group=n_group, n_repeat=n_repeat))
        data.append(run_it("Custom+Numpy", ["F = custom_func_numpy(Value1,Value2)"], n_row=n_row, n_group=n_group, n_repeat=n_repeat))
        data.append(run_it("Custom+Numba", ["F = custom_func_numba(Value1,Value2)"], n_row=n_row, n_group=n_group, n_repeat=n_repeat))
        data.append(run_it("Custom+NumbaNumpy", ["F = custom_func_numbanumpy(Value1,Value2)"], n_row=n_row, n_group=n_group, n_repeat=n_repeat))

perf = new_table([
    string_col("Label", [x[0] for x in data]),
    long_col("NRow", [x[1] for x in data]),
    long_col("NGroup", [x[2] for x in data]),
    double_col("NSperRow", [x[3] for x in data]),
])

There are some interesting things to note: 1) Python to Java typecasting does not appear to make a material performance difference. This casting should be happening in both cases.

2) Performance is reasonably consistent across different numbers of rows, once the number of rows is sufficiently large.

3) Performance is reasonably consistent across different numbers of groups.

4) Performance is highly variable depending upon how logic is represented.

Java built-in performance is by far the best (baseline)
Python vector operations are 7x slower than baseline.
Using numpy functions is about 2x slower than baseline.
Using guvectorize is 5x slower than baseline and 3x slower than numpy. This is surprising and should be looked at.
guvectorize produces the same performance if the python or numpy method is used. This suggests that the same code is generated.

This data suggests that:

Users need to be very careful in choosing how to represent a function to get maximum performance. What guidance can we provide?
Python performance may have the potential for improvement.
guvectorize performance may have the potential for improvement.

jmao-denver commented 1 year ago

After much struggle with the environment on my laptop, I ran some tests with a slightly modified version of @chipkent 's code and got this result. The conclusion: the performance hit by applying the dh_null_to_nan decorator is surprisingly small 10 - 15%. However with dh_null_to_na (to Pandas series), it is much worse, which can be explained by the fact that

We first convert the Java array to numpy array, then build a series from the numpy array indicating no-copy should be honored as much as possible.
Pandas Series can do a lot more than the raw numpy array and thus the overhead

The same result can be seen across the spectrum of group numbers, row counts.

Based on these results, I think we probably should avoid providing the option of auto-conversion-to-pd-na and simply auto converting Java arrays to numpy arrays and auto applying the null conversion. This will make

Avi happy
for users who want to receive Pandas series, they have the option to create one from the numpy array in the UDF itself, although for integer types and boolean, the float64 promotion introduces unnecessary overhead. A solution to this is to keep the dh_null_to_na decorator and check for its application at runtime to determine if numpy array conversion should be automatically run.

No conversion

@dh_null_to_nan

@dh_null_to_na

jmao-denver commented 1 year ago

I ran into a case where the UDF is type annotated to receive numpy int64 ndarray as an input parameter and return the same type but instead of getting a float64 ndarray and returning a float64 ndarray. This made me think that the 'implicit' auto-null conv and its side effect of auto promotion integer arrays to float64 arrays is really too aggressive and will trip a lot of users. So it seems that having a auto_null_conv decorator to force explicit use of it is still the way to go.

jmao-denver commented 7 months ago

I added a new variant to @chipkent 's original script of the custom Python function with type hints and ran it under the https://github.com/deephaven/deephaven-core/pull/5291. The key results are:

The full output:

deephaven / deephaven-core

Python and guvectorize query language performance can be slow #4635