NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0
14 stars 3 forks source link

benchmarks for array.filter(x -> x.isin(<column from other relation>)) #32

Closed NickCrews closed 8 months ago

NickCrews commented 8 months ago

Just writing down my findings when trying to work around https://stackoverflow.com/questions/77559936/how-to-implementlist-filterarray-elem-elem-in-column-in-other-table

This solution uses an unnest, semi join, collect, then rejoin. It took ~5 sec. This requires a bad API that takes the entire table and returns a modified table.

from mismo import _util

def array_filter_isin_other(
    t: ir.Table,
    array: ir.ArrayColumn | str,
    other: ir.Column,
    *,
    result_format: str = "{name}_filtered",
) -> ir.Table:
    """
    Equivalent to t.mutate(result_name=t[array].filter(lambda x: x.isin(other_table[<single_column>])))

    We can't have subqueries in the filter lambda (need this to avoid https://stackoverflow.com/questions/77559936/how-to-implementlist-filterarray-elem-elem-in-column-in-other-table)
    """  # noqa E501
    other_table = other.name("other").as_table()
    array_col = _util.get_column(t, array)
    t = t.mutate(__array=array_col, __id=ibis.row_number())
    temp = t.select("__id", __unnested=_.__array.unnest())
    filtered = temp.semi_join(other_table, temp.__unnested == other_table.other)
    re_agged = filtered.group_by("__id").agg(__filtered=_.__unnested.collect())
    re_joined = t.join(re_agged, "__id").drop(["__id", "__array"])
    result_name = result_format.format(name=array_col.get_name())
    return re_joined.rename({result_name: "__filtered"})

array_filter_isin_other_pandas(rec_prepped, "donor_addresses_tokens", rt.term)

This solution just executes the haystack, so that the haystack terms are literally templated into the SQL. It took about 10 seconds. This allows for a nicer API of just (Array, Column) -> Array (although here I wrap it in a function with the same API as the above, so they are easily swappable).

def array_filter_isin_other_pandas(
    t: ir.Table,
    array: ir.ArrayColumn | str,
    other: ir.Column,
    *,
    result_format: str = "{name}_filtered",
) -> ir.Table:
    array_col = _util.get_column(t, array)
    result_name = result_format.format(name=array_col.get_name())
    other_series = other.execute()
    return t.mutate({result_name: array_col.filter(lambda x: x.isin(other_series))})

array_filter_isin_other_pandas(rec_prepped, "donor_addresses_tokens", rt.term)

Basically I think my finding here is that I'd rather have the slower one for the better API.

NickCrews commented 8 months ago

oops, these timings were just for repr-ing the first 10 rows. if I actually .cache() the result, then the pandas implementation takes 3.5 minutes, but the semi_join implementation still is just 10 seconds.