This solution uses an unnest, semi join, collect, then rejoin. It took ~5 sec. This requires a bad API that takes the entire table and returns a modified table.
from mismo import _util
def array_filter_isin_other(
t: ir.Table,
array: ir.ArrayColumn | str,
other: ir.Column,
*,
result_format: str = "{name}_filtered",
) -> ir.Table:
"""
Equivalent to t.mutate(result_name=t[array].filter(lambda x: x.isin(other_table[<single_column>])))
We can't have subqueries in the filter lambda (need this to avoid https://stackoverflow.com/questions/77559936/how-to-implementlist-filterarray-elem-elem-in-column-in-other-table)
""" # noqa E501
other_table = other.name("other").as_table()
array_col = _util.get_column(t, array)
t = t.mutate(__array=array_col, __id=ibis.row_number())
temp = t.select("__id", __unnested=_.__array.unnest())
filtered = temp.semi_join(other_table, temp.__unnested == other_table.other)
re_agged = filtered.group_by("__id").agg(__filtered=_.__unnested.collect())
re_joined = t.join(re_agged, "__id").drop(["__id", "__array"])
result_name = result_format.format(name=array_col.get_name())
return re_joined.rename({result_name: "__filtered"})
array_filter_isin_other_pandas(rec_prepped, "donor_addresses_tokens", rt.term)
This solution just executes the haystack, so that the haystack terms are literally templated into the SQL. It took about 10 seconds. This allows for a nicer API of just (Array, Column) -> Array (although here I wrap it in a function with the same API as the above, so they are easily swappable).
oops, these timings were just for repr-ing the first 10 rows. if I actually .cache() the result, then the pandas implementation takes 3.5 minutes, but the semi_join implementation still is just 10 seconds.
Just writing down my findings when trying to work around https://stackoverflow.com/questions/77559936/how-to-implementlist-filterarray-elem-elem-in-column-in-other-table
This solution uses an unnest, semi join, collect, then rejoin. It took ~5 sec. This requires a bad API that takes the entire table and returns a modified table.
This solution just executes the haystack, so that the haystack terms are literally templated into the SQL. It took about 10 seconds. This allows for a nicer API of just (Array, Column) -> Array (although here I wrap it in a function with the same API as the above, so they are easily swappable).
Basically I think my finding here is that I'd rather have the slower one for the better API.