I am not surprised by the poor performance. If you look at the SQL that gets generated from this query, it is a nasty, nasty thing. Sorta the price we're paying for the ergonomics of ibis. It mostly comes down to the API I have been trying to support of "one row per record, with terms going in an array column". SQL isn't really designed around this; it is better to have unnested tables, e.g. with one row per term.
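For concreteness, here is a rough sketch of the two layouts (table and column names are hypothetical; assumes an array-capable backend such as DuckDB):
import ibis
from ibis import _

# "one row per record": the terms live in an array column, so the generated
# SQL has to juggle per-row arrays/maps.
records = ibis.memtable({"record_id": [1, 2], "terms": [["a", "b"], ["b"]]})

# "one row per term": unnesting gives a long/tidy table that ordinary SQL
# group-bys and joins handle much more naturally.
per_term = records.select("record_id", term=_.terms.unnest())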
Thanks for doing the benchmarking here. I've been meaning to do something like this. If the code you have is moderately reproducible, it would be great if you could upload it here. Perhaps I will get to adding it as a benchmark (no guarantees though), and having it will make it much more likely for me to actually dive into this and improve it. I wouldn't expect to get to this anytime soon. As a workaround, can you do the tfidf step as a UDF with sklearn?
Here's a slight modification that makes use of a built-in dataset:
import ibis
from ibis import _
import pandas as pd
from mismo.sets import add_tfidf
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from mismo.datasets import load_patents
df = load_patents()
df = df.mutate(class_tokens=_.classes.split("**"))
dfs = [df] * 100
# Oversample to make the dataset bigger
df = df.union(*dfs).cache()
def add_tfidf_sk(df, column):
    v = TfidfVectorizer()

    def tokenize(x):
        if x is None or len(x) == 0:
            return ""
        return " ".join(x)

    corpus = df[column].to_pandas().map(tokenize).to_list()
    X = v.fit_transform(corpus)
    feature_names = v.get_feature_names_out()
    # Get the indices and data from the sparse matrix
    coo = X.tocoo()
    keys = np.split(
        feature_names[coo.col], np.unique(coo.row, return_index=True)[1][1:]
    )
    values = np.split(coo.data, np.unique(coo.row, return_index=True)[1][1:])
    # Create the final dictionary
    tfidf_dict = {"keys": keys, "values": values}
    t = ibis.memtable(tfidf_dict)
    # Create MapColumn and join onto original table using row_number
    idf_m = ibis.map(keys=t.keys, values=t.values).name(f"{column}_tfidf")
    idf_table = idf_m.as_table()
    idf_table = idf_table.mutate(__id=ibis.row_number())
    df = df.mutate(__id=ibis.row_number())
    df = df.join(idf_table, "__id")
    return df.drop("__id")
def timeit(f):
    start = time()
    f()
    end = time()
    print(f"Time taken: {end - start}")
    return end - start
n = np.power(10, np.linspace(1, 5, 50)).astype(int)
times = [timeit(lambda: add_tfidf(df.limit(i), "class_tokens").execute()) for i in n]
times_2 = [timeit(lambda: add_tfidf_sk(df.limit(i), "class_tokens").execute()) for i in n]
benchmark = pd.DataFrame({"n": n, "time_ibis": times, "time_sk": times_2})
The difference here isn't quite as drastic, which I suspect is because the number of unique words here is much lower than in my actual dataset, which has around 75k words in its vocabulary. This makes me wonder if the slowdown is related to the operation here that uses the entire idf_map and term_counts per row.
Using the following implementation, I've found that the runtime performance is much closer to my version that uses TfidfVectorizer:
def add_tfidf(
    t,
    column: str,
    *,
    result_name: str = "{name}_tfidf",
    normalize: bool = True,
):
    with_counts = add_array_value_counts(t, column, result_name="__term_counts")
    if normalize:
        with_counts = with_counts.mutate(
            __term_counts=vector.normalize(with_counts.__term_counts)
        )
    idf = term_idf(t[column])
    with_counts = with_counts.mutate(__row_number=ibis.row_number())
    term_counts = with_counts.select("__row_number", "__term_counts")
    term_counts = term_counts.mutate(
        term=_.__term_counts.keys().unnest(),
        term_count=_.__term_counts.values().unnest(),
    ).drop("__term_counts")
    idf_table = term_counts.join(idf, "term")
    idf_table = idf_table.mutate(idf=_.idf * _.term_count).drop("term_count")
    r = result_name.format(name=column)
    idf_table = idf_table.group_by("__row_number").aggregate(
        ibis.map(idf_table.term.collect(), idf_table.idf.collect()).name(r)
    )
    result = with_counts.join(idf_table, "__row_number").drop(
        "__row_number", "__term_counts"
    )
    return result
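For reference, a minimal sketch of how the two versions can be timed head to head, reusing the oversampled df, its class_tokens column, and the timeit helper from the earlier snippet (illustrative only; exact numbers will vary):
# Assumes the rewritten add_tfidf above is in scope (shadowing the imported one),
# alongside add_tfidf_sk, df, and timeit defined earlier.
for i in (1_000, 10_000, 100_000):
    t_ibis = timeit(lambda: add_tfidf(df.limit(i), "class_tokens").execute())
    t_sk = timeit(lambda: add_tfidf_sk(df.limit(i), "class_tokens").execute())
    print(f"n={i}: ibis={t_ibis:.2f}s sklearn={t_sk:.2f}s")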
On the patents dataset, it seems to perform well.
And it performs much better on my dataset, with only a small linear overhead compared to the sklearn version.
It may be possible to optimise this further, but let me know if you'd like me to submit this as a PR for review.
Thanks so much @jstammers! I will take a deeper look when I'm at a computer, but if you were able to keep the same API and it is simply faster, that seems like an obviously good change to make :)
I'm curious what you think about the API in general, though. Do you think it's important to keep it "one row per record", or, if we could get more performance out of it, would you want to go to a "one row per term" model? That might influence the rest of the bag-of-words APIs so they all interoperate nicely. I'm also curious whether you are doing other cleaning/transformation steps to the data outside of mismo, and how well the current APIs are suiting your needs. If you could share your entire script (no need for it to be clean or reproducible, just enough for me to get the gist) and any pain points you've found, perhaps we can make your life easier.
@jstammers I went with your suggested fix, except I had to add some fixup code of
result = ...
empty = ibis.literal({}, type=result[r].type())
result = result.mutate((_[column].length() == 0).ifelse(empty, _[r]).name(r))
return result
to deal with the case of passing in an empty list of terms. Without this fixup, pytest -vv mismo/sets/tests/test_tfidf.py::test_add_tfidf was failing.
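For context, a minimal sketch of the kind of input that exercises this path (hypothetical table and column names; assumes an array-capable backend such as DuckDB):
import ibis

# The second record has an empty terms array. The fixup above ensures its
# tfidf column comes back as an empty map {} rather than failing the test.
t = ibis.memtable({"record_id": [1, 2], "terms": [["a", "b"], []]})
add_tfidf(t, "terms").execute()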
Can you try your benchmarks on this updated version? I don't think this change should affect the performance at all, but I want to be sure.
Thank you so much for the detailed repro and the suggested fixup! Those things really help me!
Fascinating thread y'all! @jstammers @NickCrews Thanks for persevering!
I've noticed that add_tfidf seems to scale poorly with larger datasets. On one of my datasets, the execution time seems to scale approximately linearly, but it takes far longer than is practical. For a dataset of around 1M records, I have found that this implementation does not complete after an hour, and the process ultimately crashes due to running out of memory when calling df.execute().
In comparison, I've implemented a function using sklearn's TfidfVectorizer (a variant of the add_tfidf_sk snippet above), which scales much better to larger datasets, although it is not backend-agnostic.
If you have any thoughts on how this implementation could be sped up, I'd be happy to discuss them further. Right now, I'm not sure it is fast enough for my use case.