NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Poor scaling of add_tfidf to larger datasets #42

Closed: jstammers closed this issue 5 months ago

jstammers commented 6 months ago

I've noticed that add_tfidf seems to scale poorly with larger datasets.

On one of my datasets, the execution time seems to scale approximately linearly, but it takes far longer than is practical:

[benchmark plot: add_tfidf execution time vs. dataset size]

For a dataset of around 1M records, I have found that this implementation does not complete within an hour, and the process ultimately crashes with an out-of-memory error when calling df.execute()

For comparison, I've implemented the following function using sklearn's TfidfVectorizer:

import ibis
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def add_tfidf_sk(df, column):
    v = TfidfVectorizer()
    # Join each record's token array back into a single string for sklearn.
    corpus = df[column].to_pandas().map(lambda x: " ".join(x)).to_list()
    X = v.fit_transform(corpus)
    feature_names = v.get_feature_names_out()

    # Get the indices and data from the sparse matrix
    coo = X.tocoo()

    # Split the flat COO arrays into one group per row. Note this assumes
    # every document has at least one nonzero feature; an all-zero row has
    # no COO entries and would shift the alignment of everything after it.
    row_starts = np.unique(coo.row, return_index=True)[1][1:]
    keys = np.split(feature_names[coo.col], row_starts)
    values = np.split(coo.data, row_starts)

    # Create the final dictionary
    tfidf_dict = {"keys": keys, "values": values}
    t = ibis.memtable(tfidf_dict)

    # Create MapColumn and join onto original table using row_number
    idf_m = ibis.map(keys=t.keys, values=t.values).name(f"{column}_tfidf")
    idf_table = idf_m.as_table()
    idf_table = idf_table.mutate(__id=ibis.row_number())
    df = df.mutate(__id=ibis.row_number())
    df = df.join(idf_table, "__id")
    return df.drop("__id")

which scales much better to larger datasets, although it is not backend-agnostic:

[benchmark plot: sklearn-based implementation]

If you have any thoughts on how this implementation could be sped up, I'd be happy to discuss them further. Right now, I'm not sure it is fast enough for my use case.

NickCrews commented 6 months ago

I am not surprised by the poor performance. If you look at the SQL that gets generated from this query, it is a nasty, nasty thing. That's sort of the price we're paying for the ergonomics of ibis. It mostly comes down to the API I have been trying to support of "one row per record, with terms going in an array column". SQL isn't really designed around this; it is better to have unnested tables, e.g. with one row per term.
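For concreteness, here's a toy sketch of the two shapes (hypothetical data; plain ibis calls, not the mismo API):

import ibis
from ibis import _

# One row per record: terms live in an array column.
records = ibis.memtable({"record_id": [1, 2], "terms": [["ab", "cd"], ["cd"]]})

# One row per term: the same data, unnested. SQL engines are generally much
# happier joining and aggregating on this shape.
per_term = records.select("record_id", term=_.terms.unnest())
# record_id | term
#         1 | ab
#         1 | cd
#         2 | cd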

Thanks for doing the benchmarking here. I've been meaning to do something like this. If the code you have is moderately reproducible, it would be great if you could upload it here; perhaps I will get to adding it as a benchmark (no guarantees though), and it will make it much more likely that I actually dive into this and improve it. I wouldn't expect to get to this anytime soon. As a workaround, can you do the tfidf step as a UDF with sklearn?

jstammers commented 6 months ago

Here's a slight modification that makes use of a built-in dataset:

import ibis
from ibis import _
import pandas as pd
from mismo.sets import add_tfidf
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from mismo.datasets import load_patents

df = load_patents()
df = df.mutate(class_tokens=_.classes.split("**"))

# Oversample to make the dataset bigger
dfs = [df] * 100
df = df.union(*dfs).cache()

def add_tfidf_sk(df, column):
    v = TfidfVectorizer()
    def tokenize(x):
        if x is None or len(x) == 0:
            return ""
        return " ".join(x)
    corpus = df[column].to_pandas().map(tokenize).to_list()
    X = v.fit_transform(corpus)
    feature_names = v.get_feature_names_out()

    # Get the indices and data from the sparse matrix
    coo = X.tocoo()

    # Split the flat COO arrays into one group per row (same caveat as in the
    # earlier version: all-zero rows, e.g. from records that tokenize to "",
    # have no COO entries and shift the alignment of later rows).
    row_starts = np.unique(coo.row, return_index=True)[1][1:]
    keys = np.split(feature_names[coo.col], row_starts)
    values = np.split(coo.data, row_starts)

    # Create the final dictionary
    tfidf_dict = {"keys": keys, "values": values}
    t = ibis.memtable(tfidf_dict)

    # Create MapColumn and join onto original table using row_number
    idf_m = ibis.map(keys=t.keys, values=t.values).name(f"{column}_tfidf")
    idf_table = idf_m.as_table()
    idf_table = idf_table.mutate(__id=ibis.row_number())
    df = df.mutate(__id=ibis.row_number())
    df = df.join(idf_table, "__id")
    return df.drop("__id")

def timeit(f):
    start = time()
    f()
    end = time()
    print(f"Time taken: {end - start}")
    return end - start

# 50 dataset sizes, log-spaced from 10 to 100,000 rows.
n = np.power(10, np.linspace(1, 5, 50)).astype(int)

times = [timeit(lambda: add_tfidf(df.limit(i), "class_tokens").execute()) for i in n]
times_2 = [timeit(lambda: add_tfidf_sk(df.limit(i), "class_tokens").execute()) for i in n]

benchmark = pd.DataFrame({"n": n, "time_ibis": times, "time_sk": times_2})
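A quick way to visualize the resulting frame (this assumes matplotlib is installed):

# Plot both timing curves on log-log axes.
ax = benchmark.plot(x="n", y=["time_ibis", "time_sk"], loglog=True, marker="o")
ax.set_xlabel("rows")
ax.set_ylabel("seconds")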

The difference here isn't quite as drastic, which I suspect is because the number of unique words here is much lower than in my actual dataset, which has a vocabulary of around 75k words. This makes me wonder if it's related to the operation here that uses the entire idf_map and term_counts per row.

https://github.com/NickCrews/mismo/blob/ffb592c3e322cab30ac55fc64c273d988650a6d4/mismo/sets/_tfidf.py#L252-L257

[benchmark plot: time_ibis vs. time_sk on the oversampled patents dataset]

jstammers commented 6 months ago

Using the following implementation, I've found that the runtime performance is much closer to my version that uses TfidfVectorizer:

def add_tfidf(
    t,
    column: str,
    *,
    result_name: str = "{name}_tfidf",
    normalize: bool = True,
):
    # Assumes ibis, _ (from ibis), and mismo's add_array_value_counts,
    # term_idf, and vector helpers are in scope, as in mismo/sets/_tfidf.py.
    with_counts = add_array_value_counts(t, column, result_name="__term_counts")
    if normalize:
        with_counts = with_counts.mutate(
            __term_counts=vector.normalize(with_counts.__term_counts)
        )
    idf = term_idf(t[column])
    # Unnest the per-record term-count maps to one row per (record, term)...
    with_counts = with_counts.mutate(__row_number=ibis.row_number())
    term_counts = with_counts.select("__row_number", "__term_counts")
    term_counts = term_counts.mutate(
        term=_.__term_counts.keys().unnest(),
        term_count=_.__term_counts.values().unnest(),
    ).drop("__term_counts")
    # ...so the idf lookup becomes a plain join on the term column...
    idf_table = term_counts.join(idf, "term")
    idf_table = idf_table.mutate(idf=_.idf * _.term_count).drop("term_count")
    r = result_name.format(name=column)
    # ...then re-aggregate back to one map per record.
    idf_table = idf_table.group_by("__row_number").aggregate(
        ibis.map(idf_table.term.collect(), idf_table.idf.collect()).name(r)
    )
    result = with_counts.join(idf_table, "__row_number").drop(
        "__row_number", "__term_counts"
    )
    return result
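For reference, usage of this version looks like the following (the call signature matches the function above):

# Adds a "class_tokens_tfidf" map column, normalized by default.
with_tfidf = add_tfidf(df, "class_tokens")
# normalize=False skips the normalization of the per-record term counts.
with_tfidf_raw = add_tfidf(df, "class_tokens", normalize=False)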

On the patents dataset, it seems to perform well:

[benchmark plot: patents dataset]

And it performs much better on my dataset, with only a small linear overhead compared to the sklearn version:

[benchmark plot: author's dataset]

It may be possible to optimise this further, but let me know if you'd like me to submit this as a PR for review.

NickCrews commented 6 months ago

Thanks so much @jstammers! I will take a deeper look when I'm at a computer, but if you were able to keep the same API and it's simply faster, that seems like an obviously good change to make :)

I'm curious what you think about the API in general, though. Do you think it's important to keep it "one row per record", or, if we could get more performance out of it, would you want to go to a "one row per term" model? That might influence the rest of the bag-of-words APIs so they all interoperate nicely. I'm also curious whether you are doing other cleaning/transformation steps to the data outside of mismo, and how well the current APIs are suiting your needs. If you could share your entire script (no need for it to be clean or reproducible, just enough for me to get the gist) and any pain points you've found, perhaps we can make your life easier.

NickCrews commented 5 months ago

@jstammers I went with your suggested fix, except I had to add some fixup code:

    result = ...
    # Records with an empty term array get an explicit empty map in the
    # result column, with the same map type as the rest of the column.
    empty = ibis.literal({}, type=result[r].type())
    result = result.mutate((_[column].length() == 0).ifelse(empty, _[r]).name(r))
    return result

to deal with the case of passing in an empty list of terms. Without this fixup, pytest -vv mismo/sets/tests/test_tfidf.py::test_add_tfidf was failing.
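For concreteness, a minimal way to exercise that edge case might look like this (hypothetical data, not the actual test):

import ibis

# A record whose term array is empty should come back with an empty map,
# not NULL, and should not be dropped by the joins inside add_tfidf.
t = ibis.memtable({"terms": [["a", "b", "a"], []]})
add_tfidf(t, "terms").execute()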

Can you try your benchmarks on this updated version? I don't think this change should affect the performance at all, but I want to be sure.

NickCrews commented 5 months ago

Thank you so much for the detailed repro and the suggested fixup! Those things really help me!

cpcloud commented 5 months ago

Fascinating thread y'all! @jstammers @NickCrews Thanks for persevering!