MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

Memory error for Larger Datasets - Tf-IDF #43

Closed: palaiTB closed this issue 1 year ago

palaiTB commented 1 year ago

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

tfidf = TFIDF(n_gram_range=(3, 3))
model = PolyFuzz(tfidf)

MemoryError: Unable to allocate 91.0 GiB for an array with shape (X, Y) and data type float64.

X ~ 10k rows, Y ~ 1M rows

Is there any way to fix this? I've tried changing the pagefile size limit, but to no avail.

MaartenGr commented 1 year ago

Which version of PolyFuzz are you currently using?

Also, it might be worthwhile to use the fit and transform combo in order to reduce the necessary memory. You can find a bit more about that here. Simply fit on a subset of the data and then iteratively transform on chunks of the data.
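For reference, a rough sketch of that pattern could look like the following, where all_words is just a placeholder for your full list of strings and the subset and chunk sizes are only examples:

from polyfuzz import PolyFuzz

model = PolyFuzz("TF-IDF")

# Fit on a manageable subset so the representations and similarity matrix stay small
model.fit(all_words[:50000])

# Transform the rest of the words in chunks to keep memory usage bounded
chunk_size = 50000
results = []
for i in range(0, len(all_words), chunk_size):
    results.append(model.transform(all_words[i:i + chunk_size]))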

palaiTB commented 1 year ago

I'm using the latest one available: 0.3.2

MaartenGr commented 1 year ago

In that case, using the fit and transform combo should reduce the memory errors that you have.

MaartenGr commented 1 year ago

One other thing: the latest version available is actually 0.4.0. I do not believe there are any performance fixes there with respect to the issue you are having, but it might be worthwhile to try it out.

palaiTB commented 1 year ago

Alright. Using the latest version didn't work out. I'll try the fit and transform combo and update you.

palaiTB commented 1 year ago

RuntimeError: nnz of the result is too large

My train_words list is ~1M entries long and I'm getting this error when trying to fit the model.

MaartenGr commented 1 year ago

Make sure that the train_words list is a subset of your data when using .fit. For example, 100.000 words or even less. Then, when you apply .train you can pass the words in increments. For example, sets of 50.000 words.

palaiTB commented 1 year ago

I tried to fit it on a dataset of ~50k rows and here's the error:

MemoryError: Unable to allocate 18.6 GiB for an array with shape (50000, 50000) and data type float64.

MaartenGr commented 1 year ago

Could you share your exact code for doing this? It makes it a bit easier to see where the issue stems from if I can see how the model is created and fitted.

palaiTB commented 1 year ago

It's pretty straightforward:

from polyfuzz import PolyFuzz

model = PolyFuzz("TF-IDF")

train_words = df['column_name'].head(50000).astype('U').tolist()

model.fit(train_words)

I'm converting a dataframe column to a list and then using it as train_words. That shouldn't be an issue, in my opinion.

MaartenGr commented 1 year ago

Did you install PolyFuzz through pip install polyfuzz[fast]? It generally uses less memory and often prevents these issues.

Something else you can try is the following:

from polyfuzz import PolyFuzz

model = PolyFuzz("TF-IDF")

train_words = df['column_name'].head(50000).astype('U').tolist()

model.fit(["empty doc"], train_words)

That way, the resulting matrix is 1x50000 instead of 50000x50000.

palaiTB commented 1 year ago

Alright. I installed through pip install polyfuzz==0.4.0

I'll try installing with [fast] and give that snippet a go.

Thanks!

palaiTB commented 1 year ago

Make sure that the train_words list is a subset of your data when using .fit. For example, 100.000 words or even less. Then, when you apply .train you can pass the words in increments. For example, sets of 50.000 words.

Erm, AttributeError: 'PolyFuzz' object has no attribute 'train'

I couldn't find .train in the docs

MaartenGr commented 1 year ago

Oops, I meant the .transform function. Having said that, I think [fast] should already help quite a bit, so you might not need the fit/transform steps at all.

palaiTB commented 1 year ago

Okay, thanks, but [fast] hasn't really been that effective.

So the issue is that my train_words list is huge, around 1M records. If I fit all 1M records directly, the step hasn't been able to complete; it's been 167 minutes and the cell is still executing. I'm confused about what you meant by making the train_words list a subset of the data. The transform function basically maps the unseen_words onto the train_words, doesn't it? Or am I mistaken?

Ideally, I should be passing the unseen_words in increments to the transform function, and that's understandable. But how do I fit the entire 1M records of train_words?

MaartenGr commented 1 year ago

Alright, I took some time reading through the code and I believe there are two potential sources of issues related to the sparse matrix representations.

First, there is the calculation of the TF-IDF representation itself. If you have 1M documents and you extract a representation for each of them, then the resulting sparse matrix can be quite large. In your case, if I am not mistaken, that seems to be a sparse matrix of size 1M x 1k, which is already sizable. It gets quite a bit bigger when you cast it to a dense format, which was my initial thought on seeing the error you have:

MemoryError: Unable to allocate 91.0 GiB for an array with shape (X, Y) and data type float64.

This used to be an issue with an earlier version of PolyFuzz but should now be fixed.

Second, there is the act of calculating the cosine similarity between these large sparse matrices. Based on the above, the resulting similarity matrix would be 1M x 1M, which is generally not feasible; a dense 1M x 1M float64 matrix alone would take roughly 8 TB of RAM. However, the [fast] option is a solution here, as it does not keep the entire 1M x 1M similarity matrix in memory.
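To make that idea concrete, here is a standalone sketch of top-n sparse cosine similarity using the sparse_dot_topn package; the word lists and the 0.75 cut-off are purely illustrative, and the function name below is the one used in sparse_dot_topn releases from around this time:

from sklearn.feature_extraction.text import TfidfVectorizer
from sparse_dot_topn import awesome_cossim_topn

# Toy word lists purely for illustration
from_words = ["apple", "appel", "banana"]
to_words = ["apple inc", "banana split", "pineapple"]

# Character trigram TF-IDF; rows are L2-normalised by default, so dot products are cosine similarities
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit(from_words + to_words)
from_matrix = vectorizer.transform(from_words)
to_matrix = vectorizer.transform(to_words)

# Keep only the single best match per row with similarity >= 0.75, so the result
# stays sparse instead of materialising a dense len(from_words) x len(to_words) matrix
similarities = awesome_cossim_topn(from_matrix, to_matrix.transpose().tocsr(), ntop=1, lower_bound=0.75)
print(similarities)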

If I directly fit all 1M records, the step hasn't been able to complete. It's been 167 mins and the cell is still executing.

This might actually indicate that the procedure is working correctly. Imagine, for a single word, trying to find all similar words in a list of a million words, and then doing that for each of the million words. That procedure can take quite a while, and the [fast] option keeps the memory from being overloaded by not creating the full 1M x 1M similarity matrix at once.

I'm confused as to what you meant by specifying that the train_keyword list is a subset of data.

Apologies for the confusion, I was typing from my cellphone, which is why my reply was rather short. I meant fitting on a subset of the data, for example 50.000 documents instead of the full 1M documents, as there might be a way to consider some part of your data a subset. Typically, users have a smaller list of words that they want the input words to be mapped to, and fitting on that smaller list is much faster. I am not familiar with your use case, so that might be a naive approach, since I am not aware of whether your goal involves a much smaller word set that you want the input words to be mapped to.

The transform function is to basically a mapping of the unseen_words with the train_words isn't it? Or am I mistaken?

In essence, yes. It calculates the cosine similarity between the unseen_words and the train_words without having to re-calculate the representations of the train_words.

palaiTB commented 1 year ago

Thanks for elaborating!

Okay, I reckon I should wait for the process to complete then. It's been 239 mins as of now.

My use case is basically to match a set of unseen_keywords against a catalogue, which is understandably huge.

palaiTB commented 1 year ago

Update: installing polyfuzz with [fast] finally yielded results. The cell took ~5 hours to complete execution, and after that it is able to transform 20k rows of unseen_words within 30 seconds, which is wicked quick. Thanks for helping out.

I noticed that the similarity score seems to have a threshold at 0.75 for TF-IDF. All the scores are either between 0.75 and 1 or they are 0; I don't see any scores strictly between 0 and 0.75. I'm using an n-gram range of (3, 4). Is that the reason?

So a lot of the unseen_words are mapped to None with a similarity score of 0, which seems wrong because there are words in the train_words that have similar sequences.

Ideally we should be seeing scores ranging from 0 to 1.

palaiTB commented 1 year ago

Final update: simply using PolyFuzz('TF-IDF'), the results are encouraging and I can see scores across the whole 0-1 range. So it's preferable not to tinker with the n-gram ranges.

Thanks for all the support, @MaartenGr! Cheers 🙌

MaartenGr commented 1 year ago

I noticed that the similarity score seems to have a threshold at 0.75 for TF-IDF. All the scores are either between 0.75 and 1 or they are 0; I don't see any scores strictly between 0 and 0.75. I'm using an n-gram range of (3, 4). Is that the reason?

The reason for this lies in one of the parameters, namely min_similarity. That value is set to 0.75 as a default to reduce the necessary memory. Setting that value to 0 should also resolve the issue.
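As a quick sketch of where that could be set (assuming the min_similarity parameter lives on the TFIDF model, and using an arbitrary n-gram range):

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

# min_similarity=0 keeps all matches instead of dropping those scoring below the 0.75 default;
# note that this can increase memory usage, which is why the default is 0.75
tfidf = TFIDF(n_gram_range=(3, 3), min_similarity=0)
model = PolyFuzz(tfidf)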