palaiTB closed this issue 1 year ago
Which version of PolyFuzz are you currently using?
Also, it might be worthwhile to use the `fit` and `transform` combo in order to reduce the necessary memory. You can find a bit more about that here. Simply fit on a subset of the data and then iteratively transform on chunks of the data.
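As a rough illustration of that pattern (using sklearn's `TfidfVectorizer` as a stand-in, not the PolyFuzz API): fit the vectorizer once, then transform the remaining words in fixed-size chunks so only one chunk's similarity matrix is ever dense in memory.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the fit/transform pattern described above.
train_words = ["apple", "banana", "grape", "orange"]
unseen_words = ["aple", "banan", "graep", "orang"]

# Fit once on the training vocabulary (or a subset of it)...
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
train_matrix = vectorizer.fit_transform(train_words)

# ...then transform in chunks, keeping only one chunk dense at a time.
matches = {}
chunk_size = 2
for start in range(0, len(unseen_words), chunk_size):
    chunk = unseen_words[start:start + chunk_size]
    sims = vectorizer.transform(chunk) @ train_matrix.T  # sparse product
    best = sims.toarray().argmax(axis=1)
    for word, idx in zip(chunk, best):
        matches[word] = train_words[idx]

print(matches)
```

Only the `chunk_size x n_train` block is ever materialized as a dense array, which is what keeps memory bounded regardless of how many unseen words there are.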
I'm using the latest one available: 0.3.2
In that case, using the `fit` and `transform` combo should reduce the memory errors that you have.
One other thing: the latest one available is actually 0.4.0. I do not believe there are any performance fixes there with respect to the issue you are having, but it might be worthwhile to try it out.
Alright. Using the latest version didn't work out. I'll try the fit and transform combo and update you.
`RuntimeError: nnz of the result is too large`
My train_words list is ~1M and I'm getting this error when trying to fit the model.
Make sure that the `train_words` list is a subset of your data when using `.fit`. For example, 100.000 words or even less. Then, when you apply `.train` you can pass the words in increments. For example, sets of 50.000 words.
I tried to fit it on a dataset of ~50k rows and here's the error:

```
MemoryError: Unable to allocate 18.6 GiB for an array with shape (50000, 50000) and data type float64.
```
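As a side note, the 18.6 GiB in that error is exactly the cost of holding a dense 50,000 x 50,000 float64 similarity matrix, which a quick back-of-the-envelope calculation confirms:

```python
# Where the 18.6 GiB figure comes from: a dense 50,000 x 50,000
# float64 matrix at 8 bytes per entry.
rows = cols = 50_000
total_bytes = rows * cols * 8   # 20,000,000,000 bytes
gib = total_bytes / 2**30       # NumPy reports allocation sizes in GiB
print(f"{gib:.1f} GiB")         # 18.6 GiB
```

This is why fitting on a smaller subset, or keeping the matrix sparse, makes such a dramatic difference: memory grows quadratically with the number of words.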
Could you share your exact code for doing this? It makes it a bit easier to see where the issue stems from if I can see how the model is created and fitted.
It's pretty straightforward:

```python
from polyfuzz import PolyFuzz

model = PolyFuzz("TF-IDF")
train_words = df['column_name'].head(50000).astype('U').tolist()
model.fit(train_words)
```
I'm converting a dataframe column to a list and then using it as train_words. Shouldn't be an issue in my opinion.
Did you install PolyFuzz through `pip install polyfuzz[fast]`? It generally uses less memory and often prevents these issues.
Something else you can try is the following:

```python
from polyfuzz import PolyFuzz

model = PolyFuzz("TF-IDF")
train_words = df['column_name'].head(50000).astype('U').tolist()
model.fit(["empty doc"], train_words)
```
That way, the resulting matrix is 1x50000 instead of 50000x50000.
Alright. I installed through `pip install polyfuzz==0.4.0`. I'll try to use [fast], and I'll try out that snippet. Thanks!
> Make sure that the `train_words` list is a subset of your data when using `.fit`. For example, 100.000 words or even less. Then, when you apply `.train` you can pass the words in increments. For example, sets of 50.000 words.
Erm, `AttributeError: 'PolyFuzz' object has no attribute 'train'`. I couldn't find `.train` in the docs.
Oops, I meant the `.transform` function. Having said that, I think [fast] should already help quite a bit, so maybe you wouldn't need the fit/transform steps.
Okay, thanks but [fast] hasn't really been that effective.
So the issue is, my train_words is huge, right, around 1M records. If I directly fit all 1M records, the step hasn't been able to complete. It's been 167 mins and the cell is still executing. I'm confused as to what you meant by specifying that the train_keyword list is a subset of data. The transform function basically maps the unseen_words to the train_words, isn't it? Or am I mistaken?
Ideally, I should be passing the unseen_words in increments to the transform function and that's understandable. How do I fit the entire 1M records of train_words?
Alright, I took some time reading through the code and I believe there are two places for potential issues with using sparse matrix representations.
First, it is calculating the TF-IDF representation by itself. If you have 1M documents and you extract a representation for each of them, the resulting sparse matrix can be quite large. In your case, if I am not mistaken, it seems to be a sparse matrix of size 1M x 1k, which is quite large. It gets quite a bit bigger when you cast it to a dense format, which was my initial thought seeing the error you have:
```
MemoryError: Unable to allocate 91.0 GiB for an array with shape (X, Y) and data type float64.
```
This used to be an issue with an earlier version of PolyFuzz but now should be fixed.
Second, there is the act of calculating the cosine similarity between these large sparse matrices. Based on the above, the resulting similarity matrix would be 1M x 1M which is generally not possible and requires extreme amounts of RAM. However, the [fast] option is a solution as it does not keep the entire 1M x 1M similarity matrix in memory.
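The idea behind avoiding the full similarity matrix can be sketched as follows (an illustration of the technique, not the actual PolyFuzz/[fast] internals): compute cosine similarities chunk by chunk and keep only each row's best match, so the full n x n matrix never exists in memory at once.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

# Random sparse "TF-IDF" matrix standing in for real document vectors.
X = sparse.random(1_000, 200, density=0.1, format="csr", random_state=0)
X = normalize(X)  # L2-normalize rows so X @ X.T yields cosine similarities

n = X.shape[0]
top_idx = np.empty(n, dtype=np.int64)
chunk = 100
for start in range(0, n, chunk):
    # Only a chunk x n block is ever dense, never the full n x n matrix.
    sims = (X[start:start + chunk] @ X.T).toarray()
    top_idx[start:start + chunk] = sims.argmax(axis=1)

# Sanity check: every row's best match is itself (cosine similarity 1.0).
all_self = bool((top_idx == np.arange(n)).all())
print(all_self)
```

For 1M x 1M this keeps peak memory at roughly `chunk * n` floats per step instead of `n * n`, which is the difference between feasible and impossible on a normal machine.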
> If I directly fit all 1M records, the step hasn't been able to complete. It's been 167 mins and the cell is still executing.
This might actually indicate that the procedure is working correctly. Imagine, for a single word, trying to find all similar words in a list of a million words, and then doing that for each of the million words. That procedure can take quite a while, and the [fast] option prevents overloading the memory by not creating the full 1M x 1M similarity matrix.
> I'm confused as to what you meant by specifying that the train_keyword list is a subset of data.
Apologies for the confusion; I was typing from my cellphone, so my reply was rather short. I meant fitting on a subset of the data, for example 50.000 documents instead of the full 1M documents, as there might be a way to consider some part of your data a subset. Typically, users have a smaller list of words that they want the input words to be mapped to, and fitting on that smaller list is much faster. I am not familiar with your use case, so that might be a naive approach, considering I am not aware of whether your goal involves a much smaller word set that you want the input words to be mapped to.
> The transform function basically maps the unseen_words to the train_words, isn't it? Or am I mistaken?
In essence, yes. It calculates the cosine similarity between `unseen_words` and the `train_words` without having to re-calculate the `train_words`.
Thanks for elaborating!
Okay, I reckon I should wait for the process to complete then. It's been 239 mins as of now.
My use case is to basically match a set of unseen_keywords against a catalogue which is understandably huge.
Update: Installing polyfuzz with [fast] finally yielded results. The cell took ~5 hours to complete execution, and after that it is able to transform 20k rows of unseen_words within 30 seconds, which is wicked quick. Thanks for helping out.
I noticed that the similarity score has somewhat of a threshold at 0.75 for tf-idf. So all the scores are either between 0.75 to 1 or they are 0. I don't see scores in the range [0, 0.75]. I'm using n-gram of (3,4). Is that the reason?
So, a lot of the unseen_words are mapped to `None` with a similarity score of 0, which is somewhat wrong because there are some words in the trained_words that have similar sequences. Ideally we should be seeing scores ranging from 0 to 1.
Final Update: On simply using `PolyFuzz("TF-IDF")`, the results are encouraging and I can see scores in the 0-1 range. So preferably, there's no need to tinker around with the n-gram ranges.
Thanks for all the support @MaartenGr! Cheers 🙌
> I noticed that the similarity score has somewhat of a threshold at 0.75 for tf-idf. So all the scores are either between 0.75 to 1 or they are 0. I don't see scores in the range [0, 0.75]. I'm using n-gram of (3,4). Is that the reason?
The reason for this lies in one of the parameters, namely `min_similarity`. That value is set to 0.75 by default to reduce the necessary memory. Setting that value to 0 should also resolve the issue.
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

tfidf = TFIDF(n_gram_range=(3, 3))
model = PolyFuzz(tfidf)
```
```
MemoryError: Unable to allocate 91.0 GiB for an array with shape (X, Y) and data type float64.
```

with X ≈ 10k rows and Y ≈ 1M rows. Is there any way to fix it? I've tried changing the pagefile size limit, but in vain.