MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

Extreme memory use in TFIDF matcher in 0.3 #20

Closed by adamskogman 3 years ago

adamskogman commented 3 years ago
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF
# source_names: ~2,500 strings; target_names: ~1.5 million strings
tfidf = TFIDF(n_gram_range=(3, 3), model_id='normalize', min_similarity=0.94)
model = PolyFuzz(tfidf)
model.match(source_names, target_names)

Using polyfuzz[fast].

The source list is ~2,500 names; the target list is ~1.5 million names.

In 0.2.2, this runs well: the memory use of the Python process goes from 1 GB to 1.5 GB during the operation.

In 0.3, the memory of the process climbs to 40 GB (forty!).

Yes, I have a big laptop, but this fails in Docker, which in our case has a hard cap of 4 GB.

MaartenGr commented 3 years ago

I think it is trying to go from the sparse TF-IDF matrix to a regular NumPy array, which makes the memory usage explode:

https://github.com/MaartenGr/PolyFuzz/blob/a60dfc62e9b31e188490ebabd1c9481d58f7732b/polyfuzz/models/_utils.py#L88-L89
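
As a back-of-the-envelope illustration (my numbers, not from the thread): densifying a matrix with one row per source and one column per target at the reported scale already needs roughly 30 GB, which lines up with the observed blow-up:

# Hypothetical illustration: memory needed to densify a
# 2,500 x 1,500,000 similarity matrix of float64 values.
n_sources = 2_500
n_targets = 1_500_000
dense_bytes = n_sources * n_targets * 8  # 8 bytes per float64
print(dense_bytes / 1e9)  # ~30 GB, before any intermediate copies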

For now, I would suggest either staying on 0.2.2 or using method="knn" or method="sklearn".
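
A minimal sketch of that workaround, assuming the knn/sklearn choice is exposed through the TFIDF model's cosine_method parameter:

tfidf = TFIDF(n_gram_range=(3, 3), model_id='normalize',
              min_similarity=0.94, cosine_method='knn')  # or 'sklearn'
model = PolyFuzz(tfidf)
model.match(source_names, target_names)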

I will see if it is possible to fix this issue while still keeping the option to select a top_n.

MaartenGr commented 3 years ago

@adamskogman Just updated PolyFuzz to 0.3.2, which should fix your issue. Let me know if it does not work!
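
(For anyone finding this later: upgrading with pip install --upgrade polyfuzz picks up the patched release.)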