MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
725 stars 68 forks

from polyfuzz import PolyFuzz #36

Open wdchild opened 2 years ago

wdchild commented 2 years ago

Although I was able to run some of your basic example code with PolyFuzz, once I tried experimenting with Embeddings or BERT, the entire package broke. It seems to come down to incompatible numpy versions. Currently, a basic

pip install polyfuzz followed by

from polyfuzz import PolyFuzz gives me the following error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [63], in <cell line: 1>()
----> 1 from polyfuzz import PolyFuzz

File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/__init__.py:1, in <module>
----> 1 from .polyfuzz import PolyFuzz
      2 __version__ = "0.3.2"

File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/polyfuzz.py:7, in <module>
      5 from polyfuzz.linkage import single_linkage
      6 from polyfuzz.utils import check_matches, check_grouped, create_logger
----> 7 from polyfuzz.models import TFIDF, RapidFuzz, Embeddings, BaseMatcher
      8 from polyfuzz.metrics import precision_recall_curve, visualize_precision_recall
     10 logger = create_logger()

File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/models/__init__.py:4, in <module>
      2 from ._distance import EditDistance
      3 from ._rapidfuzz import RapidFuzz
----> 4 from ._tfidf import TFIDF
      5 from ._utils import cosine_similarity
      7 from polyfuzz.error import NotInstalled

File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/models/_tfidf.py:7, in <module>
      4 from typing import List, Tuple
      5 from sklearn.feature_extraction.text import TfidfVectorizer
----> 7 from ._utils import cosine_similarity
      8 from ._base import BaseMatcher
     11 class TFIDF(BaseMatcher):

File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/models/_utils.py:9, in <module>
      6 from sklearn.metrics.pairwise import cosine_similarity as scikit_cosine_similarity
      8 try:
----> 9     from sparse_dot_topn import awesome_cossim_topn
     10     _HAVE_SPARSE_DOT = True
     11 except ImportError:

File /opt/conda/envs/vespid/lib/python3.9/site-packages/sparse_dot_topn/__init__.py:5, in <module>
      2 import sys
      4 if sys.version_info[0] >= 3:
----> 5     from sparse_dot_topn.awesome_cossim_topn import awesome_cossim_topn
      6 else:
      7     from awesome_cossim_topn import awesome_cossim_topn

File /opt/conda/envs/vespid/lib/python3.9/site-packages/sparse_dot_topn/awesome_cossim_topn.py:7, in <module>
      4 from scipy.sparse import isspmatrix_csr
      6 if sys.version_info[0] >= 3:
----> 7     from sparse_dot_topn import sparse_dot_topn as ct
      8     from sparse_dot_topn import sparse_dot_topn_threaded as ct_thread
      9 else:

File /opt/conda/envs/vespid/lib/python3.9/site-packages/sparse_dot_topn/sparse_dot_topn.pyx:1, in init sparse_dot_topn.sparse_dot_topn()

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
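This ValueError is the classic symptom of a compiled extension (here, sparse_dot_topn's Cython module) that was built against a different numpy ABI than the numpy installed at runtime. The try/except guard visible in polyfuzz/models/_utils.py above suggests a simple way to narrow down which compiled dependency is the culprit; a minimal stdlib-only sketch of that probing pattern (the module names passed in are illustrative, not prescribed by PolyFuzz):

```python
import importlib

# Minimal sketch (not part of PolyFuzz): try importing modules one by one,
# mirroring the try/except guard in polyfuzz/models/_utils.py. An ABI
# mismatch between a compiled extension and the installed numpy typically
# surfaces as the "numpy.ndarray size changed" ValueError seen above.
def probe(module_names):
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = "ok"
        except Exception as exc:  # ValueError on ABI mismatch, ImportError if absent
            results[name] = type(exc).__name__
    return results

# Illustrative names only; substitute the packages from the traceback,
# e.g. "sparse_dot_topn", "hdbscan", "polyfuzz".
print(probe(["json", "no_such_module_xyz"]))
```

Running this against each package in the traceback chain pinpoints the first module whose binary was built against a mismatched numpy.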

Following some StackOverflow posts, I tried installing different versions of numpy, but in the end something always breaks, and now I can no longer use PolyFuzz no matter what I do. It would be great if it worked with the latest version of numpy, or if at least one pinned version worked reliably! Thanks for looking into this.

wdchild commented 2 years ago

I eventually got this working by reinstalling hdbscan! Very strange.

MaartenGr commented 2 years ago

I eventually got this working by reinstalling hdbscan! Very strange.

Glad to hear that it worked out! This used to be an issue with HDBSCAN versions below 0.8.28, which did not build against oldest-supported-numpy to guarantee ABI compatibility. Keeping HDBSCAN on its newest version will also prevent this in future environments.
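To verify that an environment has picked up new-enough builds of the packages involved, a quick stdlib-only sketch using importlib.metadata (Python 3.8+; the package names are taken from the traceback above and are otherwise assumptions):

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical helper (not part of PolyFuzz): report installed versions of
# the packages involved in this ABI clash, so an outdated hdbscan (or a
# numpy newer than the installed wheels were built for) is easy to spot.
# Returns None for packages that are not installed.
def installed_versions(packages):
    found = {}
    for pkg in packages:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found

print(installed_versions(["numpy", "hdbscan", "sparse-dot-topn", "polyfuzz"]))
```

If hdbscan reports a version below 0.8.28, upgrading it (and letting pip rebuild or refetch its wheel) should resolve the mismatch, matching what worked above.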