gandersen101 / spaczz

Fuzzy matching and more functionality for spaCy.
MIT License

Possible infinite loop #44

Closed brunobg closed 3 years ago

brunobg commented 3 years ago

When I run my tests with spaczz@master, they seem to get into an infinite loop at the nlp() call. Stack dumps:

  File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
  File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
    matches_w_nones = [
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
    self._optimize(
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 217, in _optimize
    r = self.compare(query, doc[bp_l:bp_r], *args, **kwargs)
  File "doc.pyx", line 308, in spacy.tokens.doc.Doc.__getitem__
  File "/usr/lib64/python3.8/site-packages/spacy/util.py", line 491, in normalize_slice
    if not (step is None or step == 1):

Another Ctrl-C during a different run:

    self._doc = nlp(text)
  File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
  File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
    matches_w_nones = [
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
    self._optimize(
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 205, in _optimize
    rl = self.compare(query, doc[p_l : p_r - f], *args, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/fuzzysearcher.py", line 109, in compare
    return round(self._fuzzy_funcs.get(fuzzy_func)(a_text, b_text))

gandersen101 commented 3 years ago

Hi @brunobg, this is concerning but hard to diagnose with the information at hand. If there is any way you could pinpoint which pattern/doc combinations are causing this, that would be extremely helpful. Spaczz has good test coverage and I have used it on the job on medical texts, but new issues will always come up as people apply spaczz in new settings.

One thing to keep in mind is that spaczz can be extremely slow given a large enough pattern list and document(s). I explain why this is, and why it is beyond my capabilities to significantly speed up spaczz in the short term, in issue #20. I'm not saying that is what is happening here, but keep it in mind as well.
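
One way to pinpoint the slow pattern/doc combination asked about above is to time each pattern separately against the same document. The following is only a minimal sketch: `time_each_pattern`, `run_one`, and `fake_match` are hypothetical names, and `fake_match` is a stand-in that sleeps instead of calling spaczz. In a real test, `run_one` would build a matcher holding only that one pattern and run it on the problem text.

```python
import time


def time_each_pattern(patterns, run_one):
    """Time run_one(pattern) for every pattern.

    Returns a list of (pattern, seconds) pairs, slowest first.
    """
    timings = []
    for pattern in patterns:
        start = time.perf_counter()
        run_one(pattern)
        timings.append((pattern, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)


def fake_match(pattern):
    # Hypothetical stand-in for "match this single pattern against the doc".
    # It just sleeps longer for longer patterns so the ranking is visible.
    time.sleep(0.02 * len(pattern))


ranked = time_each_pattern(["a", "abcde", "abc"], fake_match)
for pattern, seconds in ranked:
    print(f"{seconds:8.3f}s  {pattern!r}")
```

Note that this only helps when every call eventually returns. A call that truly never terminates would block the loop at that pattern, which in itself identifies the culprit.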

brunobg commented 3 years ago

This happens only in one specific test, so I can probably isolate the pattern like I did before. It has been "fast enough" on every other test, which is why I think it's an infinite loop. Other tests take milliseconds; this one is still going after 10 seconds. Speed is not an issue for me within reasonable times.

I read #20 and it makes sense to me (though running it through a profiler would help to pinpoint exactly where it takes too long).
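
For reference, the profiler run mentioned above could look roughly like this with the standard library's `cProfile`. This is a sketch under assumptions: `profile_call` is a hypothetical helper, and `slow_stand_in` is a placeholder for the slow `nlp(text)` call, not spaczz code.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report_text).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_stand_in(n):
    # Hypothetical placeholder for the slow call, e.g. nlp(text).
    return sum(i * i for i in range(n))


result, report = profile_call(slow_stand_in, 100_000)
print(report)
```

Sorting by cumulative time puts the outer calls first, so the report would show whether the time goes into the search loop itself or into the comparison function it calls.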

brunobg commented 3 years ago

Closing this. You're right, it just takes a long time (~100 times longer than spaCy NER).