gandersen101 / spaczz

Fuzzy matching and more functionality for spaCy.
MIT License

IndexError: [E201] Span index out of range. #40

Closed aravind-chilakamarri closed 3 years ago

aravind-chilakamarri commented 3 years ago

The fuzzy matcher is unable to process matcher(doc). It was working 48 hours ago; it's not working now.

    File "xxxxxxxxxxxxxxxxxxxxxxxxxx", line 46, in pattern_matcher
      matched_by_fuzzy_phrase = matcher_fuzzy(doc)
    File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/matcher/fuzzymatcher.py", line 105, in __call__
      matches_wo_label = self.match(doc, pattern, **kwargs)
    File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 216, in match
      matches_w_nones = [
    File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 217, in <listcomp>
      self._adjust_left_right_positions(
    File "/home/aravind/nlu_endpoint/NLUSQL_ENV3/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 326, in _adjust_left_right_positions
      r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)
    File "span.pyx", line 503, in spacy.tokens.span.Span.text.__get__
    File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__
    IndexError: [E201] Span index out of range.

gandersen101 commented 3 years ago

Hi @ArvndSpidy, sorry you have run into this error. I could use some more information from you to better diagnose this problem.

  1. Could you provide the code snippet, doc text, and pattern(s) that are producing this error?
  2. Are you using the PyPI package of spaczz v0.3.1?
  3. Could you provide your virtualenv details (i.e. Python/package versions)?

Nothing has changed in the version of spaczz available on PyPI for months, so something likely changed in the inputs you are using and/or in the other packages in your environment.

It is very possible there is a bug in spaczz, but it is hard for me to diagnose without the additional details listed above.

Thanks!

gandersen101 commented 3 years ago

Hi @ArvndSpidy if I don't hear back from you in the next few days I'm going to go ahead and close this issue. If it pops up again please feel free to open another issue and provide the details I asked for above. Thanks!

brunobg commented 3 years ago

Not OP, but I can reproduce this. Here's the info.

Snippet: the same code from the tutorial, but I'm using the pt model. I don't know if that makes a difference.

    import spacy
    from spaczz.pipeline import SpaczzRuler

    nlp = spacy.load("pt_core_news_sm")
    spaczzRuler = SpaczzRuler(nlp)
    spaczzRuler.add_patterns(fuzzypatterns)
    nlp.add_pipe(spaczzRuler, before='ner')
    nlp.to_disk(SPACY_PATH)

where fuzzypatterns is a list built of entries like this:

    fuzzypatterns.append({
        "label": mylabel,  # string, using 2 labels here
        "pattern": a,      # the word
        "type": "fuzzy",
        "id": a
    })

Words contain spaces and non-ascii characters, 1071 items, but see more below.
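
Pulling the pieces above together, a self-contained sketch of how such a pattern list is built (the word list and label here are illustrative stand-ins, not the reporter's actual data):

```python
# Illustrative sketch of building a spaczz-style fuzzy pattern list.
# The words and label are made up; the real patterns came from a data set
# of 1071 items containing spaces and non-ASCII characters.
words = ["Huxelrebe", "Courtillier Musqué"]  # may contain spaces / non-ASCII
mylabel = "WGRP"

fuzzypatterns = []
for a in words:
    fuzzypatterns.append({
        "label": mylabel,   # entity label assigned on a match
        "pattern": a,       # the word or phrase to fuzzy-match
        "type": "fuzzy",    # tells the SpaczzRuler to use the fuzzy matcher
        "id": a,            # stable identifier for the matched entity
    })
```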

Versions:

Successfully installed rapidfuzz-0.14.2 spaczz-0.3.1

Python 3.8.6 spacy==2.3.5 spacy-lookups-data==0.3.2

This is linked to the contents of the patterns and the parsed phrase. I was able to isolate a combination that reproduces the issue. A simple test with these patterns fails:

    [{'label': 'WGRP', 'pattern': 'Huxelrebe', 'type': 'fuzzy', 'id': 'Huxelrebe'},
     {'label': 'WGRP', 'pattern': 'Courtillier Musqué', 'type': 'fuzzy', 'id': 'Courtillier Musqué'}]

when parsing this phrase: trabalho, investimento e escolhas corajosas,.

Here's the stack dump:

Traceback (most recent call last):
  File "xxxx/tokenizer.py", line 17, in __init__
    self._doc = nlp(text)
  File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 159, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
  File "/usr/lib64/python3.8/site-packages/spaczz/matcher/fuzzymatcher.py", line 105, in __call__
    matches_wo_label = self.match(doc, pattern, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 216, in match
    matches_w_nones = [
  File "/usr/lib64/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 217, in <listcomp>
    self._adjust_left_right_positions(
  File "/usr/lib64/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 326, in _adjust_left_right_positions
    r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)
  File "span.pyx", line 503, in spacy.tokens.span.Span.text.__get__
  File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__
IndexError: [E201] Span index out of range.

Thanks for the great work. Looking forward to a patch.

brunobg commented 3 years ago

Wild guess, I think the bug is here: https://github.com/gandersen101/spaczz/blob/v0.3.1/src/spaczz/fuzz/fuzzysearcher.py#L319. I think it should be < instead of <=.

I notice master is quite different from 0.3.1. Are you releasing a new version soon? Is master stable?

brunobg commented 3 years ago

Actually, debugging it here, the problem happens when bp_l == bp_r, not when it's out of bounds.
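
The boundary condition can be illustrated with a pure-Python stand-in (no spaCy required; the function name is made up for illustration): when the two pointers meet, doc[bp_l:bp_r] is an empty span, and per the traceback above, .text on an empty spaCy Span raises IndexError [E201].

```python
# Pure-Python stand-in for the failing boundary, using a plain list of
# tokens in place of a spaCy Doc. In spaczz 0.3.1,
# _adjust_left_right_positions evaluates doc[bp_l:bp_r].text; when
# bp_l == bp_r the span is empty and Span.text raises IndexError [E201].
# The guard below bails out before the comparison is attempted.

def adjust_positions_guarded(tokens, bp_l, bp_r):
    """Return the candidate token slice, or None when the pointers have met."""
    if bp_l == bp_r:  # empty span: .text on a real spaCy Span would raise
        return None
    return tokens[bp_l:bp_r]
```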

Suggestion for a fix:

    if bp_l == bp_r:
        return None
    r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)

gandersen101 commented 3 years ago

Hi @brunobg, thank you for putting in the work to track this bug down!

I will dig into this more and hopefully have a fix ready soon.

To address your other questions, I am hoping to have a new release ready in the next couple days (adding a token based matcher) and I will hopefully have this bug addressed with that release as well.

The master should always be stable, but now that I am getting more requests for spaczz, I should probably create a dev branch for accumulating future release changes.

gandersen101 commented 3 years ago

Hi again @brunobg. Pull #43 addresses this bug and it will be part of spaczz's 0.4.0 release, which I hope to have ready by the end of my day tomorrow. I will close this issue with the aforementioned release.

Your proposed fix to this bug was a viable solution so thank you again! I extended your fix a little (probably overkill) because I'm not entirely sure where in the optimization bp_l can equal bp_r, which I thought I had accounted for. I implemented your fix as:

    if bp_l >= bp_r or bp_r <= bp_l:
        return None
    else:
        r = self.compare(query, doc[bp_l:bp_r], *args, **kwargs)
        if r >= min_r2:
            return (bp_l, bp_r, r)
        else:
            return None

Best.

brunobg commented 3 years ago

Thanks! I was glad to help. I can't close this since I'm not OP. I'll test your release as soon as it comes out.

brunobg commented 3 years ago

Actually, this is happening on master too:

  File "/home/corollarium/git/Corollarium/vinarium/bebum/bebum/tokenizer.py", line 20, in __init__
    self._doc = nlp(text)
  File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
  File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 130, in match
    match_values = self._scan(doc, query, min_r1, *args, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 256, in _scan
    match = self.compare(query, doc[i : i + len(query)], *args, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/fuzzysearcher.py", line 105, in compare
    b_text = b.text.lower()
  File "span.pyx", line 503, in spacy.tokens.span.Span.text.__get__
  File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__
IndexError: [E201] Span index out of range.

brunobg commented 3 years ago

Fix in spaczz/search/_phrasesearcher.py, line 256:

     match_values: Dict[int, int] = dict()
     i = 0
+    if not len(query):
+        return None
     while i + len(query) <= len(doc):

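
The failure mode this diff guards against can be sketched with plain Python lists standing in for the Doc and query (the function name is illustrative): with an empty query, the loop condition i + len(query) <= len(doc) holds for every i, so each iteration slices an empty span, whose .text raises on a real spaCy Doc.

```python
# Plain-Python sketch of the _scan loop's empty-query failure mode.
# Lists stand in for spaCy Doc/Span objects; scan_guarded is a made-up name.

def scan_guarded(doc_tokens, query_tokens):
    """Return start indices of query-length windows, or None for an empty query."""
    if not len(query_tokens):  # mirrors the diff: every window would be empty
        return None
    starts = []
    i = 0
    while i + len(query_tokens) <= len(doc_tokens):
        starts.append(i)  # the real code fuzzy-compares the window here
        i += 1
    return starts
```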
gandersen101 commented 3 years ago

Hey @brunobg, looks like the above is what happens with an empty string. Easy fix. Thanks for catching this.

gandersen101 commented 3 years ago

This is closed by spaczz v0.4.0. Hopefully you all enjoy it. Please raise an issue if you run into any more bugs!