Hi @ArvndSpidy, sorry you have run into this error. I could use some more information from you to better diagnose this problem.
Nothing has changed in the version of spaczz available on PyPI for months, so something likely changed in the inputs you are using and/or in the other packages in your environment.
It is very possible there is a bug in spaczz but it is hard for me to diagnose it without those additional details listed above.
Thanks!
Hi @ArvndSpidy if I don't hear back from you in the next few days I'm going to go ahead and close this issue. If it pops up again please feel free to open another issue and provide the details I asked for above. Thanks!
Not OP, but I can reproduce this. Here's the info.
Snippet: the same code from the tutorial, but I'm using the pt model. I don't know if that makes a difference.
nlp = spacy.load("pt_core_news_sm")
spaczzRuler = SpaczzRuler(nlp)
spaczzRuler.add_patterns(fuzzypatterns)
nlp.add_pipe(spaczzRuler, before='ner')
nlp.to_disk(SPACY_PATH)
where fuzzypatterns is a list built like this:
fuzzypatterns.append({
    "label": mylabel,  # string, using 2 labels here
    "pattern": a,      # the word
    "type": "fuzzy",
    "id": a
})
The words contain spaces and non-ASCII characters; the list has 1071 items, but see more below.
Versions:
Successfully installed rapidfuzz-0.14.2 spaczz-0.3.1
Python 3.8.6 spacy==2.3.5 spacy-lookups-data==0.3.2
This is linked to the contents of the patterns and the parsed phrase. I was able to isolate a combination that reproduces the issue. A simple test with these patterns fails:
[{'label': 'WGRP', 'pattern': 'Huxelrebe', 'type': 'fuzzy', 'id': 'Huxelrebe'},
{'label': 'WGRP', 'pattern': 'Courtillier Musqué', 'type': 'fuzzy', 'id': 'Courtillier Musqué'}]
when parsing this phrase: "trabalho, investimento e escolhas corajosas,".
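Putting those pieces together, a minimal script along these lines should trigger it (same setup quoted above, spaczz 0.3.1 / spaCy 2.3.5; nothing here beyond what is already in this thread):

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.load("pt_core_news_sm")
ruler = SpaczzRuler(nlp)
ruler.add_patterns([
    {"label": "WGRP", "pattern": "Huxelrebe", "type": "fuzzy", "id": "Huxelrebe"},
    {"label": "WGRP", "pattern": "Courtillier Musqué", "type": "fuzzy", "id": "Courtillier Musqué"},
])
nlp.add_pipe(ruler, before="ner")  # spaCy v2 add_pipe, as in the snippet above

nlp("trabalho, investimento e escolhas corajosas,")  # raises IndexError: [E201] Span index out of range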
Here's the stack dump:
Traceback (most recent call last):
File "xxxx/tokenizer.py", line 17, in __init__
self._doc = nlp(text)
File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 159, in __call__
for fuzzy_match in self.fuzzy_matcher(doc):
File "/usr/lib64/python3.8/site-packages/spaczz/matcher/fuzzymatcher.py", line 105, in __call__
matches_wo_label = self.match(doc, pattern, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 216, in match
matches_w_nones = [
File "/usr/lib64/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 217, in <listcomp>
self._adjust_left_right_positions(
File "/usr/lib64/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 326, in _adjust_left_right_positions
r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)
File "span.pyx", line 503, in spacy.tokens.span.Span.text.__get__
File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__
IndexError: [E201] Span index out of range.
Thanks for the great work. Looking forward to a patch.
Wild guess, but I think the bug is here: https://github.com/gandersen101/spaczz/blob/v0.3.1/src/spaczz/fuzz/fuzzysearcher.py#L319. I think it should be < instead of <=.
I notice master is quite different from 0.3.1. Are you releasing a new version soon? Is master stable?
Actually, debugging it here, the problem happens when bp_l == bp_r, not when it's out of bounds.
Suggestion for a fix:
if bp_l == bp_r:
    return None
r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)
Hi @brunobg, thank you for putting in the work to track this bug down!
I will dig into this more and hopefully have a fix ready soon.
To address your other questions, I am hoping to have a new release ready in the next couple days (adding a token based matcher) and I will hopefully have this bug addressed with that release as well.
The master should always be stable, but now that I am getting more requests for spaczz, I should probably create a dev branch for accumulating future release changes.
Hi again @brunobg. Pull #43 addresses this bug and it will be part of spaczz's 0.4.0 release, which I hope to have ready by the end of my day tomorrow. I will close this issue with the aforementioned release.
Your proposed fix to this bug was a viable solution so thank you again! I extended your fix a little (probably overkill) because I'm not entirely sure where in the optimization bp_l can equal bp_r, which I thought I had accounted for. I implemented your fix as:
if bp_l >= bp_r or bp_r <= bp_l:
    return None
else:
    r = self.compare(query, doc[bp_l:bp_r], *args, **kwargs)
    if r >= min_r2:
        return (bp_l, bp_r, r)
    else:
        return None
Best.
Thanks! I was glad to help. I can't close this since I'm not OP. I'll test your release as soon as it comes out.
Actually, this is happening on master too:
File "/home/corollarium/git/Corollarium/vinarium/bebum/bebum/tokenizer.py", line 20, in __init__
self._doc = nlp(text)
File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
for fuzzy_match in self.fuzzy_matcher(doc):
File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 130, in match
match_values = self._scan(doc, query, min_r1, *args, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 256, in _scan
match = self.compare(query, doc[i : i + len(query)], *args, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/fuzzysearcher.py", line 105, in compare
b_text = b.text.lower()
File "span.pyx", line 503, in spacy.tokens.span.Span.text.__get__
File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__
IndexError: [E201] Span index out of range.
Fix on spaczz/search/_phrasesearcher.py, line 256:
  match_values: Dict[int, int] = dict()
  i = 0
+ if not len(query):
+     return None
  while i + len(query) <= len(doc):
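To make that concrete, here is a rough standalone sketch of the same guard in a simplified scan loop. It only illustrates where the early return goes; it is not spaczz's actual code, and the names (scan_for_matches, compare) are made up:

from typing import Callable, Dict, Optional

def scan_for_matches(doc, query, compare: Callable[[str, str], int]) -> Optional[Dict[int, int]]:
    # Slide `query` (a spaCy Doc) over `doc`, scoring each window with `compare`.
    # Guard against an empty query first: otherwise every window is a
    # zero-length Span, and Span.text raises IndexError [E201] on spaCy 2.3.x.
    if not len(query):
        return None
    match_values: Dict[int, int] = dict()
    i = 0
    while i + len(query) <= len(doc):
        ratio = compare(query.text, doc[i : i + len(query)].text)
        if ratio:
            match_values[i] = ratio
        i += 1
    return match_values if match_values else None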
Hey @brunobg, looks like the above is what happens with an empty string. Easy fix. Thanks for catching this.
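For reference, the spaCy behaviour underneath can be shown directly, at least on spaCy 2.3.x as used in this thread: taking .text of a zero-length Span is what raises the error.

import spacy

nlp = spacy.load("pt_core_news_sm")  # any model; this is the one from the thread
doc = nlp("trabalho, investimento e escolhas corajosas,")
empty_span = doc[2:2]  # zero-length Span, same shape the matcher produces for an empty pattern
empty_span.text        # raises IndexError: [E201] Span index out of range.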
This is closed by spaczz v0.4.0. Hopefully you all enjoy it. Please raise an issue if you run into any more bugs!
The fuzzy matcher is unable to process matcher(doc). It was working 48 hours ago; it's not working now.
File "xxxxxxxxxxxxxxxxxxxxxxxxxx", line 46, in pattern_matcher matched_by_fuzzy_phrase = matcher_fuzzy(doc) File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/matcher/fuzzymatcher.py", line 105, in call matches_wo_label = self.match(doc, pattern, **kwargs) File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 216, in match matches_w_nones = [ File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 217, in
self._adjust_left_right_positions(
File "/home/aravind/nlu_endpoint/NLUSQL_ENV3/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 326, in _adjust_left_right_positions
r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)
File "span.pyx", line 503, in spacy.tokens.span.Span.text.get
File "span.pyx", line 190, in spacy.tokens.span.Span.getitem
IndexError: [E201] Span index out of range.