Open kientuongnguyen opened 5 years ago
Getting the same error with a similar kind of usage in Python 3.8 with pywsd 1.2.4, for example:

```python
disambiguate('Neither was there a qualified majority within this House to revert to Article 272.',
             max_similarity, similarity_option='path')
```

gives an index-out-of-bounds error in `pywsd.similarity.max_similarity()`.

Scotch-tape patch:
```python
def max_similarity(context_sentence: str, ambiguous_word: str, option="path",
                   lemma=True, context_is_lemmatized=False, pos=None, best=True) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous words (see https://ibin.co/4gG9zUlejUUA.png):
    {argmax}_{synset(a)}(\sum_{i}^{n}{{max}_{synset(i)}(sim(i,a))})

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns a dict.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # If the ambiguous word is not in WordNet, return None.
    if not wn.synsets(ambiguous_word):
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]
    result = {}
    for i in wn.synsets(ambiguous_word, pos=pos):
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i, k, option))
            result[i] += max(_result)
    if option in ["res", "resnik"]:  # lower score = more similar
        result = sorted([(v, k) for k, v in result.items()])
    else:  # higher score = more similar
        result = sorted([(v, k) for k, v in result.items()], reverse=True)
    if not len(result):
        return None
    return result[0][1] if best else result
```
in `pywsd.similarity`, where

```python
    if not len(result):
        return None
```

is the "fix". It doesn't really resolve the underlying issue, though.
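For reference, the guard works because indexing into an empty sorted list is exactly what raises the `IndexError`. A minimal stdlib-only sketch (the `best_of` name is illustrative, not from pywsd):

```python
def best_of(scores: dict):
    """Return the top-scoring key, or None when there are no candidates."""
    ranked = sorted(((v, k) for k, v in scores.items()), reverse=True)
    if not ranked:  # without this guard, ranked[0] raises IndexError
        return None
    return ranked[0][1]

print(best_of({"dog.n.01": 0.9, "dog.n.03": 0.4}))  # dog.n.01
print(best_of({}))                                  # None (empty candidate set)
```

When `pos` filters out every synset of the ambiguous word, `result` in `max_similarity` is exactly such an empty dict, so `result[0][1]` blows up without the guard.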
I was also getting this error. I found that it was because an incorrect `pos` was being passed to `max_similarity`. Why the wrong `pos` is being passed probably has to do with something in the following chain: `disambiguate > lemmatize_sentence > postagger & lemmatize`.

However, we can still catch a bad `pos` by checking whether the synset list is empty (falsy) and falling back to an unspecified `pos` in that case. I have done this at the declaration of `syn`:
```python
from nltk.corpus import wordnet as wn  # needed for wn.synsets
from pywsd.tokenize import word_tokenize
from pywsd.utils import lemmatize
from pywsd import sim

def max_similarity_fix(context_sentence: str, ambiguous_word: str, option="path",
                       lemma=True, context_is_lemmatized=False, pos=None, best=True,
                       from_cache=False) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous words (see https://ibin.co/4gG9zUlejUUA.png):
    {argmax}_{synset(a)}(\sum_{i}^{n}{{max}_{synset(i)}(sim(i,a))})

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns a dict.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # Fall back to an unrestricted lookup when the tagged pos yields no synsets.
    syn = wn.synsets(ambiguous_word, pos=pos) or wn.synsets(ambiguous_word)
    # If the ambiguous word is not in WordNet, return None.
    if not syn:
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]
    result = {}
    for i in syn:
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i, k, option))
            result[i] += max(_result)
    if option in ["res", "resnik"]:  # lower score = more similar
        result = sorted([(v, k) for k, v in result.items()])
    else:  # higher score = more similar
        result = sorted([(v, k) for k, v in result.items()], reverse=True)
    return result[0][1] if best else result
```
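The fallback relies on Python's `or` returning its right operand whenever the left one is falsy, and an empty synset list is falsy. A stdlib-only sketch of the idiom, with `synsets_stub` standing in for `wn.synsets` (the stub and its fake data are illustrative, not pywsd's behavior):

```python
def synsets_stub(word, pos=None):
    """Stand-in for wn.synsets: pretend only the unrestricted lookup finds anything."""
    fake_db = {("bank", None): ["bank.n.01", "bank.v.01"]}
    return fake_db.get((word, pos), [])

# With a bad pos the first lookup returns an empty (falsy) list,
# so `or` falls through to the unrestricted lookup.
syn = synsets_stub("bank", pos="a") or synsets_stub("bank")
print(syn)  # ['bank.n.01', 'bank.v.01']
```

Note that `or` short-circuits, so the second lookup only runs when the `pos`-restricted one comes back empty.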
You can see this works for "should sentiment. deep-water. co-beneficiary.", each of which would otherwise break it:

```python
sentence = "should sentiment. deep-water. co-beneficiary."
print(disambiguate(sentence, algorithm=max_similarity_fix))
```
I am uncertain whether using an unspecified `pos` is a good idea. It may be better to mark these cases with a unique output that you can filter for afterward. @BigBossAnwer has a good method for that, though you may wish to return a different value.
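One way to do that marking, sketched with a hypothetical sentinel rather than anything from pywsd's actual API: instead of silently falling back to an unrestricted lookup, return a distinct marker value that downstream code can filter out.

```python
POS_MISMATCH = object()  # hypothetical sentinel, not part of pywsd

def pick_synsets(synsets_for_pos, synsets_any):
    """Return pos-restricted synsets, or a sentinel when the pos found nothing."""
    if synsets_for_pos:
        return synsets_for_pos
    if synsets_any:
        return POS_MISMATCH  # the word exists, just not under the tagged pos
    return None              # the word is not in WordNet at all

results = [pick_synsets([], ["bank.n.01"]),          # tagged pos missed
           pick_synsets(["run.v.01"], ["run.v.01"])]  # tagged pos matched
filtered = [r for r in results if r is not POS_MISMATCH]
print(filtered)  # [['run.v.01']]
```

Using `object()` as a sentinel keeps the marker distinct from every legitimate return value, including `None`, which here already means "not in WordNet".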
I'm using Google Colab:

```python
s = "would sentiment"
disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)
```

The same happens with "may sentiment", "might sentiment", "must sentiment", ...
IndexError Traceback (most recent call last)