alvations / pywsd

Python Implementations of Word Sense Disambiguation (WSD) Technologies.
MIT License

IndexError when using disambiguate() with maxsim algorithm #59

Open kientuongnguyen opened 5 years ago

kientuongnguyen commented 5 years ago

I'm using Google Colab.

s = "would sentiment" disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)

The same error occurs with "may sentiment", "might sentiment", "must sentiment", ...


IndexError                                Traceback (most recent call last)
in ()
      1 s = "would sentiment"
----> 2 disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)

/usr/local/lib/python3.6/dist-packages/pywsd/allwords_wsd.py in disambiguate(sentence, algorithm, context_is_lemmatized, similarity_option, keepLemmas, prefersNone, from_cache, tokenizer)
     43             synset = algorithm(lemma_sentence, lemma, from_cache=from_cache)
     44         elif algorithm == max_similarity:
---> 45             synset = algorithm(lemma_sentence, lemma, pos=pos, option=similarity_option)
     46         else:
     47             synset = algorithm(lemma_sentence, lemma, pos=pos, context_is_lemmatized=True,

/usr/local/lib/python3.6/dist-packages/pywsd/similarity.py in max_similarity(context_sentence, ambiguous_word, option, lemma, context_is_lemmatized, pos, best)
    125         result = sorted([(v,k) for k,v in result.items()],reverse=True)
    126
--> 127     return result[0][1] if best else result

IndexError: list index out of range
BigBossAnwer commented 4 years ago

Getting the same error with a similar kind of usage on Python 3.8, pywsd 1.2.4. For example:

disambiguate('Neither was there a qualified majority within this House to revert to Article 272.', max_similarity, similarity_option='path')

Gives an index-out-of-bounds error in pywsd.similarity.max_similarity().
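
The failure mechanics are easy to reproduce outside pywsd: when the pos handed to max_similarity matches none of the word's synsets, the result dict never gains an entry, so result[0][1] indexes an empty list. A minimal sketch (the pos='v' value is an assumption about what disambiguate passes down for these sentences):

from nltk.corpus import wordnet as wn

ambiguous_word, pos = 'sentiment', 'v'      # assumed: the tagger mislabels the noun as a verb
print(wn.synsets(ambiguous_word))           # non-empty, so the "not in WordNet" guard passes
print(wn.synsets(ambiguous_word, pos=pos))  # [] -> the scoring loop never runs

result = {i: 0 for i in wn.synsets(ambiguous_word, pos=pos)}       # stays empty
result = sorted([(v, k) for k, v in result.items()], reverse=True)
result[0][1]  # IndexError: list index out of range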

Scotch-tape patch with:

# Drop-in replacement for max_similarity() in pywsd/similarity.py
def max_similarity(context_sentence: str, ambiguous_word: str, option="path",
                   lemma=True, context_is_lemmatized=False, pos=None, best=True) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous words (see https://ibin.co/4gG9zUlejUUA.png):
    {argmax}_{synset(a)}(\sum_{i}^{n}{{max}_{synset(i)}(sim(i,a))})

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns a dict.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # If ambiguous word not in WordNet return None
    if not wn.synsets(ambiguous_word):
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]
    result = {}
    for i in wn.synsets(ambiguous_word, pos=pos):
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i,k,option))
            result[i] += max(_result)

    if option in ["res","resnik"]: # lower score = more similar
        result = sorted([(v,k) for k,v in result.items()])
    else: # higher score = more similar
        result = sorted([(v,k) for k,v in result.items()],reverse=True)

    if not len(result):  # empty when wn.synsets(ambiguous_word, pos=pos) found nothing
        return None

    return result[0][1] if best else result

in pywsd.similarity, where

    if not len(result):
        return None

is the "fix".

Doesn't really resolve the underlying issue, though.
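
If patching the installed package isn't an option, a caller-side stopgap (a sketch using only pywsd's public API; disambiguate_or_none is a hypothetical helper, and this papers over the bug the same way the patch above does) is to catch the IndexError per call:

from pywsd import disambiguate
from pywsd.similarity import max_similarity

def disambiguate_or_none(sentence: str):
    """Return disambiguate()'s output, or None when max_similarity hits the empty-result bug."""
    try:
        return disambiguate(sentence, algorithm=max_similarity, similarity_option='path')
    except IndexError:  # raised by result[0][1] on an empty result
        return None

print(disambiguate_or_none('would sentiment'))  # None instead of a crash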

tcardlab commented 1 year ago

I was also getting this error. I found that it was because the incorrect pos was being passed to max_similarity.

Why the wrong pos gets passed probably comes down to something in the chain disambiguate > lemmatize_sentence > postagger & lemmatize.
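
A quick way to see the mis-tagging (a sketch using NLTK's tagger directly; disambiguate wires up its own tokenizer and tagger, but the effect is the same):

import nltk

# The modal verb leads the tagger to label the following noun as a verb
# (example output; exact tags depend on the tagger model):
print(nltk.pos_tag(['would', 'sentiment']))  # e.g. [('would', 'MD'), ('sentiment', 'VB')]
# 'VB' maps to WordNet's 'v', and 'sentiment' has no verb synsets -- hence the empty result.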


However, we can still catch a bad pos by checking whether the POS-restricted synset list is empty (falsy) and falling back to an unrestricted lookup if it is. I have done this at the declaration of syn:

from nltk.corpus import wordnet as wn

from pywsd.tokenize import word_tokenize
from pywsd.utils import lemmatize
from pywsd import sim

def max_similarity_fix(context_sentence: str, ambiguous_word: str, option="path",
                   lemma=True, context_is_lemmatized=False, pos=None, best=True, from_cache=False) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous words (see https://ibin.co/4gG9zUlejUUA.png):
    {argmax}_{synset(a)}(\sum_{i}^{n}{{max}_{synset(i)}(sim(i,a))})

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns a dict.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # Fall back to an unrestricted lookup when the tagged pos yields no synsets
    syn = wn.synsets(ambiguous_word, pos=pos) or wn.synsets(ambiguous_word)

    # If ambiguous word not in WordNet return None
    if not syn:
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]

    result = {}
    for i in syn:
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i,k,option))
            result[i] += max(_result)

    if option in ["res","resnik"]: # lower score = more similar
        result = sorted([(v,k) for k,v in result.items()])
    else: # higher score = more similar
        result = sorted([(v,k) for k,v in result.items()],reverse=True)

    return result[0][1] if best else result

You can see this works for "should sentiment. deep-water. co-beneficiary.", each of which would otherwise break it:

from pywsd import disambiguate

sentence = "should sentiment. deep-water. co-beneficiary."
print(disambiguate(sentence, algorithm=max_similarity_fix))


I am uncertain whether using an unspecified pos is a good idea. It may be better to mark these cases with a unique output that you can filter for afterward. @BigBossAnwer has a good method for that, though you may wish to return a different value.
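
For reference, that filtering approach could look like this (a sketch; it assumes disambiguate's usual list of (word, synset) tuples, with None marking unresolved words):

tagged = disambiguate(sentence, algorithm=max_similarity_fix)
resolved = [(word, synset) for word, synset in tagged if synset is not None]
print(resolved)  # only the words that received a real Synset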