Top n similar tokens APIs

alexcoca commented 4 years ago

Feature description

Hi Guys,

A few issues have discussed the problem of retrieving synonyms (e.g., #276, #1561, #1018). The solution presented in #276:

>>> def most_similar(word):
...   queries = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
...   by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
...   return by_similarity[:10]

is quite a bit slower than I'd like for my use case.

An alternative that was proposed (in #276) is to retrieve the vectors using something like:

tvec = nlp("king")[0].vector
ms = nlp.vocab.vectors.most_similar(tvec.reshape(1,tvec.shape[0]))

However, this seems very limited as it only returns the most similar word by cosine similarity. Ideally I would be able to specify something like

ms = nlp.vocab.vectors.most_similar(tvec.reshape(1,tvec.shape[0]), n=500)

and receive a list of results, ideally each with a POS tag and the surface form of the word. I need to select the top_n most similar words with the same POS tag, so I need the actual words, not their vector representations. Is there an elegant (and, most importantly, efficient) way to achieve this in spaCy, or am I stuck with the basic version outlined at the start of this post?

Many thanks.

adrianeboyd commented 4 years ago

It sounds like you're looking for something like sense2vec: https://explosion.ai/blog/sense2vec-reloaded

That said, support for more than one most similar vector with most_similar(n=10) was added in v2.2.1. Use v2.2.3 because there were some minor bugfixes afterwards.

You can get a list of the n most similar vectors:

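import numpy
# Query with the stored vector for "king"; most_similar returns (keys, best_rows, scores)
ms = nlp.vocab.vectors.most_similar(numpy.asarray([nlp.vocab.vectors[nlp.vocab.strings['king']]]), n=10)
[nlp.vocab.strings[w] for w in ms[0][0]]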

With en_core_web_md the output is:

['king', 'KIng', 'kings', 'SULTANS', 'COMMONER', 'prince', 'queen', 'QUEEN', 'PRETENDER', 'Throne']
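
For intuition, the kind of ranking a top-n query performs can be sketched in plain Python as cosine similarity over a small table of vectors. This is only an illustration, not spaCy's implementation: the helper names `cosine` and `top_n_similar` and the 3-dimensional toy vectors are made up for the example.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_similar(query, table, n=10):
    """Return the n (key, score) pairs from `table` most similar to `query`."""
    scored = ((key, cosine(query, vec)) for key, vec in table.items())
    # nlargest avoids sorting the whole table when only n results are needed
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy vectors invented for this example (not real embeddings).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "prince": [0.9, 0.6, 0.2],
    "apple": [0.1, 0.0, 0.9],
}

for key, score in top_n_similar(toy_vectors["king"], toy_vectors, n=3):
    print(key, round(score, 3))
```

As in the real output above, the query word itself comes back first, so callers who want only neighbours would typically request n + 1 results and drop the first one.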

The default vectors don't have any POS information associated with them (or any notion of POS when they're calculated), so filtering by POS is not really possible. You can approximate it a bit, but it's not going to be very good. For many words, the POS varies by context, so tagging individual words from the most similar results isn't going to work that well.

Be aware that running nlp() on single words to get the POS may not work very well, see: https://github.com/explosion/spaCy/issues/3052#issuecomment-545749206 (v2.2 models work better on noisier data, but at the expense of not being able to tell proper nouns from common nouns without context.)
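
If an approximate POS filter is still wanted, one possible shape for it is sketched below. The function `filter_by_pos` and its parameters are hypothetical, not part of spaCy; the tagger is passed in as a plain callable so the sketch runs on its own, and the `fake_tags` dictionary is a stand-in invented for the example.

```python
def filter_by_pos(candidates, tag_word, wanted_pos, top_n=10):
    """Keep up to `top_n` candidates whose out-of-context tag equals `wanted_pos`.

    `candidates` must already be ordered from most to least similar.
    `tag_word` is any callable mapping a word to a coarse POS string.
    """
    kept = []
    for word in candidates:
        if len(kept) >= top_n:
            break
        if tag_word(word) == wanted_pos:
            kept.append(word)
    return kept


# Stand-in tagger for the example. With spaCy one might instead pass
# lambda w: nlp(w)[0].pos_ (unreliable for single words, as noted above).
fake_tags = {"king": "NOUN", "kings": "NOUN", "reign": "VERB", "queen": "NOUN"}

print(filter_by_pos(["king", "kings", "reign", "queen"], fake_tags.get, "NOUN", top_n=3))
```

Because the tag is assigned without context, the result is only as good as the tagger's guess for an isolated word, so it helps to request more candidates than needed before filtering.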

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

Top n similar tokens APIs #4741
