If you have a look at the semantic relatedness produced by this model (http://sense2vec.spacy.io), would these results be sufficient for you?
We don't have this integrated into spaCy yet. But that's the plan. For now you could use the built-in word vectors. The following function is relatively slow. You should probably iterate over the vocab and cache all the results.
>>> def most_similar(word):
... by_similarity = sorted(word.vocab, key=lambda w: word.similarity(w), reverse=True)
... return [w.orth_ for w in by_similarity[:10]]
...
>>> most_similar(nlp.vocab[u'dog'])
[u'dog', u'Dog', u'DOG', u'DoG', u'doG', u'cat', u'Cat', u'CAT', u'dogs', u'Dogs']
>>> most_similar(nlp.vocab[u'scrape'])
[u'scrape', u'Scrape', u'SCRAPE', u'rustle', u'Rustle', u'RUSTLE', u'gouge', u'Gouge', u'GOUGE', u'gnaw']
Looking at these results, it'd be nice to filter out the case variants. We should also exclude rare terms:
>>> def most_similar(word):
... queries = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
... by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
... return by_similarity[:10]
...
>>> [w.lower_ for w in most_similar(nlp.vocab[u'dog'])]
[u'dog', u'cat', u'dogs', u'dachshund', u'pig', u'hamster', u'goat', u'rabbit', u'chimp', u'llama']
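As a rough sketch of the "iterate over the vocab and cache all the results" suggestion above (the dict name and cut-offs here are just illustrative, not from the original comment), you could precompute the neighbours once per word and reuse them:

similar_cache = {}

def most_similar_cached(word):
    # Compute the neighbours once per lexeme and reuse them on later lookups.
    if word.orth not in similar_cache:
        queries = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
        by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
        similar_cache[word.orth] = [w.lower_ for w in by_similarity[:10]]
    return similar_cache[word.orth]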
Finally, you can also consider the Brown cluster as a way to speed up the search:
>>> nlp.vocab[u'dog'].cluster
37
>>> nlp.vocab[u'cat'].cluster
37
>>> nlp.vocab[u'imagination'].cluster
1893
>>> nlp.vocab[u'always'].cluster
15994
>>> nlp.vocab[u'goat'].cluster
57
>>> nlp.vocab[u'pig'].cluster
121
Try restricting the candidates to words whose Brown cluster is within some distance of the word you're looking for. I haven't tried this, but it should work pretty well.
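For illustration, here is one untested way to read that suggestion, treating the integer cluster IDs as a crude notion of distance (exact-match on the cluster is the safer variant); the threshold is arbitrary:

def most_similar_in_cluster(word, max_distance=100):
    # Only score candidates whose Brown cluster ID is close to the query's.
    queries = [w for w in word.vocab
               if w.prob >= -15 and abs(w.cluster - word.cluster) <= max_distance]
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [w.lower_ for w in by_similarity[:10]]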
Thanks for the detailed reply! I think I'll start with the sense2vec stuff.
Hmm, I'm getting so-so results so far. Not sure I can trust this.
>>> [w.lower_ for w in most_similar(nlp.vocab[u'bank'])]
['bank', 'banks', 'banking', 'lender', 'securities', 'corporation', 'uniqlo', 'brb', 'telco', 'btc']
Also, the word "telco" appears as similar to almost every word I try. I'll keep working :-)
You might be better off loading different vectors. The default ones are trained on Wikipedia.
What are you trying to do, more broadly?
Now that I think about it, maybe I should just construct my own vectors; I have plenty of data. Is the process to train using gensim or something similar, and then load the vectors through spaCy?
Very broadly, I'm trying to compute similarity between short strings. Strings are similar if they share substrings, but also if they contain words with similar meanings. So "I'm flying with British Airways" and "I'm soaring with British Airlines" should be similar. In other words, I need to answer the question: does s1 contain a synonym of any of the words in s2?
This is a task we'd like to get better at. If you have plenty of data and can share it with us, we might be able to compute a model for you.
I'd try training vectors, yes. Try making the tokens different slices of the spaCy analysis. For instance, you can learn a vector for British_Airways pretty easily. You can also learn a vector for something like nsubj_fly|ROOT_dobj by preprocessing the data like this:
for ent in doc.ents:
    # Merge each entity into a single token (spaCy 1.x merge API).
    ent.merge(ent.root.tag_, ent.text, ent.root.ent_type_)
for word in doc:
    # Build a composite token from the word plus its first left/right dependency labels.
    left_labels = [w.dep_ for w in word.lefts]
    right_labels = [w.dep_ for w in word.rights]
    label = word.dep_
    text = '{left}_{word}|{label}_{right}'.format(
        left=left_labels[0] if left_labels else '',
        word=word.text,
        label=label,
        right=right_labels[0] if right_labels else '')
The point here is to use your knowledge about what slices of language are going to be semantically significant, and use spaCy's annotations to identify those slices reliably. Then you use your unannotated text to estimate their meaning.
This is an approximation of the recursive neural tensor network idea that Richard Socher developed at Stanford, before leaving to found MetaMind. The difference is that we accept an approximation --- we'll only learn vectors for certain tree fragments. What we buy from this approximation is the ability to use vanilla word2vec, which makes the model super scalable.
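To make that pipeline concrete, here is a rough, untested sketch of the whole loop: merge entities, build the composite tokens, write one transformed document per line, train vanilla word2vec on the file with gensim, and load the vectors back into spaCy. The helper name transform_doc, the file name, and raw_texts are placeholders; the keyword names follow gensim 4.x; and the merging uses the newer doc.retokenize() API rather than the 1.x merge call shown above.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def transform_doc(doc):
    # Merge named entities into single tokens.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = []
    for word in doc:
        left_labels = [w.dep_ for w in word.lefts]
        right_labels = [w.dep_ for w in word.rights]
        tokens.append('{left}_{word}|{label}_{right}'.format(
            left=left_labels[0] if left_labels else '',
            word=word.text.replace(' ', '_'),   # e.g. British_Airways
            label=word.dep_,
            right=right_labels[0] if right_labels else ''))
    return ' '.join(tokens)

# Write one transformed document per line for word2vec.
with open('preprocessed.txt', 'w') as f:
    for doc in nlp.pipe(raw_texts):             # raw_texts: your own iterable of strings
        f.write(transform_doc(doc) + '\n')

# Train vanilla word2vec on the composite tokens, then load them into spaCy.
# If the vocab already has vectors of a different width, start from spacy.blank() instead.
w2v = Word2Vec(LineSentence('preprocessed.txt'), vector_size=300, min_count=5)
for key in w2v.wv.index_to_key:
    nlp.vocab.set_vector(key, w2v.wv[key])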
Interesting, thank you! I'm afraid I can't share our data, or open-source it :-(.
As for the broad problem I'm trying to solve, I've been recommended a C library called SimString. But that would of course require the thesaurus as one of the inputs... Smart learning of vectors for our data is something I've been wanting to do for a long time now; maybe this will give me the extra push!
Send me an email? matt@spacy.io We can probably provide a bit more detailed help. We always want to know more about what's working well in production contexts for people.
I would also be interested in this, and I can give you a very simple use case. I am working on generating content for a Moodle site (https://moodle.org/), which will include a quiz. Moodle lets you create "cloze" questions in the quiz: https://docs.moodle.org/22/en/Embedded_Answers_(Cloze)_question_type#Detailed_syntax_explanations
example 1: {1:SHORTANSWER:=Berlin} is the capital of Germany.
example 2: Match the following cities with the correct state:
Given this ability, I'm using spaCy to parse some text and then automatically create the cloze questions by identifying NOUNs and replacing them with the cloze code. In the above example, a good synonym finder would allow good multiple-choice questions to be made.
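For illustration, a minimal sketch of that idea (the model name and helper are my assumptions, not part of the comment): blank out the first noun in a sentence and wrap it in Moodle's cloze syntax.

import spacy

nlp = spacy.load('en_core_web_sm')

def make_cloze(sentence):
    doc = nlp(sentence)
    nouns = [t for t in doc if t.pos_ in ('NOUN', 'PROPN')]
    if not nouns:
        return sentence
    target = nouns[0]
    # Replace only the first occurrence of the chosen noun with the cloze code.
    return sentence.replace(target.text, '{1:SHORTANSWER:=%s}' % target.text, 1)

print(make_cloze('Berlin is the capital of Germany.'))
# {1:SHORTANSWER:=Berlin} is the capital of Germany.

A synonym finder would then supply the wrong options for a multiple-choice version of the same blank.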
I'm running nlp.vocab[u'dog'].cluster, but every word I try returns a cluster of zero. Am I not loading the vocab when I install?
I'm also getting all zeros
@pradipcyb and @sldi42, you need to download the large models to get the vectors, per the first important note on this page: https://spacy.io/usage/vectors-similarity
>>> spacy.load('en_core_web_lg').vocab['dog'].cluster
37
>>> spacy.load('en').vocab['dog'].cluster
0
Hi, according to #1561 we should be able to use nlp.vocab.vectors.most_similar with _md and _lg models. However, the following code raises AxisError: axis 1 is out of bounds for array of dimension 1:
import spacy
nlp = spacy.load('es_core_news_md')
vector1 = nlp(u"Fruta").vector
result = nlp.vocab.vectors.most_similar(vector1)
Note: nlp(u"Fruta")
does show a (50,) vector.
Does anyone have any ideas of why?
Hello,
This works really well! Is there a way to speed up the process?
Thanks,
rjs
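Not from this thread, but one common way to speed this kind of lookup up is to score the query against the whole vector table with a single numpy operation instead of looping over the vocab in Python. A rough sketch, assuming a model with vectors loaded (e.g. en_core_web_lg):

import numpy as np

def most_similar_vectorised(nlp, word, topn=10):
    vectors = nlp.vocab.vectors
    data = vectors.data                                  # shape (n_rows, n_dims)
    query = nlp.vocab[word].vector
    norms = np.linalg.norm(data, axis=1) * np.linalg.norm(query)
    norms[norms == 0] = 1e-8                             # guard against zero rows
    scores = data.dot(query) / norms                     # cosine similarity per row
    best_rows = np.argsort(scores)[::-1][:topn]
    # Map rows back to hash keys; several keys can share one row, we keep one each.
    row_to_key = {row: key for key, row in vectors.key2row.items()}
    return [nlp.vocab.strings[row_to_key[r]] for r in best_rows if r in row_to_key]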
@CrossNox I had a similar error (AxisError: axis 1 is out of bounds for array of dimension 1) and solved it by reshaping the ndarray:
tvec = token.vector
most_similar = doc.vocab.vectors.most_similar(tvec.reshape(1,tvec.shape[0]))
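For completeness, in recent spaCy versions most_similar accepts an n argument and returns a tuple of arrays (keys, best rows, scores), as far as I understand the API, so the hash keys still need to be mapped back to strings, roughly like this:

import numpy as np

tvec = nlp.vocab[u'perro'].vector                        # any in-vocabulary word
keys, best_rows, scores = nlp.vocab.vectors.most_similar(
    np.asarray([tvec]), n=10)                            # queries must be 2-D
print([nlp.vocab.strings[int(k)] for k in keys[0]])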
Title says it all, I guess :-) I'm trying to replace NLTK with spaCy and ran into this little corner. In NLTK I use synsets, which are not the same as synonyms of course, but they do the trick for now. I know that WordNet is somehow bundled into the spaCy corpora; is there any way to use that?