explosion / sense2vec

🦆 Contextually-keyed word vectors
https://explosion.ai/blog/sense2vec-reloaded
MIT License
1.62k stars, 240 forks

Reproduce https://demos.explosion.ai/sense2vec quality results locally #151

Closed by cnantoninor 1 year ago

cnantoninor commented 1 year ago

👋🏽 Hello team, and thanks for your work!

I tried to reproduce the https://demos.explosion.ai/sense2vec results in a notebook, but I wasn't able to. For example, for Real_Estate|NOUN the demo UI returns:

real estate 76%
real estate market 71%
housing market 68%
real estate investment 67%
Foreign investment 67%
commercial real estate 67%
Rental properties 66%
Home ownership 66%
RE market 66%
Housing market 66%
residential real estate 65%

while locally I am seeing:

[('sunny_day|PROPN', 0.7656),
 ('Luttrell|PROPN', 0.6905),
 ('Starting_Line|NOUN', 0.68),
 ('Single_Version|PROPN', 0.6745),
 ('Blue_Jeans|PROPN', 0.6741),
 ('Cinematic_Orchestra|NOUN', 0.6719),
 ("Marvin_Gaye_-_What's|PERSON", 0.6713),
 ('Janelle_Monae_-_Dirty|PERSON', 0.6665),
 ('Best_Part|NOUN', 0.6662),
 ('Amon_Tobin_-|PERSON', 0.6655)]

The ones from the demo UI seem (much) better to me 😃

Please find below the way I produced the local results:

import spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load("en_core_web_lg")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("../bloomberg/rsrcheng/rsrcheng_queryexp_spike/data/s2v_reddit_2019_lg/")
s2v.s2v.most_similar('Real_Estate|NOUN')

What am I missing here?

Thanks in advance!

adrianeboyd commented 1 year ago

I think the issue is that the lookup is sensitive to casing. I get the same results with Real_Estate in the demo as with the local model. Try real_estate|NOUN instead?

Compare:

https://demos.explosion.ai/sense2vec?word=Real_Estate&sense=NOUN&model=2015
https://demos.explosion.ai/sense2vec?word=real_estate&sense=NOUN&model=2015
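
To illustrate the casing point without loading the real vectors, here is a minimal sketch that tries several casings of a phrase and keeps the most frequent key that exists. The `TOY_FREQS` table, its counts, and the helper names `case_variants` and `best_key` are all made up for this example; they are not part of sense2vec. The library's README describes a `get_best_sense` method with an `ignore_case` option that is meant to do something along these lines, so it may be worth checking that before rolling your own.

```python
# Toy stand-in for a sense2vec key table: "phrase|SENSE" -> frequency.
# Keys and counts are invented for illustration only.
TOY_FREQS = {
    "real_estate|NOUN": 1200,
    "Real_Estate|NOUN": 3,
    "housing_market|NOUN": 450,
}


def case_variants(word):
    """Return the distinct casings of a phrase to try, in a stable order."""
    seen = []
    for candidate in (word, word.lower(), word.upper(), word.title()):
        if candidate not in seen:
            seen.append(candidate)
    return seen


def best_key(word, sense, freqs):
    """Pick the most frequent existing key among the case variants, or None."""
    candidates = [f"{variant}|{sense}" for variant in case_variants(word)]
    present = [key for key in candidates if key in freqs]
    if not present:
        return None
    return max(present, key=lambda key: freqs[key])


# "Real_Estate|NOUN" exists but is rare here, so the lowercase key wins.
print(best_key("Real_Estate", "NOUN", TOY_FREQS))
```

The key returned this way could then be passed to `most_similar` in place of the hand-typed one, which sidesteps rare, oddly cased entries such as the title-cased key in the original report.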

cnantoninor commented 1 year ago

🤦🏽‍♂️ That did the trick, thanks!