DiceTechJobs / ConceptualSearch

Train a Word2Vec model or LSA model, and Implement Conceptual Search\Semantic Search in Solr\Lucene - Simon Hughes Dice.com, Dice Tech Jobs
http://www.dice.com
Apache License 2.0
257 stars, 58 forks

How to embed word vectors in solr #7

Open damachi49 opened 5 years ago

damachi49 commented 5 years ago

@simonhughes22

Hello,

I was wondering what steps you took to embed the trained word embeddings in Solr. There seems to be no documentation on how to do it in Solr. Thanks a lot for your time and help.

Best regards

simonhughes22 commented 5 years ago

You have to make the outputs of word2vec work within an inverted index, which works with sparse tokens (i.e. words) and not dense vectors.

In a nutshell:

  1. Extract keywords and phrases. Phrases are very important: treating hadoop and developer as two separate keywords and averaging the vectors didn't work well, whereas treating hadoop_developer as a single token in word2vec works very well. How you do this is an NLP question; search for research into identifying collocations. PMI (pointwise mutual information) is one way to do this, and there are many others
  2. Train a word2vec model, that includes words and phrases from 1
  3. Then either
    • query word2vec for every word and phrase from 1, and take the top n (say 10) terms, ranked by similarity; word2vec supports this sort of query. Then at query time, use these terms for query expansion, using the cosine similarity to scale the word or phrase boosts. However, make sure the original query terms still get the highest boost. Also make sure the query is not an AND query, as we want it to match any of the associated expansion terms or phrases
    • or cluster the embedding vectors using a clustering algo, e.g. k-means. Then map each word->vector->cluster, and assign a unique id to each cluster. At index time, index words/phrases into a cluster field containing these cluster ids. At query time, look up the corresponding cluster id and search on it (so q=^5 OR cluster_field:^1). Tune the query boosts, in place of 5 and 1
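
A minimal sketch of both options in step 3, in plain Python so it runs without a trained model. The toy vectors, the `text` field name, the cluster ids and the helper names are all hypothetical; in practice the neighbours would come from your word2vec model and the cluster ids from k-means. The shape of the cluster query (original term at boost 5, cluster id match at boost 1) is my reading of the example above, so treat it as an assumption.

```python
import math

# Toy embeddings standing in for a trained word2vec model (hypothetical values).
VECTORS = {
    "hadoop_developer": [0.9, 0.1, 0.0],
    "big_data_engineer": [0.8, 0.2, 0.1],
    "spark": [0.7, 0.3, 0.0],
    "nurse": [0.0, 0.1, 0.9],
}

# Hypothetical word -> cluster id mapping, e.g. produced offline by k-means.
WORD_TO_CLUSTER = {"hadoop_developer": 17, "big_data_engineer": 17, "nurse": 42}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_similar(term, vectors, n=10):
    """Return the n terms closest to `term`, ranked by cosine similarity."""
    target = vectors[term]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != term]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


def expand_query(term, vectors, n=10, original_boost=5.0, field="text"):
    """Option 1: OR query with the original term boosted highest and each
    expansion term boosted by its similarity (at most 1.0, so below 5.0)."""
    clauses = ["{}:{}^{:.2f}".format(field, term, original_boost)]
    for other, sim in top_n_similar(term, vectors, n):
        if sim > 0:  # skip unrelated or negatively related terms
            clauses.append("{}:{}^{:.2f}".format(field, other, sim))
    return " OR ".join(clauses)  # OR, not AND: any expansion term may match


def cluster_query(term, word_to_cluster, term_boost=5, cluster_boost=1,
                  field="text"):
    """Option 2: search the original term plus its cluster id."""
    clause = "{}:{}^{}".format(field, term, term_boost)
    cluster_id = word_to_cluster.get(term)
    if cluster_id is None:  # unknown word: fall back to the plain term
        return clause
    return "{} OR cluster_field:{}^{}".format(clause, cluster_id, cluster_boost)


if __name__ == "__main__":
    print(expand_query("hadoop_developer", VECTORS, n=2))
    print(cluster_query("hadoop_developer", WORD_TO_CLUSTER))
```

The resulting strings would go in the `q` parameter of a Solr request. Scaling expansion boosts by raw cosine similarity is only a starting point; the boosts need tuning against real queries.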

All of the code in the repo will help you do the above, including phrasal identification, but it may not be 100% clear what is doing what. I strongly recommend reading the associated PowerPoint deck and watching the Lucene Revolution talk (linked to on GitHub) before going any further, if you haven't already.
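
To give a rough idea of the phrasal identification in step 1, here is a small PMI sketch. It is not the code from the repo; the function names, the `min_count` and `threshold` defaults and the sample sentences are assumptions for illustration only. It scores adjacent word pairs by pointwise mutual information and joins the high-scoring pairs into single tokens such as hadoop_developer before word2vec training.

```python
import math
from collections import Counter


def pmi_phrases(sentences, min_count=2, threshold=1.0):
    """Score adjacent word pairs by PMI; keep pairs scoring >= threshold.

    sentences: list of token lists. min_count drops rare pairs, whose PMI
    estimates are unreliable.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bi
        p_w1 = unigrams[w1] / total_uni
        p_w2 = unigrams[w2] / total_uni
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        if pmi >= threshold:
            scores[(w1, w2)] = pmi
    return scores


def join_phrases(tokens, phrases):
    """Greedily merge detected pairs, left to right, into one token each."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    corpus = [
        ["senior", "hadoop", "developer", "wanted"],
        ["hadoop", "developer", "with", "spark"],
        ["java", "developer", "wanted"],
        ["senior", "java", "engineer"],
    ]
    phrases = pmi_phrases(corpus)
    print(join_phrases(["hadoop", "developer", "with", "spark"], phrases))
```

This only handles two-word phrases, and on a corpus this small it will also pick up pairs you may not want; longer phrases and a real corpus need more care.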

HTH