How to embed word vectors in Solr

You have to make the output of word2vec work within an inverted index, which operates on sparse tokens (i.e. words) rather than dense vectors.

In a nutshell:

1. Extract keywords and phrases. Phrases are very important: treating hadoop and developer as two separate keywords and averaging their vectors didn't work well, whereas treating hadoop_developer as a single token in word2vec works very well. How you do this is an NLP question; search for research on identifying collocations. PMI (pointwise mutual information) is one way to do this, and there are many others.
2. Train a word2vec model that includes the words and phrases from step 1.
3. Then either:
   - Query word2vec for each word and phrase from step 1 and take the top n (say 10) terms, ranked by similarity; word2vec supports this sort of query. At query time, use these terms for query expansion, using the cosine similarity to scale the boost on each expansion word or phrase. Make sure the original query terms still get the highest boost, and make sure the query is not an AND query, since it should match any of the associated expansion terms or phrases.
   - Or cluster the embedding vectors with a clustering algorithm, e.g. k-means, so that each word maps word -> vector -> cluster, and assign a unique id to each cluster. At index time, index the cluster ids of a document's words and phrases into a cluster field. At query time, look up the cluster ids for the query terms and search on them as well (so q=^5 OR cluster_field:^1), tuning the two boosts in place of 5 and 1.
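
To make the phrase-extraction step more concrete, here is a minimal PMI sketch in plain Python. It is only an illustration under simple assumptions (pre-tokenized text, adjacent word pairs only, hand-picked thresholds); the function names `find_phrases` and `merge_phrases` are made up for this example and are not taken from the repo.

```python
import math
from collections import Counter


def find_phrases(sentences, min_count=2, min_pmi=1.0):
    """Return {(w1, w2): pmi} for adjacent word pairs that pass both thresholds."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    phrases = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        # PMI = log( P(w1, w2) / (P(w1) * P(w2)) )
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi = math.log(p_pair / p_independent)
        if pmi >= min_pmi:
            phrases[(w1, w2)] = pmi
    return phrases


def merge_phrases(tokens, phrases):
    """Rewrite a token list so each detected pair becomes a single w1_w2 token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


# Toy corpus, invented for this example.
corpus = [
    ["senior", "hadoop", "developer", "wanted"],
    ["hadoop", "developer", "in", "london"],
    ["java", "developer", "wanted"],
    ["we", "need", "a", "hadoop", "developer"],
]
phrases = find_phrases(corpus)
training_sentences = [merge_phrases(tokens, phrases) for tokens in corpus]
```

The merged sentences (with tokens such as hadoop_developer) are what you would then feed to word2vec training. On a real corpus the thresholds need tuning, and longer phrases need either repeated passes or a different method.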
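
The query-expansion option could look roughly like the sketch below. It assumes you already have the top-n neighbours of the query term as (token, cosine similarity) pairs from your word2vec model. The helper name `expand_query`, the default boost values, and the choice to turn underscores back into quoted phrases are all assumptions for illustration; adjust them to match how your index analyzes phrases.

```python
def expand_query(term, neighbours, original_boost=5.0, max_expansion_boost=2.0):
    """Build an OR query: the original term plus similarity-weighted neighbours."""
    if max_expansion_boost >= original_boost:
        raise ValueError("the original term must keep the highest boost")

    def as_clause(token, boost):
        # Assumption: hadoop_developer in the model maps to the phrase "hadoop developer".
        text = token.replace("_", " ").replace('"', '\\"')
        return '"%s"^%.2f' % (text, boost)

    clauses = [as_clause(term, original_boost)]
    for token, similarity in neighbours:
        if similarity <= 0:
            continue  # unrelated or opposite terms add noise, so skip them
        # Cosine similarity is at most 1, so this never exceeds max_expansion_boost.
        clauses.append(as_clause(token, similarity * max_expansion_boost))
    # OR rather than AND, so a document can match on any expansion term.
    return " OR ".join(clauses)


# Neighbours and similarities here are made-up values for the example.
query = expand_query("hadoop_developer", [("big_data", 0.8), ("spark", 0.5)])
# query == '"hadoop developer"^5.00 OR "big data"^1.60 OR "spark"^1.00'
```

Scaling the expansion boosts by similarity while capping them below the original boost keeps the user's own terms dominant in the ranking.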
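
For the clustering option, the word -> vector -> cluster mapping can be sketched as follows. This is a toy, dependency-free k-means with farthest-first seeding (chosen only to keep the example deterministic); in practice you would more likely use a library implementation. The embeddings are invented two-dimensional values, and the names `build_cluster_map` and `cluster_tokens` and the `cluster_N` id format are assumptions for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster index per input vector."""
    # Farthest-first seeding: start from the first vector, then repeatedly
    # add the vector that is farthest from all centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def build_cluster_map(embeddings, k):
    """Map each word or phrase to a cluster id token such as 'cluster_1'."""
    words = sorted(embeddings)
    labels = kmeans([embeddings[w] for w in words], k)
    return {w: "cluster_%d" % label for w, label in zip(words, labels)}


def cluster_tokens(tokens, cluster_map):
    """Cluster ids for the tokens that have one (used at index and query time)."""
    return [cluster_map[t] for t in tokens if t in cluster_map]


# Invented 2-d vectors; real word2vec vectors have far more dimensions.
embeddings = {
    "hadoop_developer": [1.0, 0.9],
    "big_data": [0.9, 1.0],
    "nurse": [-1.0, -0.9],
    "registered_nurse": [-0.9, -1.0],
}
cluster_map = build_cluster_map(embeddings, k=2)
```

At index time the output of `cluster_tokens` for a document's words and phrases would go into the cluster field, and at query time the same lookup on the query terms gives the ids to search for in that field.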

All of the code in the repo will help you do the above, including phrase identification, but it may not be 100% clear what each part does. I strongly recommend reading the associated PowerPoint deck and watching the Lucene Revolution talk (linked to on GitHub) before going any further, if you haven't already.

HTH
