allenai / ike

Build tables of information by extracting facts from indexed text corpora via a simple and effective query language.
http://allenai.org/software/interactive-knowledge-extraction/
Apache License 2.0

[CLOSED] Use word2vec for phrase similarity #42

sbhaktha closed this issue 8 years ago

sbhaktha commented 8 years ago

Issue by afader Wed Mar 18 16:42:32 2015 Originally opened as https://github.com/allenai/okcorpus/issues/40


Right now, we are using Brown clusters for a notion of word similarity in OKC. The benefit of this is that the clusters are discrete and prefix-based, so we can easily add them as token annotations in BlackLab. The disadvantages are:

  • It only works on single words (no multiword phrases)
  • It's very, very slow
  • It may not be the best way to compute similarity
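
The prefix-based part is what makes Brown clusters easy to index: each word gets a bit-string path in the cluster hierarchy, and sharing a longer prefix means sitting in a finer-grained cluster together. A minimal sketch of that idea (the cluster paths below are invented for illustration, not taken from any real model):

```scala
// Hypothetical Brown cluster paths: each word maps to a bit-string position in the
// binary merge tree produced by Brown clustering.
object BrownClusterSketch {
  val clusters: Map[String, String] = Map(
    "extraction" -> "110100",
    "retrieval"  -> "110101",
    "parsing"    -> "111000"
  )

  // Two words are "similar at granularity k" if their paths share the first k bits.
  // Because this is a simple prefix test, it can be stored as a token annotation
  // and queried with prefix queries.
  def similarAt(w1: String, w2: String, k: Int): Boolean =
    (clusters.get(w1), clusters.get(w2)) match {
      case (Some(c1), Some(c2)) => c1.take(k) == c2.take(k)
      case _                    => false
    }

  def main(args: Array[String]): Unit = {
    println(similarAt("extraction", "retrieval", 5)) // true: both paths start with "11010"
    println(similarAt("extraction", "parsing", 5))   // false
  }
}
```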

I was playing around with word2vec, and the results on the ACL corpus are great. For example, here are the phrases most similar to information_extraction:

                                              Word       Cosine distance
------------------------------------------------------------------------
                                       text_mining      0.826766
                                question_answering      0.796270
                              knowledge_extraction      0.772408
                                 ontology_learning      0.764749
                           automatic_summarization      0.752856
                      automatic_text_summarization      0.748179
                                                ie      0.742152
                                 entity_extraction      0.737166
                               knowledge_discovery      0.737004
                                   fact_extraction      0.735164
                                   image_retrieval      0.731172
                             information_retrieval      0.725849
                 biomedical_information_extraction      0.723953
                                information_fusion      0.721746
       applications_such_as_information_extraction      0.721209
                                 literature_mining      0.716144
                               relation_extraction      0.713560
                                text_summarization      0.710251
                          named_entity_recognition      0.708391
                                    record_linkage      0.705008
                                text_understanding      0.704244
                     such_as_information_retrieval      0.698317
                    such_as_information_extraction      0.698231
                               ontology_population      0.697375
                    information_extraction_systems      0.694957
            many_natural_language_processing_tasks      0.686736
                                    opinion_mining      0.684321
                                       open_domain      0.683641
                                entity_recognition      0.682989
                        cross_-_language_retrieval      0.681546
                        question_answering_systems      0.681245
                           document_classification      0.680587
                                   computer_vision      0.679659
                           named_entity_extraction      0.679576
                    textual_entailment_recognition      0.677568
                        named_-_entity_recognition      0.675705
                       natural_language_processing      0.675098
                                       data_mining      0.674606
           applications_such_as_question_answering      0.674527
                                opinion_extraction      0.673862

Here are the most similar phrases to outperforms:

                                              Word       Cosine distance
------------------------------------------------------------------------
                              performs_better_than      0.914765
                         significantly_outperforms      0.894385
                                      outperformed      0.879147
                                     improves_over      0.859142
                         performs_much_better_than      0.842823
                               performs_worse_than      0.832298
                          consistently_outperforms      0.831501
                                  outperforms_both      0.828320
                               clearly_outperforms      0.821622
                performs_significantly_better_than      0.817346
                                    is_superior_to      0.816786
                                             beats      0.806530
                             performed_better_than      0.805698
                     performs_slightly_better_than      0.804796
                                     outperforming      0.800169
                               performs_as_well_as      0.792212
                                 still_outperforms      0.788039
                                   outperforms_all      0.785506
                                  does_better_than      0.784221
                                  also_outperforms      0.782234
                  achieves_better_performance_than      0.774746
                                         surpasses      0.770587
                        significantly_outperformed      0.770434
                                 works_better_than      0.765581
                                    can_outperform      0.762990
                                 model_outperforms      0.762042
                              slightly_outperforms      0.759185
                               does_not_outperform      0.758555
                             our_model_outperforms      0.757645
                         substantially_outperforms      0.752425
                                     underperforms      0.751935
                                method_outperforms      0.749556
                                always_outperforms      0.740629
                                     performs_best      0.736030
                            performs_comparably_to      0.735505
                              performed_worse_than      0.734069
                              approach_outperforms      0.733126
                       is_consistently_better_than      0.729947
                                 models_outperform      0.723421
                                  even_outperforms      0.722097

So, word2vec is:

  • Many orders of magnitude faster than Brown clustering
  • Works on multiword phrases
  • Looks really good

I think that using this information could have a big impact on usability, since you no longer have to think in terms of single words but still get a good notion of similarity.

The downside is that multiword phrases cannot be indexed like Brown clusters.

To add word2vec similarity into OKC, we can do the following:

  • Add a new web API similarPhrases(phrase: Seq[String], threshold: Double): Seq[Seq[String]] that returns the phrases that have similarity within the given threshold
  • Have the slider in the UI control the threshold
  • Combine the phrases returned by similarPhrases into a disjunction, which can then be queried against the index
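
As a rough illustration of this plan (all names and numbers below are made up; this is not the actual OKC code), the endpoint and the disjunction step might look something like:

```scala
// Sketch of the proposed similarPhrases API, assuming nearest neighbors have been
// precomputed offline from the word2vec model into a phrase -> (neighbor, cosine) map.
object SimilarPhrasesSketch {
  type Phrase = Seq[String]

  // Hypothetical precomputed neighbors (values loosely echo the table above).
  val neighbors: Map[Phrase, Seq[(Phrase, Double)]] = Map(
    Seq("information", "extraction") -> Seq(
      (Seq("text", "mining"), 0.83),
      (Seq("question", "answering"), 0.80),
      (Seq("knowledge", "extraction"), 0.77)
    )
  )

  // Return every phrase whose similarity clears the threshold; the UI slider
  // would control the threshold.
  def similarPhrases(phrase: Phrase, threshold: Double): Seq[Phrase] =
    neighbors.getOrElse(phrase, Seq.empty).collect {
      case (p, sim) if sim >= threshold => p
    }

  // Combine the original phrase and its neighbors into one disjunction, here
  // rendered as a Corpus Query Language style OR of phrase queries.
  def toDisjunction(phrases: Seq[Phrase]): String =
    phrases.map(_.map(w => "\"" + w + "\"").mkString(" ")).mkString("(", ") | (", ")")

  def main(args: Array[String]): Unit = {
    val query = Seq("information", "extraction")
    val expanded = query +: similarPhrases(query, threshold = 0.79)
    println(toDisjunction(expanded))
    // ("information" "extraction") | ("text" "mining") | ("question" "answering")
  }
}
```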

@dirkgr and @chrisc36 any thoughts/opinions on this?

sbhaktha commented 8 years ago

Comment by dirkgr Wed Mar 18 17:29:51 2015


I’m afraid of the impact on query times. We’re talking about solving a lot of problems by using big disjunctions.

How big is your vocabulary when you do this? Finding the k most similar items, for small k, by cosine distance is O(n*m), where n is the size of the vocabulary, and m is the length of the word vectors. In my Python experiments, for vectors of length 100 and 5M vocabulary items, it takes just under a minute to do this. Scala is faster, and we can use multiple cores, so this is feasible. Storing the vectors in memory takes almost 16GB though.
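
For reference, a minimal sketch of that brute-force scan (not the actual code; vectors are assumed pre-normalized so cosine similarity reduces to a dot product):

```scala
// Brute-force k-NN over the vocabulary: O(n*m) dot products per query for n
// vocabulary entries and m-dimensional vectors, plus a sort to pick the top k.
object BruteForceKnnSketch {
  def dot(a: Array[Float], b: Array[Float]): Float = {
    var s = 0f
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Scan every vocabulary entry and keep the k highest-scoring ones.
  def nearest(query: Array[Float],
              vocab: Array[(String, Array[Float])],
              k: Int): Seq[(String, Float)] =
    vocab.map { case (w, v) => (w, dot(query, v)) }
      .sortBy { case (_, sim) => -sim }
      .take(k)
      .toSeq
}
```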

Maybe we can do hierarchical clustering based on cosine distance? Then it’s slow again, but at least the quality would be better.

Is it possible in BlackLab to index phrases on top of tokens?

Dirk

sbhaktha commented 8 years ago

Comment by jakemannix Wed Mar 18 18:11:26 2015


I wonder if using MinHash or some other locality-sensitive hashing, on top of word2vec, would let us find the k most similar items much faster? In particular, I think it scales independently of n (the vocabulary size).

sbhaktha commented 8 years ago

Comment by cristipp Wed Mar 18 18:32:06 2015


MinHash over sentences? Does that make sense in this context?

sbhaktha commented 8 years ago

Comment by jakemannix Wed Mar 18 18:36:53 2015


You can minhash any vector. I'm saying take the vectors produced by word2vec, and minhash those so that later similarity comparisons can be done on the fly much quicker, since @dirkgr was concerned about the runtime performance of the similarity computation.

sbhaktha commented 8 years ago

Comment by cristipp Wed Mar 18 18:43:27 2015


MinHash specifically looks for identical sub-vectors in the vector population. That's different from word2vec, which uses a cosine similarity metric.

That being said, there may be tricks to speed up cosine similarity, but IMHO they're more likely geometric tricks, like kd-trees. But then you hit the curse of dimensionality :( I wish I knew a good solution here.

sbhaktha commented 8 years ago

Comment by afader Wed Mar 18 18:47:37 2015


I think there are two issues:

  1. Executing the similarPhrases function, which requires a k-nearest neighbor query
  2. Searching over Lucene with a big disjunction of phrases

The first issue doesn't seem like a problem:

  • The distance.c code that comes with word2vec does it really fast, so there's hope!
  • We can restrict the input to only use a subset of the vocabulary.
  • Like @jakemannix says, we can use LSH to prevent quadratic blowup.
  • We don't have to worry about memory: we can pre-compute and store in a database.

The second issue seems like more of a problem, and we still don't know much about it...
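
For concreteness, here is a rough sketch of what that disjunction could look like at the raw Lucene level (field name and Lucene version are assumptions; in practice the query would go through BlackLab's query layer rather than Lucene directly):

```scala
import org.apache.lucene.search.{BooleanClause, BooleanQuery, PhraseQuery, Query}

// Turn the expanded set of phrases into one big SHOULD-disjunction of phrase queries.
// Note that Lucene caps the number of boolean clauses by default (historically 1024),
// which is exactly where aggressive expansion starts to hurt.
object DisjunctionSketch {
  def build(field: String, phrases: Seq[Seq[String]]): Query = {
    val builder = new BooleanQuery.Builder()
    phrases.foreach { words =>
      builder.add(new PhraseQuery(field, words: _*), BooleanClause.Occur.SHOULD)
    }
    builder.build()
  }
}
```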

sbhaktha commented 8 years ago

Comment by jakemannix Wed Mar 18 18:48:36 2015


Ok, maybe I'm mischoosing MinHash as an LSH implementation, but the LSH that I've done in the past works with cosine similarity: it's just bucketing (with high probability) all vectors within some angle 't' of a bunch of centroids, so in particular it puts all vectors with high cosine similarity in the same bucket.
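
A minimal sketch of that flavor of cosine LSH (sign-random-projection style; the parameter choices are illustrative, not tuned):

```scala
import scala.util.Random

// Each vector gets a short bit signature from the signs of its dot products with
// random hyperplanes; vectors separated by a small angle get the same signature
// with high probability, so a query only has to scan its own bucket.
object CosineLshSketch {
  def hyperplanes(numBits: Int, dim: Int, seed: Long = 42L): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(numBits, dim)(rng.nextGaussian())
  }

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One bit per hyperplane: set when the vector lies on its positive side.
  def signature(v: Array[Double], planes: Array[Array[Double]]): String =
    planes.map(p => if (dot(v, p) >= 0) '1' else '0').mkString

  // Bucket the vocabulary by signature; candidate neighbors of a query are just the
  // words sharing its bucket, independent of the total vocabulary size.
  def buckets(vocab: Map[String, Array[Double]],
              planes: Array[Array[Double]]): Map[String, Seq[String]] =
    vocab.toSeq
      .map { case (word, vec) => (signature(vec, planes), word) }
      .groupBy { case (sig, _) => sig }
      .map { case (sig, pairs) => (sig, pairs.map(_._2)) }
}
```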

sbhaktha commented 8 years ago

Comment by jakemannix Wed Mar 18 18:58:52 2015


So, the second problem I've run into a bunch: it was the bottleneck for computing recommendations in LinkedIn's recommender, which at its core took a precomputed set of tokens (and phrases) and computed a weighted OR query over the Lucene index of the profiles of everyone at LinkedIn. This disjunction ended up being nearly a table scan at first.

There are a variety of ways to approach this query-time perf, but it depends on your scoring model: you want to get the "top k" results, right? But what is your scoring function? Default Lucene / BlackLab? Does it matter much for this system? You really only want an approximation to the best results, right? There should be a bunch of ways we could speed it up if we're willing to live with some false negatives in the top k.

sbhaktha commented 8 years ago

Comment by afader Wed Mar 18 19:10:06 2015


I think we're interested in the "top k" case: given this query, find the top k matches in the corpus. There is no notion of scoring in BlackLab, so this is just the top k ordered by when they were added to the index.

For queries that match a large fraction of the data, false negatives are fine. The user won't be able to look at all of the results, anyhow.

For queries that match a small fraction of the data, false positives are probably worse.

sbhaktha commented 8 years ago

Comment by afader Wed Mar 18 22:53:45 2015


I hacked something together in 50816d852c08d648d2116240848cbb2b3eb2f572 to test this idea out...

Shelling out to distance.c is very fast, and queries involving multiple disjunctions (e.g. taking the 100 nearest neighbors of both "we use" and "penn treebank") return results very quickly.

I think we should do this!

sbhaktha commented 8 years ago

Comment by cristipp Wed Mar 18 23:02:15 2015


Heh, I've looked into the branch code, but darn if I can find the place where "Shelling out to distance.c" actually happens.

sbhaktha commented 8 years ago

Comment by afader Wed Mar 18 23:06:19 2015


@cristipp It's in that python script. I wasn't kidding when I said "hacked". word2vec-server.py starts a ./distance process and a webserver that handles requests to get the k-NN.
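
For anyone curious, here is a rough Scala sketch of the same idea, shelling out directly to the stock word2vec distance binary (paths and output parsing are assumptions, and unlike the Python hack this spawns a one-shot process per query instead of keeping a server around):

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}

// Send one query plus EXIT to ./distance, then read everything it printed before
// quitting and pick out the "<phrase>  <cosine>" result lines.
object DistanceClientSketch {
  def main(args: Array[String]): Unit = {
    val proc = new ProcessBuilder("./distance", "vectors.bin")
      .redirectErrorStream(true)
      .start()
    val toProc   = new PrintWriter(proc.getOutputStream, true)
    val fromProc = new BufferedReader(new InputStreamReader(proc.getInputStream))

    toProc.println("information_extraction") // the phrase to look up
    toProc.println("EXIT")                   // tell distance to quit, closing its output

    // Result lines look like "    text_mining      0.826766"; prompts, headers,
    // and separator lines won't match the pattern and are skipped.
    val neighborLine = """^\s*(\S+)\s+([0-9.]+)\s*$""".r
    var line = fromProc.readLine()
    while (line != null) {
      line match {
        case neighborLine(phrase, cosine) => println(s"$phrase -> $cosine")
        case _                            => // skip non-result lines
      }
      line = fromProc.readLine()
    }
    proc.waitFor()
  }
}
```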

sbhaktha commented 8 years ago

Comment by afader Wed Mar 18 23:07:35 2015


@cristipp script is here: https://github.com/allenai/okcorpus/blob/50816d852c08d648d2116240848cbb2b3eb2f572/src/main/bin/word2vec-server.py

Obviously we need to do this differently, but I'm feeling optimistic that it's worthwhile.