Using word embeddings to score/rank candidate key-phrases

vdpappu commented 5 years ago

Current key-phrase scoring is based on graph metrics and grammar rules, so candidates are not scored in context. A BERT-based approach is not computationally feasible. A lightweight approach that uses a pre-trained word-embedding dictionary should be considered to get more contextually relevant key-phrases in chapters and PIMs.

vdpappu commented 5 years ago

Current Approach

1. Train word embeddings on Slack and meeting text.
2. Use grammar rules/regex patterns to extract candidate key-phrases.
3. Score each candidate based on its relevance to the sentence it belongs to.
4. Build a graph with candidates as nodes and key-phrase relevance (calculated in step 3) as node weights.
5. Experiment with various graph properties for node/key-phrase scoring.
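
The relevance score in the approach above could be sketched roughly as follows. This is only an illustration under assumptions: the thread does not say how relevance is computed, so the sketch uses cosine similarity between averaged word vectors, and the `EMBEDDINGS` table, `tokenize`, `mean_vector`, and `relevance` are hypothetical names with made-up 3-dimensional vectors standing in for embeddings trained on Slack and meeting text.

```python
import math
import re

# Toy 3-dimensional vectors (made up for illustration). A real system would
# load embeddings trained on Slack and meeting text instead.
EMBEDDINGS = {
    "knowledge": [0.9, 0.1, 0.0],
    "graph": [0.8, 0.2, 0.1],
    "entities": [0.7, 0.3, 0.0],
    "relationships": [0.6, 0.4, 0.1],
    "contains": [0.5, 0.5, 0.2],
    "information": [0.7, 0.2, 0.2],
    "small": [0.0, 0.9, 0.3],
    "step": [0.1, 0.8, 0.4],
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def mean_vector(tokens):
    """Average the vectors of in-vocabulary tokens; None if none are known."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance(candidate, sentence):
    """Assumed relevance: cosine between averaged candidate and sentence vectors."""
    c = mean_vector(tokenize(candidate))
    s = mean_vector(tokenize(sentence))
    if c is None or s is None:
        return 0.0  # nothing in vocabulary, so no evidence of relevance
    return cosine(c, s)


sentence = "The knowledge graph contains information about entities and their relationships"
scores = {c: relevance(c, sentence) for c in ["knowledge graph", "small step"]}
```

With these toy vectors, "knowledge graph" scores higher than "small step" against the sentence. Averaging word vectors ignores word order, which is one reason a sentence encoder may do better.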

shashankpr commented 5 years ago

Exploring the use of Google's Universal Sentence Encoder to compare word and sentence similarity for keyphrase scoring. Initial basic testing is encouraging.

shashankpr commented 5 years ago

Testing with an out-of-the-box Universal Sentence Encoder model has yielded good results. The ranking process is as follows:

1. A keyword graph is built, and the cosine similarity between each pair of connected nodes is used as the edge weight.
2. Edges in the bottom 10th percentile by weight are removed (i.e., node pairs with the lowest similarity scores are no longer connected).
3. Weighted PageRank is run to get a PageRank score for each node. These scores are treated as the global score for each keyword.
4. For a given segment, its similarity to each of its candidate keyphrases is calculated. This score is treated as the local relevance.
5. For the final ranking, a simple sum of the local relevance score and the global PageRank score is currently used.
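
The ranking steps above can be sketched end to end as below. This is a minimal sketch, not the deployed code. It makes several assumptions: hand-made 3-dimensional vectors (`VECTORS`, `SEGMENT_VECTOR`) stand in for Universal Sentence Encoder outputs, the graph is taken to be fully connected because the thread does not say which nodes are linked, "bottom 10 percentile" is read as dropping the lowest 10% of edges by count, and `prune_edges` and `weighted_pagerank` are hypothetical helper names.

```python
import math
from itertools import combinations

# Made-up vectors standing in for sentence-encoder embeddings (assumption).
VECTORS = {
    "knowledge graph": [0.9, 0.1, 0.1],
    "graph entity": [0.8, 0.2, 0.1],
    "object-based identifiers": [0.7, 0.3, 0.2],
    "science fiction": [0.1, 0.9, 0.2],
    "small step": [0.2, 0.3, 0.9],
}
SEGMENT_VECTOR = [0.8, 0.2, 0.2]  # stand-in for the encoded segment text


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_edges(edges, percentile=10):
    """Drop the lowest `percentile` percent of edges, ranked by weight."""
    ranked = sorted(edges, key=edges.get)
    n_drop = int(len(ranked) * percentile / 100)
    return {e: edges[e] for e in ranked[n_drop:]}


def weighted_pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph with positive weights."""
    neighbors = {n: {} for n in nodes}
    for (a, b), w in edges.items():
        if w > 0:  # skip non-positive weights so node strengths stay positive
            neighbors[a][b] = w
            neighbors[b][a] = w
    strength = {n: sum(neighbors[n].values()) for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by isolated nodes is redistributed uniformly.
        dangling = sum(rank[n] for n in nodes if strength[n] == 0)
        new_rank = {}
        for n in nodes:
            incoming = sum(rank[m] * w / strength[m] for m, w in neighbors[n].items())
            new_rank[n] = (1 - damping) / len(nodes) + damping * (
                incoming + dangling / len(nodes)
            )
        rank = new_rank
    return rank


nodes = list(VECTORS)
# Step 1: edge weight = cosine similarity between connected nodes.
edges = {(a, b): cosine(VECTORS[a], VECTORS[b]) for a, b in combinations(nodes, 2)}
# Step 2: remove the weakest edges.
edges = prune_edges(edges, percentile=10)
# Step 3: global score from weighted PageRank.
global_score = weighted_pagerank(nodes, edges)
# Step 4: local relevance = similarity between the segment and each candidate.
local_score = {n: cosine(VECTORS[n], SEGMENT_VECTOR) for n in nodes}
# Step 5: final score is the plain sum of the two.
final_score = {n: global_score[n] + local_score[n] for n in nodes}
ranking = sorted(nodes, key=final_score.get, reverse=True)
```

One thing the sketch makes visible: PageRank scores sum to 1 across all nodes while cosine similarities are per-candidate values up to 1, so in an unnormalized sum the local relevance term tends to dominate as the graph grows.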

Even with this simple summation, the results are better than the current production keyphrase implementation. A mean average precision (MAP) of 0.76 is obtained when compared against the rankings given by the ML team.

Example: Segment

No problem. I was born in Poland and like many other Engineers I grew up reading inspection if I could this is my favorite. I'll talk with Hall was actually very close to his house and crackle and little did. I know that his Works would have major influence on my life. Did you in science fiction to do you remember the moment when the protagonist while busy saving the world asks her computer for something computer make it so and the Machine just does it and it dealt it now, why can't the real world people like this? For example, when a user sign OK Google and ask for that application will surface most suitable to fulfill the request should be open and get the job done just like that.Today, I want to talk to you about Technology based on schema.org actions, which I think is a small step in that direction with actions. We have an opportunity to like your users and bring more engagement to your app. However, we can't do it alone. We need your help at Google. We have been working on organizing the world's information to make it universally accessible and useful to help with that. We build a Knowledge Graph the knowledge graph contains information about entities and their relationships. One of the interesting applications of the knowledge options resolving ambiguities will processing language points, for example, which is different than the concept of dual core. As far as the strings are concerned these two are equal but to a user who is asking their phone to they do Court the difference is quite clear.Dual core machine ID in the graph. We like to refer to these types of object-based identifiers as links not strict having a graph entity with a machine ID or miss the knowledge graph can also help satisfy user requests North.

Original ranking

1, 'help satisfy user requests North',
2, 'knowledge graph',
3, 'Dual core machine',
4, 'user sign',
5, 'language points',
6, 'graph entity',
7, 'interesting applications',
8, 'computer',
9, 'knowledge options',
10, 'org actions',
11, 'Google',
12, 'small step',
13, 'major influence',
14, 'science fiction',
15, 'object-based identifiers',
16, 'links',
17, 'people',
18, 'moment',
19, 'entities',
20, 'engagement'

One of the ML/AI team's rankings

1, 'knowledge graph'
2, 'graph entity'
3, 'interesting applications'
4, 'knowledge options'
5, 'object-based identifiers'
6, 'entities'
7, 'language points'
8, 'Google'
9, 'help satisfy user requests North'
10, 'engagement'

This method's ranking

1, 'knowledge options', 
2, 'knowledge graph', 
3, 'graph entity',
4, 'help satisfy user requests North',
5, 'schema',
6, 'object-based identifiers',
7, 'Google', 
8, 'interesting applications',
9, 'language points',
10, 'Dual core machine'
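
To compare the rankings above numerically, one option is average precision at 10 against the ML team's list. The thread does not state how its MAP figure was computed, so the definition below is an assumption: every phrase in the ML team's ranking counts as relevant (its order is ignored), and precision is accumulated at each hit within the top `k`, then divided by the number of relevant phrases capped at `k`. MAP would then be the mean of this value over many segments and annotators. `average_precision` is a hypothetical helper name.

```python
def average_precision(ranked, relevant, k=10):
    """Average precision at k, treating `relevant` as an unordered set."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for position, phrase in enumerate(ranked[:k], start=1):
        if phrase in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    denominator = min(len(relevant), k)
    return precision_sum / denominator if denominator else 0.0


# The ML team's ranking from this example, used as the relevant set.
reference = [
    "knowledge graph", "graph entity", "interesting applications",
    "knowledge options", "object-based identifiers", "entities",
    "language points", "Google", "help satisfy user requests North",
    "engagement",
]

# Top 10 of the original (production) ranking.
original = [
    "help satisfy user requests North", "knowledge graph", "Dual core machine",
    "user sign", "language points", "graph entity", "interesting applications",
    "computer", "knowledge options", "org actions",
]

# Ranking produced by the method described in this thread.
proposed = [
    "knowledge options", "knowledge graph", "graph entity",
    "help satisfy user requests North", "schema", "object-based identifiers",
    "Google", "interesting applications", "language points", "Dual core machine",
]

ap_original = average_precision(original, reference)
ap_proposed = average_precision(proposed, reference)
```

Under this definition the proposed ranking scores higher than the original one on this segment. A single segment is not a MAP estimate, so these values are not expected to match the reported figure exactly.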

shashankpr commented 5 years ago

The keyphrase ranking method is deployed in staging2. It uses the Universal Sentence Encoder, which was found during testing to perform better than word embeddings alone. Refer to #99.

etherlabsio / ai-engine

Using word embeddings to score/rank candidate key-phrases #66