Clustering and Similarity: Retrieving Documents

Questions: if we want to recommend an article to a reader based on an article they like:

How do we measure similarity?
How do we search over articles?

Bag of words model

Ignore the order of the words and count the number of occurrences of each word in the vocabulary. Example: "Carlos calls the sport fútbol. Emily calls the sport soccer."

We can then build a count vector with one entry per vocabulary word. Any word that is in the vocabulary but not in the sentence above gets a count of 0.
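The counting step can be sketched in a few lines of Python. This is only an illustration: the tokenisation rule (lowercase, split on non-word characters) and the function name `bag_of_words` are assumptions, not something the notes specify.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring word order and case."""
    # \w+ matches runs of letters/digits; punctuation is dropped.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

sentence = "Carlos calls the sport fútbol. Emily calls the sport soccer."
counts = bag_of_words(sentence)

print(counts["calls"])  # word that occurs twice in the sentence
print(counts["goal"])   # word that is not in the sentence: Counter reports 0
```

A `Counter` returns 0 for any missing key, which matches the idea that every vocabulary word outside the sentence has a count of 0.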

Measuring Similarity

The green table is the bag-of-words vector for an article about Messi, and the blue table is the vector for an article about Pele. The similarity of the two articles, computed by multiplying the counts word by word and summing the products, is 13.


Now compare the soccer article with an article about a conflict in Africa (red table). The two articles share no words, so the sum is 0.
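A minimal sketch of this similarity score, assuming each article is stored as a word-to-count dictionary. The counts below are hypothetical (they are not the values from the original tables); they were picked so that the two soccer articles also score 13.

```python
def dot_similarity(counts_a, counts_b):
    """Sum of count products over the words that both documents contain."""
    return sum(count * counts_b[word]
               for word, count in counts_a.items()
               if word in counts_b)

# Hypothetical word counts for three articles.
soccer_a = {"messi": 1, "soccer": 5, "goal": 3}
soccer_b = {"pele": 2, "soccer": 2, "goal": 1}
africa = {"conflict": 4, "border": 2, "peace": 1}

print(dot_similarity(soccer_a, soccer_b))  # 5*2 + 3*1 = 13
print(dot_similarity(soccer_a, africa))    # no shared words, so 0
```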

Problem: if we lengthen the text, for example by repeating the same sentences a few times, the sum gets bigger even though the content has not changed.

Solution

Normalize the vector: after normalization every document is compared on an equal footing, regardless of its length.
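The effect of normalization can be checked with a small sketch. It assumes the Euclidean (L2) norm, which is a common choice but is not stated in the notes; the word counts and helper names are hypothetical.

```python
import math

def normalize(counts):
    """Scale a word-count vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return dict(counts)  # empty document: nothing to scale
    return {word: c / norm for word, c in counts.items()}

def dot(a, b):
    """Sum of products over shared words."""
    return sum(v * b.get(word, 0.0) for word, v in a.items())

doc = {"soccer": 5, "goal": 3, "messi": 1}
other = {"soccer": 2, "goal": 1, "pele": 2}
# The same document with every sentence repeated once.
doubled = {word: 2 * c for word, c in doc.items()}

raw_single = dot(doc, other)
raw_double = dot(doubled, other)          # twice the raw score
norm_single = dot(normalize(doc), normalize(other))
norm_double = dot(normalize(doubled), normalize(other))  # same as norm_single
```

Doubling the text doubles the raw score, while the normalized score stays the same up to floating-point rounding.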


Issues with word counts - Rare words

Common words dominate the similarity between articles, for example: the, and, goal.

How can we increase the importance of the rare words that really determine what makes an article special (such as Messi)?

What characterises a rare word? It appears infrequently in the corpus.
Do we want only rare words to dominate? No, we want to focus on important words:
- Appears frequently in the document (common locally)
- Appears rarely in the corpus (rare globally)
- Trade-off between local frequency and global rarity

TF-IDF document representation

Term frequency - inverse document frequency (tf-idf)

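Term frequency (tf): the number of times each word appears in the document, i.e. its word-count vector.

Inverse document frequency (idf): a weight for each word that shrinks as more documents in the corpus contain it:

idf(word) = log( #docs / (1 + #docs containing word) )

The tf-idf weight of a word is tf × idf.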

If a word is very common, the ratio #docs / (1 + #docs containing the word) is ≈ 1, so its idf weight is ≈ log 1 = 0.
A rare word has a large weight.
The 1 in the denominator covers the case of a word that does not appear in any document: it avoids a 0 in the denominator.

An example:
weight of word "the" = 0 × 1000 = 0
weight of word "Messi" = 5 × 4 = 20
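The weighting can be sketched as follows, assuming idf = log(#docs / (1 + #docs containing the word)), which is consistent with the 1 in the denominator discussed above. The toy corpus and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf * idf} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({
            word: tf[word] * math.log(n_docs / (1 + doc_freq[word]))
            for word in tf
        })
    return weights

# Hypothetical four-document corpus, already split into words.
docs = [
    "the messi goal the messi the".split(),
    "the goal the field".split(),
    "the election the vote".split(),
    "the market the price".split(),
]
weights = tf_idf(docs)
first = weights[0]
print(sorted(first, key=first.get, reverse=True))  # most important word first
```

In the first document "messi" gets the largest weight and "the" the smallest, even though "the" has the highest raw count. With a corpus this small, a word that occurs in every document gets a slightly negative idf, log(4/5), rather than exactly 0; the value approaches 0 as the corpus grows.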

Retrieving similar documents

A well-known approach is nearest neighbour search.

Specify: Distance metric
Output: Set of most similar articles

Nearest neighbour:
- Input: Query article
- Output: Most similar article
- Algorithm: Search over every article in the corpus, compute its distance to the query, and keep the closest one


k-Nearest neighbour
- Input: Query article
- Output: List of k similar articles
- The algorithm is the same as above, but instead of returning one article, it returns a list of the k closest articles, ranked by similarity.
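Both searches can be sketched with one brute-force function, since nearest neighbour is the k = 1 case. The distance metric here is cosine distance, chosen only as an assumption (the notes leave the metric to be specified); the corpus and the names `cosine_distance` and `k_nearest` are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two word-weight dictionaries."""
    dot = sum(v * b.get(word, 0.0) for word, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def k_nearest(query, corpus, k=1):
    """Scan every article and return the k closest as (title, distance)."""
    scored = [(title, cosine_distance(query, vector))
              for title, vector in corpus.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical corpus of word-count vectors.
corpus = {
    "pele": {"soccer": 2, "goal": 1, "pele": 2},
    "africa": {"conflict": 4, "border": 2},
    "maradona": {"soccer": 1, "cup": 3, "maradona": 2},
}
query = {"soccer": 5, "goal": 3, "messi": 1}

nearest = k_nearest(query, corpus, k=1)
top_two = k_nearest(query, corpus, k=2)
print([title for title, _ in top_two])
```

The scan touches every article once per query, so its cost grows linearly with the size of the corpus.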

Clustering documents

Supervised Learning problem

Structure documents by topic
- We have some articles that already have labels
- We have some articles that need to be classified

Unsupervised learning

No labels provided
Want to uncover cluster structure
Input : Docs as vectors
Output: Cluster labels


What defines a cluster?

Cluster defined by centre & shape/spread
Assign observation (doc) to cluster (topic label):
- Score under cluster is higher than others
- Often, just more similar to assigned cluster centre than other cluster centres (compare the distance from an observation, a dot in the graph, to the centre of each shape)
- In the example figure, the cluster of the blue dot (an article) is difficult to assign, but the cluster of the purple dot (another article) is easy to assign.

K means: A clustering algorithm

Initialize cluster centres.
Assign observations to the closest cluster centre.
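The two steps above can be sketched in plain Python. The notes stop after the assignment step, so the centre update inside the loop (move each centre to the mean of its members, then repeat) is an assumption taken from the standard k-means procedure. The function names, the fixed iteration count, and the 2-D points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    # Step 1: initialise the centres with k distinct observations.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Step 2: assign each observation to its closest centre.
        labels = [min(range(k), key=lambda j: squared_distance(p, centres[j]))
                  for p in points]
        # Assumed step (standard k-means, not in the notes above):
        # move each centre to the mean of the observations assigned to it.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres, labels

# Hypothetical 2-D document vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = k_means(points, k=2)
print(labels)
```

For documents, each point would be a tf-idf vector instead of a 2-D coordinate; the procedure is otherwise unchanged.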

Zixi-L / Maching_Learning