Zixi-L / Maching_Learning


Clustering and Similarity: Retrieving Documents #3

Open Zixi-L opened 5 years ago

Zixi-L commented 5 years ago

Question: How can we recommend an article to a person based on an article they like?


Bag of words model

Ignore the order of the words and count the number of occurrences of each word in the vocabulary. Example: "Carlos calls the sport fútbol. Emily calls the sport soccer."

From these counts we can build a vector with one entry per vocabulary word. Any word in the vocabulary that does not appear in the sentences above gets a count of 0.

[Screenshot 2019-10-14, 9:46:56 PM]
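The counting step can be sketched in a few lines of Python. This is only a minimal illustration: the helper name `bag_of_words` and the simple lowercase, regex-based tokenisation are assumptions, not something taken from the course material.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order.

    Assumption: words are lowercased and split on non-word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


sentence = "Carlos calls the sport fútbol. Emily calls the sport soccer."
counts = bag_of_words(sentence)

print(counts["calls"])  # 2
print(counts["emily"])  # 1
print(counts["messi"])  # 0, the word is not in the sentences
```

A `Counter` returns 0 for any word it has never seen, which matches the rule that words outside the sentences count as 0.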

Measuring Similarity

[Screenshot 2019-10-14, 9:50:05 PM]

The green table is the bag-of-words vector for an article about Messi, and the blue table is the one for an article about Pelé. The similarity of the two articles, computed as the sum of the element-wise products of their word counts, is 13.

[Screenshot 2019-10-14, 10:00:51 PM]

Now compare the soccer article with an article about a conflict in Africa (red table). The two articles share no counted words, so the sum is 0.
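A minimal sketch of this similarity measure, assuming it is the sum of element-wise products of the word counts. The word counts below are made up for illustration (the tables in the screenshots are not available); they are chosen only so that the sums match the 13 and 0 in the notes.

```python
def similarity(a, b):
    """Sum of element-wise products of two word-count dictionaries."""
    return sum(count * b.get(word, 0) for word, count in a.items())


# Hypothetical word counts, not the real tables from the screenshots.
messi_article = {"soccer": 1, "goal": 5, "messi": 3}
pele_article = {"soccer": 3, "goal": 2, "pele": 4}
africa_article = {"conflict": 4, "africa": 3}

print(similarity(messi_article, pele_article))    # 13
print(similarity(messi_article, africa_article))  # 0
```

Only words that occur in both articles contribute to the sum, which is why two articles with no words in common score 0.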

Problem: If we lengthen an article, for example by repeating each sentence a few times, the sum gets bigger even though the content has not changed.

Solution

[Screenshot 2019-10-14, 10:27:21 PM]
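The content of the screenshot is not recoverable here. One common remedy for the length problem, shown below purely as an assumption about what the solution might look like, is to divide each vector by its Euclidean norm before comparing (cosine similarity). The word counts are hypothetical.

```python
import math


def dot(a, b):
    """Sum of element-wise products of two word-count dictionaries."""
    return sum(count * b.get(word, 0) for word, count in a.items())


def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot(a, b) / (norm_a * norm_b)


# Hypothetical word counts for illustration.
doc = {"soccer": 1, "goal": 5, "messi": 3}
other = {"soccer": 3, "goal": 2, "pele": 4}
doubled = {word: 2 * count for word, count in doc.items()}  # text repeated twice

print(dot(doc, other), dot(doubled, other))  # 13 26
print(math.isclose(cosine_similarity(doc, other),
                   cosine_similarity(doubled, other)))  # True
```

Repeating the text doubles the raw sum, while the normalised score stays the same.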

Zixi-L commented 5 years ago

Issues with word counts - Rare words

Common words dominate the similarity between articles.

How can we increase the importance of rare words that really determine what an article is specifically about (such as "Messi")?

TF-IDF document representation

Term frequency - inverse document frequency (tf-idf)

Term frequency

Inverse document frequency

[Screenshot 2019-10-21, 6:21:47 PM]

[Screenshot 2019-10-21, 6:50:09 PM]

An example

[Screenshot 2019-10-21, 7:16:09 PM]

Weight of the word "the" = 0 × 1000 = 0

Weight of the word "Messi" = 5 × 4 = 20
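The two weights can be reproduced with a short sketch. The formulas in the screenshots are not available, so the inverse document frequency used here, log2(N / (1 + number of documents containing the word)), is an assumption. The corpus size of 64 and the document counts 63 and 3 are hypothetical values picked so that the results match the weights above.

```python
import math


def tf_idf(term_count, docs_with_term, total_docs):
    """Term frequency times inverse document frequency.

    Assumption: idf = log2(total_docs / (1 + docs_with_term)).
    """
    idf = math.log2(total_docs / (1 + docs_with_term))
    return term_count * idf


# Hypothetical corpus of 64 documents.
print(tf_idf(1000, 63, 64))  # "the": very frequent, but in almost every document -> 0.0
print(tf_idf(5, 3, 64))      # "Messi": only 5 occurrences, but in few documents -> 20.0
```

A word that appears in nearly every document gets an idf close to 0, so even a very large count contributes nothing, while a rare word is weighted up.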

Retrieving similar documents

A well-known approach is nearest neighbour search.

  1. Nearest neighbour:
    • Input: Query article
    • Output: Most similar article
    • Algorithm: Search over every article in the corpus and keep the one most similar to the query

[Screenshot 2019-10-21, 7:23:40 PM]

  2. k-Nearest neighbour
    • Input: Query article
    • Output: List of the k most similar articles
    • Algorithm: Same as above, but instead of returning one article, return a list of k articles ranked by similarity.
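Both searches can be sketched as a brute-force scan over the corpus. This is a minimal illustration: the corpus layout (a dictionary from title to word counts), the function names, and the word counts are all hypothetical, and the similarity is assumed to be the sum of element-wise products of the counts.

```python
import heapq


def similarity(a, b):
    """Sum of element-wise products of two word-count dictionaries."""
    return sum(count * b.get(word, 0) for word, count in a.items())


def nearest_neighbour(query, corpus):
    """Return the title of the article most similar to the query."""
    return max(corpus, key=lambda title: similarity(query, corpus[title]))


def k_nearest_neighbours(query, corpus, k):
    """Return the k most similar titles, most similar first."""
    return heapq.nlargest(
        k, corpus, key=lambda title: similarity(query, corpus[title])
    )


# Hypothetical corpus and query.
corpus = {
    "pele": {"soccer": 3, "goal": 2, "pele": 4},
    "africa": {"conflict": 4, "africa": 3},
    "maradona": {"goal": 1, "maradona": 2},
}
query = {"soccer": 1, "goal": 5, "messi": 3}

print(nearest_neighbour(query, corpus))        # pele
print(k_nearest_neighbours(query, corpus, 2))  # ['pele', 'maradona']
```

Both functions compare the query against every article, so the cost grows linearly with the size of the corpus.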
Zixi-L commented 5 years ago

Clustering documents

Supervised Learning problem

Unsupervised learning

[Screenshot 2019-10-23, 9:46:18 PM]

What defines a cluster?

k-means: A clustering algorithm

  1. Initialize the cluster centres.
  2. Assign each observation to its closest cluster centre.
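The two steps listed so far can be sketched as follows. This covers only initialization and assignment, not a full k-means loop. Picking random data points as the initial centres and measuring closeness with squared Euclidean distance are assumptions, and the function names and sample points are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def initialize_centres(points, k, seed=0):
    """Step 1 (assumed variant): pick k distinct data points as initial centres."""
    rng = random.Random(seed)
    return rng.sample(points, k)


def assign_to_closest(points, centres):
    """Step 2: for each point, return the index of its closest centre."""
    return [
        min(range(len(centres)), key=lambda j: squared_distance(p, centres[j]))
        for p in points
    ]


# Hypothetical observations forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]

centres = initialize_centres(points, k=2)
print(centres)  # two of the points above, chosen at random

# With fixed centres the assignment is deterministic.
print(assign_to_closest(points, [(0, 0), (10, 10)]))  # [0, 0, 1, 1]
```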