Open Zixi-L opened 5 years ago
Common words dominate the similarity between articles.
How to increase the importance of rare words that really determine what is special about an article (such as "Messi"):
Term frequency - inverse document frequency (tf-idf)
Term frequency
inverse document frequency
An example:
weight of word "the" = 0 × 1000 = 0
weight of word "Messi" = 5 × 4 = 20
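The example weights above can be reproduced with a short sketch. The notes do not spell out the formulas, so this assumes tf is the raw count of the word in the article and idf = log2(N / (1 + df)), where N is the number of articles in the corpus and df is the number of articles containing the word. The corpus figures (64 articles, "the" in 63 of them, "Messi" in 3) are hypothetical, chosen only so that the numbers match the example.

```python
import math

def tf_idf(term_count, num_docs, docs_with_term):
    """tf-idf weight under assumed definitions (not taken from the notes):
    tf  = raw count of the word in the article
    idf = log2(num_docs / (1 + docs_with_term))
    """
    idf = math.log2(num_docs / (1 + docs_with_term))
    return term_count * idf

# Hypothetical corpus of 64 articles.
weight_the = tf_idf(term_count=1000, num_docs=64, docs_with_term=63)
weight_messi = tf_idf(term_count=5, num_docs=64, docs_with_term=3)

print(weight_the)    # 0.0
print(weight_messi)  # 20.0
```

Under these assumptions a word that appears in almost every article gets an idf of 0, so its weight vanishes no matter how often it occurs, while a rare word keeps a large weight.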
A famous example is nearest neighbour search.
Question: how can we recommend an article to a person based on an article they like?
Bag of words model
Ignore the order of words. Count the number of instances of each word in the vocabulary. Example text: "Carlos calls the sport fútbol. Emily calls the sport soccer."
We can then create a vector of word counts: any other word you can think of that is not in the sentences above counts 0.
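A minimal sketch of the counting step, using the example sentences above. The helper name `bag_of_words` and the simple tokenisation (lowercasing, splitting on whitespace, stripping punctuation) are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    words = [w.strip(".,;:!?\"'") for w in text.lower().split()]
    return Counter(w for w in words if w)

bag = bag_of_words("Carlos calls the sport fútbol. Emily calls the sport soccer.")

print(bag["calls"])   # 2
print(bag["sport"])   # 2
print(bag["carlos"])  # 1
print(bag["messi"])   # 0, a word that is not in the text counts as 0
```

A `Counter` returns 0 for any missing key, which matches the rule that words outside the text count 0.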
Measuring Similarity
The green table is a bag-of-words model for an article about Messi, and the blue table is one for an article about Pele. Multiplying the counts word by word and summing the products gives a similarity of 13 for the two articles.
Now compare the soccer article with an article about a conflict in Africa (red table): the sum is 0.
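The comparison can be sketched as a sum of word-by-word products of counts (a dot product). The counts below are hypothetical and are not the ones from the green, blue, and red tables; they only illustrate that articles sharing words score above 0 and articles sharing none score 0.

```python
from collections import Counter

def similarity(bag_a, bag_b):
    """Sum of word-by-word products of counts (a dot product)."""
    return sum(count * bag_b[word] for word, count in bag_a.items())

# Hypothetical counts, not the ones from the tables in the notes.
soccer_a = Counter({"messi": 1, "goal": 3, "soccer": 2})
soccer_b = Counter({"pele": 2, "goal": 1, "soccer": 4})
conflict = Counter({"conflict": 3, "africa": 2, "peace": 1})

print(similarity(soccer_a, soccer_b))  # 11
print(similarity(soccer_a, conflict))  # 0
```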
Problem: if we lengthen the text, for example by just repeating each sentence a few times, the sum gets bigger even though the content has not changed.
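The length effect is easy to see in a small sketch. The counts and the helper `repeat` are hypothetical; repeating both texts twice doubles every count, so the sum of products grows fourfold although the articles say nothing new.

```python
from collections import Counter

def similarity(bag_a, bag_b):
    """Sum of word-by-word products of counts."""
    return sum(count * bag_b[word] for word, count in bag_a.items())

def repeat(bag, times):
    """Bag of words for the same text repeated `times` times."""
    return Counter({word: count * times for word, count in bag.items()})

# Hypothetical counts for two articles.
article_a = Counter({"goal": 3, "soccer": 2})
article_b = Counter({"goal": 1, "soccer": 4})

print(similarity(article_a, article_b))                        # 11
print(similarity(repeat(article_a, 2), repeat(article_b, 2)))  # 44
```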
Solution