This is the dataset.
Make sure to read the readme.txt.
Here is also a Wikipedia article on the bag-of-words (BOW) model.
And here is the Coursera lecture that explained the concepts to me in detail (highly recommended).
Here are my notes so far:
bag-of-words model:
bag model vs. set model:
store how many times a word occurs in the doc (its count) instead of a boolean (see the sketch below)
term count = term frequency
the word ordering is ignored: "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
dimensions: ca. 100,000 (vocabulary size)
weight measure: TF-IDF
similarity measure: Jaccard/Tanimoto or cosine
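To make the counting concrete, here is a minimal Python sketch of turning documents into count vectors over a shared vocabulary (the documents and vocabulary are just made up for illustration, not from the dataset):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

docs = ["John is quicker than Mary", "Mary is quicker than John"]
vocab = sorted(set(w for d in docs for w in d.lower().split()))

# Both documents yield the same vector, because ordering is ignored.
for d in docs:
    print(d, "->", bow_vector(d, vocab))
```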
jaccard:
does not consider term frequency
does not consider the importance of rare terms
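A quick sketch of Jaccard/Tanimoto on term sets (again with invented example strings); note that repeating a word changes nothing, because only set membership counts:

```python
def jaccard(doc_a, doc_b):
    """Jaccard similarity on term sets: |A intersect B| / |A union B|."""
    a = set(doc_a.lower().split())
    b = set(doc_b.lower().split())
    return len(a & b) / len(a | b)

# Term frequency is ignored: duplicating "quick" does not change the score.
print(jaccard("the quick fox", "the quick quick fox"))  # 1.0
```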
cosine similarity:
better than Euclidean distance, because it compares the angle between vectors rather than their absolute lengths
calculate term frequencies (counts)
log-frequency weighting: w = 1 + log10(tf) if tf > 0, else 0
length normalisation: divide by the vector length, so long and short documents get the same weight
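Putting the cosine pieces together, here is a minimal sketch (documents invented for illustration) that applies log-frequency weighting and length normalisation; like the Excel example below, it skips IDF weighting:

```python
import math
from collections import Counter

def log_tf_vector(doc, vocab):
    """Log-frequency weighting: w = 1 + log10(tf) if tf > 0, else 0."""
    counts = Counter(doc.lower().split())
    return [1 + math.log10(counts[t]) if counts[t] > 0 else 0.0 for t in vocab]

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalised vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

docs = ["John is quicker than Mary", "Mary is quicker than John John"]
vocab = sorted(set(w for d in docs for w in d.lower().split()))
vecs = [log_tf_vector(d, vocab) for d in docs]
print(cosine(vecs[0], vecs[1]))

# The full TF-IDF scheme from the lecture would additionally multiply each
# weight by idf_t = log10(N / df_t); it is deliberately omitted here.
```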
Here is an Excel example that I made from the video lecture (IDF weighting is ignored):
Hope it helps somebody.
Please leave comments with further suggestions.