TU-Berlin-DIMA / IMPRO-3.SS14

KMEans

Clustering - UCI Bag-Of-Words Dataset #17

oresti closed this issue 10 years ago

oresti commented 10 years ago

This is the dataset. Make sure to read the readme.txt. Here is a Wikipedia article on the BOW model, and here is the Coursera lecture that explained the concepts to me in detail (highly recommended).

Here are my notes so far on the bag-of-words model:

- dimensions: ca. 100,000 (vocabulary size)
- weight measure: TF-IDF
- similarity measure: Jaccard/Tanimoto or cosine
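To make the TF-IDF weighting in the notes concrete, here is a minimal sketch. It assumes raw term counts for TF and `log(N / df)` for IDF, which is only one common variant; the thread does not say which one we will use. The function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency (assumed TF variant)
        weighted.append({
            term: count * math.log(n_docs / df[term])  # assumed IDF variant
            for term, count in counts.items()
        })
    return weighted

# Hypothetical toy corpus, not taken from the dataset.
docs = [
    ["apple", "banana", "banana"],
    ["apple", "cherry"],
    ["apple", "banana", "cherry"],
]
weights = tf_idf(docs)
```

With this variant, a term that occurs in every document (here `apple`) gets weight zero, so it no longer contributes to any similarity score.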

Jaccard:
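As a rough sketch of the measure, assuming the standard definitions: Jaccard compares two documents as term sets, `|A ∩ B| / |A ∪ B|`, and Tanimoto extends this to weighted vectors as `x·y / (|x|² + |y|² − x·y)`. The function names and the sparse `{term: weight}` dict representation are my own choices for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty documents
    return len(a & b) / len(union)

def tanimoto(x, y):
    """Tanimoto similarity of two sparse weight vectors ({term: weight} dicts)."""
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    denom = sum(w * w for w in x.values()) + sum(w * w for w in y.values()) - dot
    return dot / denom if denom else 0.0

# Hypothetical toy documents.
set_score = jaccard(["a", "b", "c"], ["b", "c", "d"])
vec_score = tanimoto({"a": 1, "b": 1, "c": 1}, {"b": 1, "c": 1, "d": 1})
```

On binary (0/1) vectors the two functions agree, which is why the names are often used interchangeably.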

cosine similarity:
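A minimal sketch of cosine similarity, assuming the standard definition `x·y / (|x| |y|)` on sparse `{term: weight}` dicts. The weights can be raw counts or TF-IDF values; the function name and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse weight vectors ({term: weight} dicts)."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_x * norm_y)

# Hypothetical raw term counts: the second document is the first one doubled.
same_direction = cosine_similarity({"a": 1, "b": 2}, {"a": 2, "b": 4})
no_overlap = cosine_similarity({"a": 1}, {"b": 1})
```

Because the score depends only on direction, a document and a longer copy of it with the same term proportions score (up to rounding) 1, while documents with no shared terms score 0.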

Here is an Excel example that I made from the video lecture (IDF weighting is ignored): cosine

Hope it helps somebody. Please leave comments with further suggestions.

oresti commented 10 years ago

closing this.