This is the dataset.
Make sure to read the readme.txt.
Here is also a Wikipedia article on the bag-of-words (BOW) model.
And here is the Coursera lecture that explained the concepts to me in detail (highly recommended).
Here are my notes so far:
bag-of-words model:
bag model vs. set model:
store how many times a word occurs in the doc (its count) instead of a boolean (see the sketch below)
term count = term frequency
the word ordering is ignored: "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
dimensions: ca. 100,000 (vocabulary size)
weight measure: TF-IDF
similarity measure: Jaccard/Tanimoto or cosine
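To make the counting concrete, here is a minimal Python sketch of turning documents into count vectors over a shared vocabulary (the documents and vocabulary are just made up for illustration, not from the dataset):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

docs = ["John is quicker than Mary", "Mary is quicker than John"]
vocab = sorted(set(w for d in docs for w in d.lower().split()))

# Both documents yield the same vector, because ordering is ignored.
for d in docs:
    print(d, "->", bow_vector(d, vocab))
```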
jaccard:
does not consider term frequency
does not consider the importance of rare terms
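A quick sketch of Jaccard/Tanimoto on term sets (again with invented example strings); note that repeating a word changes nothing, because only set membership counts:

```python
def jaccard(doc_a, doc_b):
    """Jaccard similarity on term sets: |A intersect B| / |A union B|."""
    a = set(doc_a.lower().split())
    b = set(doc_b.lower().split())
    return len(a & b) / len(a | b)

# Term frequency is ignored: duplicating "quick" does not change the score.
print(jaccard("the quick fox", "the quick quick fox"))  # 1.0
```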
cosine similarity:
better than Euclidean distance, because it compares the angle between vectors rather than their absolute lengths
calculate term frequencies (counts)
log-frequency weighting: w = 1 + log10(tf) if tf > 0, else 0
length normalisation: divide by the vector length, so long and short documents get the same weight
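Putting the cosine pieces together, here is a minimal sketch (documents invented for illustration) that applies log-frequency weighting and length normalisation; like the Excel example below, it skips IDF weighting:

```python
import math
from collections import Counter

def log_tf_vector(doc, vocab):
    """Log-frequency weighting: w = 1 + log10(tf) if tf > 0, else 0."""
    counts = Counter(doc.lower().split())
    return [1 + math.log10(counts[t]) if counts[t] > 0 else 0.0 for t in vocab]

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalised vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

docs = ["John is quicker than Mary", "Mary is quicker than John John"]
vocab = sorted(set(w for d in docs for w in d.lower().split()))
vecs = [log_tf_vector(d, vocab) for d in docs]
print(cosine(vecs[0], vecs[1]))

# The full TF-IDF scheme from the lecture would additionally multiply each
# weight by idf_t = log10(N / df_t); it is deliberately omitted here.
```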
Here is an Excel example that I made from the video lecture (IDF weighting is ignored):
Hope it helps somebody.
Please leave comments with further suggestions.