Open murphp15 opened 6 years ago
Cosine distance is often used with TF-IDF to ensure the model is not biased towards paragraphs based on length. That is, the magnitude is ignored by design.
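To illustrate the length point, here is a minimal sketch in plain Python. The three-term vocabulary and the weights are made up, and it assumes raw-count term frequency with fixed IDF, so repeating a paragraph three times simply triples its vector. None of this is taken from the ranker's actual code.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical TF-IDF weights over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
short_paragraph = [2.0, 1.0, 1.0]
# Same content repeated three times: same direction, three times the magnitude.
long_paragraph = [3 * x for x in short_paragraph]

dot_short = dot(query, short_paragraph)
dot_long = dot(query, long_paragraph)
cos_short = cosine_similarity(query, short_paragraph)
cos_long = cosine_similarity(query, long_paragraph)

print("dot:   ", dot_short, dot_long)  # the longer paragraph scores higher
print("cosine:", cos_short, cos_long)  # both paragraphs score the same
```

Under these assumptions the dot product rewards the longer paragraph purely for being longer, while the cosine score is unchanged.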
Note that it still matters how many times a word occurs: if one word occurs ten times more often than another, the normalized TF-IDF vector will still "point" toward the more common word more than toward the uncommon one.
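A small sketch of that point, again with made-up numbers: a two-term vocabulary where word_a appears ten times as often as word_b in the paragraph. It compares cosine similarity on the weighted vector against a binary "contains the word" vector. The vectors and names are hypothetical and only meant to show the effect.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vocabulary: [word_a, word_b]. All weights are illustrative.
query_a = [1.0, 0.0]           # query containing only word_a
query_b = [0.0, 1.0]           # query containing only word_b
paragraph = [10.0, 1.0]        # word_a weighted ten times more than word_b
binary_paragraph = [1.0, 1.0]  # presence/absence only

sim_a = cosine_similarity(query_a, paragraph)      # about 0.995
sim_b = cosine_similarity(query_b, paragraph)      # about 0.0995
bin_a = cosine_similarity(query_a, binary_paragraph)
bin_b = cosine_similarity(query_b, binary_paragraph)

print("weighted:", sim_a, sim_b)  # word_a query matches far better
print("binary:  ", bin_a, bin_b)  # both queries match equally
```

So cosine similarity on TF-IDF vectors is not equivalent to a presence check: the relative counts within the paragraph still shape the score, and only the overall scale is discarded.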
In the method score_paragraphs (in the class ShallowOpenWebRanker) the cosine distance metric is used. Should it not be using the dot product instead, since the cosine metric ignores magnitude whereas a TF-IDF vector carries both direction and magnitude? From what I can see, this would be similar to asking only whether the paragraph contains the word, without taking into account how many times the word occurs.