Why is cosine distance used in document_splitter.py

allenai / document-qa

Apache License 2.0


Why is cosine distance used in document_splitter.py #35

Open murphp15 opened 6 years ago

murphp15 commented 6 years ago

In the method score_paragraphs (in the class ShallowOpenWebRanker), the cosine distance metric is used. Should it not use the dot product instead? The cosine metric does not take magnitude into account, whereas a TF-IDF vector has both direction and magnitude. From what I can see, this is similar to asking whether the paragraph contains the word without taking into account how many times the word occurs.

chrisc36 commented 6 years ago

Cosine distance is often used with TF-IDF to ensure the model is not biased towards paragraphs based on length. That is, the magnitude is ignored by design.

Note that it still matters how many times a word occurs: if one word occurs ten times more often than another, the normalized TF-IDF vector will still "point" toward the more common word more than toward the less common one.
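A small sketch may make both points concrete. It is not the code from document_splitter.py: the two-word vocabulary, the weights, and the helper names `dot` and `cosine_similarity` are made up for illustration. It computes cosine similarity (cosine distance is 1 minus this value) and compares it with the raw dot product.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical TF-IDF-style weights over a two-word vocabulary [word_a, word_b].
query = [1.0, 0.0]  # the question only mentions word_a
paragraphs = {
    "short": [10.0, 1.0],  # word_a is ten times heavier than word_b
    "long": [20.0, 2.0],   # same proportions, twice the length
    "other": [1.0, 10.0],  # word_b dominates instead
}

dot_scores = {name: dot(query, vec) for name, vec in paragraphs.items()}
cos_scores = {name: cosine_similarity(query, vec) for name, vec in paragraphs.items()}

for name in paragraphs:
    print(f"{name:>5}: dot={dot_scores[name]:.3f}  cosine={cos_scores[name]:.3f}")
```

With these made-up numbers, the dot product scores "long" twice as high as "short" purely because it is longer, while cosine similarity scores them the same. Cosine similarity still ranks "short" well above "other", because the relative frequency of word_a determines the direction of the vector.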