biolab / orange3-text

🍊 📄 Text Mining add-on for Orange3

TF-IDF: change to scikit-learn #1069

Open ajdapretnar opened 4 months ago

ajdapretnar commented 4 months ago

Orange uses the following formula for IDF: math.log10(number_of_docs / number_of_docs_with_word). With this formula, a word that appears in every document gets an IDF of 0, so its TF-IDF values are all 0 and the word is removed by subsequent preprocessors. To avoid this, one can use Smooth IDF, which uses math.log10(1 + number_of_docs / number_of_docs_with_word).
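To make the zeroing effect concrete, here is a minimal sketch of the two formulas exactly as written above. The function names and the 4-document corpus are hypothetical and only for illustration; this is not Orange's actual code.

```python
import math


def orange_idf(number_of_docs, number_of_docs_with_word):
    """Plain IDF as quoted in this issue: log10(N / df)."""
    return math.log10(number_of_docs / number_of_docs_with_word)


def orange_smooth_idf(number_of_docs, number_of_docs_with_word):
    """Smooth IDF as quoted in this issue: log10(1 + N / df)."""
    return math.log10(1 + number_of_docs / number_of_docs_with_word)


# Hypothetical corpus of 4 documents.
number_of_docs = 4

# A word present in all 4 documents: N / df == 1, and log10(1) == 0,
# so every TF-IDF value for this word is 0 regardless of TF.
idf_everywhere = orange_idf(number_of_docs, 4)

# The smooth variant gives log10(2) for the same word, so it is not zeroed.
smooth_everywhere = orange_smooth_idf(number_of_docs, 4)

# A word present in only 1 document keeps a positive weight either way.
idf_rare = orange_idf(number_of_docs, 1)

print(idf_everywhere, smooth_everywhere, idf_rare)
```

The plain variant returns exactly 0 for the ubiquitous word, which is why a later filter that drops all-zero columns removes it.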

Why is this a problem? This is not the same as in scikit-learn:
a) IDF (smooth_idf=False) is math.log(number_of_docs / number_of_docs_with_word) + 1.
b) Smooth IDF (smooth_idf=True, the default) is math.log((1 + number_of_docs) / (1 + number_of_docs_with_word)) + 1.
c) Scikit-learn uses the natural log, while we use log10 (not a big issue, since the log term only changes by a constant factor, but still).
d) TF, when computing TF-IDF, is not normalized by document length, which is also standard.
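For comparison, a rough sketch of the IDF formulas as I understand them from the scikit-learn TfidfTransformer documentation (natural log, with 1 added to the result). It is pure Python and does not call scikit-learn; the function name and the counts are hypothetical. Note also that, as far as I know, scikit-learn L2-normalizes each document vector by default (norm='l2'), which this sketch does not cover.

```python
import math


def sklearn_style_idf(number_of_docs, number_of_docs_with_word, smooth=True):
    """IDF following the formulas in the TfidfTransformer docs (assumed here).

    smooth=True:  ln((1 + N) / (1 + df)) + 1
    smooth=False: ln(N / df) + 1
    """
    if smooth:
        return math.log((1 + number_of_docs) / (1 + number_of_docs_with_word)) + 1
    return math.log(number_of_docs / number_of_docs_with_word) + 1


# Hypothetical corpus of 4 documents; the word appears in all of them.
number_of_docs = 4

# Because of the trailing "+ 1", a word present in every document
# gets weight 1 rather than 0 in both variants.
smooth_everywhere = sklearn_style_idf(number_of_docs, 4, smooth=True)
plain_everywhere = sklearn_style_idf(number_of_docs, 4, smooth=False)

# Point c): natural log and log10 differ only by the constant ln(10).
base_ratio = math.log(number_of_docs / 1) / math.log10(number_of_docs / 1)

print(smooth_everywhere, plain_everywhere, base_ratio)
```

The log base only rescales the log term, but the trailing + 1 is a real difference: it is what keeps words that occur in every document from being zeroed out.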

We should probably use scikit here. This would, of course, affect teaching materials.