TF-IDF: change to scikit-learn

Orange uses the following formula for IDF: math.log10(number_of_docs/number_of_docs_with_word). With this formula, a word that appears in all documents gets an IDF of 0 (log10(1) = 0), so its TF-IDF values are all 0 and subsequent preprocessors remove it. To avoid this, one can use Smooth IDF, which uses math.log10(1 + number_of_docs/number_of_docs_with_word).
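A minimal sketch of the two formulas quoted above, to make the zeroing effect concrete. The function names are hypothetical and are not Orange's actual API; only the formulas come from the description above.

```python
import math


def orange_idf(number_of_docs, number_of_docs_with_word):
    # Plain IDF as quoted above (hypothetical helper, not Orange's API).
    return math.log10(number_of_docs / number_of_docs_with_word)


def orange_smooth_idf(number_of_docs, number_of_docs_with_word):
    # Smooth IDF as quoted above (hypothetical helper, not Orange's API).
    return math.log10(1 + number_of_docs / number_of_docs_with_word)


number_of_docs = 4
for docs_with_word in (1, 2, 4):
    print(
        docs_with_word,
        orange_idf(number_of_docs, docs_with_word),
        orange_smooth_idf(number_of_docs, docs_with_word),
    )
```

When docs_with_word equals number_of_docs, the plain IDF is exactly 0, while the smooth variant stays positive (log10(2)).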

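Why is this a problem? This is not the same as in scikit-learn (TfidfTransformer):
a) Plain IDF (smooth_idf=False) is math.log(number_of_docs / number_of_docs_with_word) + 1. The added 1 means words that appear in all documents are not zeroed out.
b) Smooth IDF (smooth_idf=True) is math.log((number_of_docs + 1) / (number_of_docs_with_word + 1)) + 1.
c) Scikit uses the natural log, while we use log10. This is not a big issue, as the logarithm term is only rescaled by a constant factor, but it is still a difference.
d) TF, when computing TF-IDF, is not normalized by document length, even though that normalization is also standard.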
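The differences can be sketched in plain Python. The scikit-learn formulas below are written from my reading of the TfidfTransformer documentation (smooth_idf=False and smooth_idf=True) and are not a call into the library itself. All function names are hypothetical.

```python
import math


def orange_idf(n_docs, n_docs_with_word):
    # Orange's plain IDF as described in this issue (log base 10).
    return math.log10(n_docs / n_docs_with_word)


def sklearn_idf(n_docs, n_docs_with_word):
    # Assumed to match TfidfTransformer(smooth_idf=False): ln(n / df) + 1.
    return math.log(n_docs / n_docs_with_word) + 1


def sklearn_smooth_idf(n_docs, n_docs_with_word):
    # Assumed to match TfidfTransformer(smooth_idf=True):
    # ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + n_docs_with_word)) + 1


n_docs = 4
for df in (1, 2, 4):
    print(df, orange_idf(n_docs, df), sklearn_idf(n_docs, df),
          sklearn_smooth_idf(n_docs, df))

# Point c): natural log and log10 differ only by the constant ln(10).
assert math.isclose(math.log(n_docs / 1), math.log(10) * orange_idf(n_docs, 1))
```

For a word that appears in all documents, both scikit-learn variants give 1.0 rather than 0, so the word keeps a non-zero weight.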

We should probably use scikit here. This would, of course, affect teaching materials.

biolab / orange3-text

TF-IDF: change to scikit-learn #1069