Orange uses the following formula for IDF: math.log10(number_of_docs/number_of_docs_with_word). With this formula, words that appear in all documents get an IDF of 0, and subsequent preprocessors then remove them. To avoid this, one can use Smooth IDF, which uses math.log10(1 + number_of_docs/number_of_docs_with_word).
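A minimal sketch of that behaviour, using hypothetical corpus counts (variable and function names are mine, not Orange's):

```python
import math

n_docs = 4      # total number of documents in the corpus
df_common = 4   # a word that occurs in every document
df_rare = 1     # a word that occurs in a single document

def idf(df):
    # Orange's current IDF formula
    return math.log10(n_docs / df)

def smooth_idf(df):
    # Orange's Smooth IDF variant
    return math.log10(1 + n_docs / df)

print(idf(df_common))         # 0.0 -> the word gets zeroed and later dropped
print(idf(df_rare))           # log10(4), a rare word keeps a positive weight
print(smooth_idf(df_common))  # log10(2), stays non-zero even for all-document words
```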
Why is this a problem? This is not the same as in scikit.
a) scikit's IDF is math.log(number_of_docs / number_of_docs_with_word) + 1
b) scikit's Smooth IDF is math.log((number_of_docs + 1) / (number_of_docs_with_word + 1)) + 1
c) Scikit uses natural log, while we use log10 (not a big issue, as all numbers are multiplied by a constant, but still)
d) Our TF, when computing TF-IDF, is not normalized by document length, which is also standard.
We should probably use scikit here. This would, of course, affect teaching materials.
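To make the mismatch concrete, here is a side-by-side sketch for a word that appears in every document (the scikit-learn expressions follow the TfidfTransformer documentation; variable names are mine):

```python
import math

n_docs, df = 4, 4  # a word present in all four documents

# Orange (current)
orange_idf = math.log10(n_docs / df)               # 0.0 -> word gets dropped
orange_smooth = math.log10(1 + n_docs / df)        # log10(2), non-zero

# scikit-learn TfidfTransformer (natural log, +1 offset keeps weights non-zero)
sk_idf = math.log(n_docs / df) + 1                 # 1.0 (smooth_idf=False)
sk_smooth = math.log((1 + n_docs) / (1 + df)) + 1  # 1.0 (smooth_idf=True, the default)

# scikit-learn also L2-normalizes each document's TF-IDF vector (norm='l2'),
# which relates to the document-length normalization mentioned in point d).
print(orange_idf, sk_idf, sk_smooth)
```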