Open lubomirkrcmar opened 11 years ago
Hi Luboš,
You bring up a good point. Our implementation of TF-IDF uses the term's probability in the document rather than its raw frequency. Using the probability discounts the impact of differently sized documents, where the raw frequency of a single term may differ significantly.
There are several ways to adjust the TF value (the Wikipedia page mentions a few others, as well), but I don't think our docs mention anywhere which one we're using. It would be pretty helpful to be able to adjust the TF transform as well. I don't think this is too much work, so if you want, I'm happy to try extending the code with a few options.
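To illustrate the length-discounting David describes, here is a minimal sketch (not S-Space code, just the tf = count / docLength idea in isolation) showing that a term used in the same proportion in a short and a long document receives the same normalized weight:

```java
public class NormalizedTfDemo {

    // tf as TfIdfTransform computes it: term count divided by document length
    public static double tf(double count, double docLength) {
        return count / docLength;
    }

    public static void main(String[] args) {
        // Same proportional usage of a term in a short and a long document
        System.out.println(tf(5, 100));   // 0.05
        System.out.println(tf(50, 1000)); // 0.05 -- same weight despite 10x the raw count
    }
}
```

With raw counts instead, the long document would weigh ten times as much for the same relative usage.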
Thanks, David
On Mon, May 20, 2013 at 3:13 PM, lubomirkrcmar notifications@github.com wrote:
Hi SSpace team,
I believe TF-IDF is not calculated correctly. Am I right?
Why is the value in the following divided by docTermCount[column]? I think it should be just tf = value;, since tf stands for the term frequency in a given document, not for the probability of the term in the document (the division case). At least, the Wikipedia page referred to in the code says so.
class TfIdfTransform in edu.ucla.sspace.matrix:
public double transform(int row, int column, double value) {
    double tf = value / docTermCount[column];
    double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
    return tf * idf;
}
The same division appears in the following method: public double transform(int row, DoubleVector column) { ... }
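To make the disagreement concrete, here is a small standalone sketch (illustrative numbers, not S-Space code) comparing what the current implementation computes against the plain tf = value variant Luboš expected:

```java
public class TfIdfComparison {

    // idf as in TfIdfTransform: log(totalDocCount / (termDocCount + 1))
    public static double idf(double totalDocCount, double termDocCount) {
        return Math.log(totalDocCount / (termDocCount + 1));
    }

    // Current S-Space behavior: tf normalized by document length
    public static double normalizedTfIdf(double count, double docLength,
                                         double totalDocs, double docFreq) {
        return (count / docLength) * idf(totalDocs, docFreq);
    }

    // Plain raw-count tf, as on the Wikipedia page
    public static double rawTfIdf(double count, double totalDocs, double docFreq) {
        return count * idf(totalDocs, docFreq);
    }

    public static void main(String[] args) {
        // A term occurring 5 times in a 100-token document,
        // appearing in 9 of 1000 documents
        System.out.println(normalizedTfIdf(5, 100, 1000, 9)); // ~0.23
        System.out.println(rawTfIdf(5, 1000, 9));             // ~23.0
    }
}
```

The two values differ exactly by the factor docTermCount[column], so the document-level ranking within one document is unchanged, but comparisons across documents of different lengths are not.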
Cheers, Luboš
Hi David,
I understand, thanks. I run some experiments in Information Retrieval, and yes, I would welcome more variants of the TF-IDF transform. It would also be nice to update the comment on the currently implemented variant to mention that tf is normalized.
In Introduction to Information Retrieval (2008), Manning writes about "normalization weighting" in addition to "term frequency" and "document frequency" weighting: 1/u (pivoted normalization) corresponds to what you use in your TF-IDF implementation in S-Space, I believe.
Currently, in terms of (term frequency weighting, document frequency weighting, normalization weighting), the NoTransform class in your project corresponds to (tf, no, noNorm), TfIdfTransform to (tf, idf, pivotNorm), and LogEntropyTransform to (tf, entropy, noNorm). There are also some other classes I do not know well yet, such as PointwiseMutualInformationTransform and LogLikelihoodTransform.
I believe TfIdfNoNormTransform and LogIdfNoNormTransform might be interesting candidates for implementation. Maybe sqrt weighting of the term frequency is also interesting (the Lucene library uses sqrt).
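The variants discussed in this thread could be sketched as a single pluggable weighting. This is only an illustration of the idea; the enum and method names below are hypothetical and do not exist in S-Space:

```java
// Hypothetical sketch of a configurable term-frequency weighting.
// None of these names are S-Space API; they only illustrate the options
// mentioned in this thread (raw, length-normalized, log, and Lucene-style sqrt).
public class TfWeighting {

    public enum TfVariant { RAW, NORMALIZED, LOG, SQRT }

    public static double tf(TfVariant variant, double count, double docLength) {
        switch (variant) {
            case RAW:        return count;                              // plain term frequency
            case NORMALIZED: return count / docLength;                  // current TfIdfTransform behavior
            case LOG:        return count > 0 ? 1 + Math.log(count) : 0; // sublinear log scaling
            case SQRT:       return Math.sqrt(count);                   // Lucene-style damping
            default:         throw new IllegalArgumentException("unknown variant");
        }
    }

    public static void main(String[] args) {
        double count = 5, docLength = 100;
        for (TfVariant v : TfVariant.values())
            System.out.println(v + " -> " + tf(v, count, docLength));
    }
}
```

Each variant would then be multiplied by the existing idf term, so adding options would not require touching the document-frequency side of the transform.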
Thanks for SSpace! Luboš