fozziethebeat / S-Space

The S-Space repository, from the AIrhead-Research group
GNU General Public License v2.0

TFIDF calculated not correctly? #41

Open lubomirkrcmar opened 11 years ago

lubomirkrcmar commented 11 years ago

Hi SSpace team,

I believe TF-IDF is not calculated correctly. Am I right?

Why is the value in the following code divided by docTermCount[column]? I think it should be just tf = value, since tf stands for the term's frequency in a given document, not for the term's probability in the document (which is what the division yields). At least, the Wikipedia article referred to in the code says so.

class TfIdfTransform in edu.ucla.sspace.matrix:

    public double transform(int row, int column, double value) {
        double tf = value / docTermCount[column];
        double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
        return tf * idf;
    }

The same applies to the following method: public double transform(int row, DoubleVector column) { ... }
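Written out as a formula, the quoted method appears to compute the following. This is only a reading of the variable names, not something confirmed by the code's documentation: it assumes docTermCount[column] is the total number of term occurrences in document d, termDocCount[row] is the number n_t of documents containing term t, and totalDocCount is the number N of documents.

```latex
\mathrm{tfidf}(t, d) \;=\; \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \cdot \log\frac{N}{n_t + 1}
```

Under the same assumptions, the variant with tf = value would simply drop the denominator, giving f_{t,d} \cdot \log(N / (n_t + 1)).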

Cheers, Luboš

davidjurgens commented 11 years ago

Hi Luboš,

You bring up a good point. Our implementation of TF-IDF uses the term's probability in the document, rather than its frequency. Using the probability discounts the impact of differently sized documents, where the frequencies for a single term may differ significantly.
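A minimal sketch of that effect, for illustration only: this is not S-Space code, the class name TfNormalizationDemo is hypothetical, and the counts are made up. It compares the raw count with the length-normalized value for a term that accounts for the same share of a short and a long document.

```java
public class TfNormalizationDemo {

    /** Raw term frequency: the occurrence count itself. */
    public static double rawTf(double count) {
        return count;
    }

    /** Length-normalized term frequency: count divided by the document's total term count. */
    public static double normalizedTf(double count, double docLength) {
        return count / docLength;
    }

    public static void main(String[] args) {
        // Hypothetical counts: the term makes up 5% of each document.
        double shortCount = 5, shortLength = 100;
        double longCount = 50, longLength = 1000;

        // The raw counts differ by a factor of ten between the two documents.
        System.out.println("raw:        " + rawTf(shortCount) + " vs " + rawTf(longCount));

        // After dividing by document length, both documents give the same share.
        System.out.println("normalized: " + normalizedTf(shortCount, shortLength)
                + " vs " + normalizedTf(longCount, longLength));
    }
}
```

With raw counts the long document dominates purely because of its length; after normalization both documents weight the term equally.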

There are several ways to adjust the TF value (the Wikipedia page mentions a few others, as well), but I don't think our docs mention anywhere which one we're using. It would be pretty helpful to be able to adjust the TF transform as well. I don't think this is too much work, so if you want, I'm happy to try extending the code with a few options.

Thanks, David

lubomirkrcmar commented 11 years ago

Hi David,

I understand, thanks. I am doing some experiments in Information Retrieval and yes, I would welcome more variants of the TF-IDF transform. Also, it would be nice to update the comment in the currently implemented variant to state that tf is normalized.

In Introduction to Information Retrieval (2008), Manning writes, besides "term frequency" and "document frequency" weighting, about "normalization" weighting: I believe 1/u (pivoted normalization) corresponds to what you use in your TF-IDF implementation in S-Space.

Currently, in your project, expressed as (term frequency weighting, document frequency weighting, normalization weighting), the NoTransform class corresponds to (tf, no, noNorm), TfIdfTransform to (tf, idf, pivotNorm), and LogEntropyTransform to (tf, entropy, noNorm). There are also some other classes that I do not know well yet, such as PointwiseMutualInformationTransform and LogLikelihoodTransform.

I believe TfIdfNoNormTransform and LogIdfNoNormTransform might be interesting candidates for implementation. Square-root weighting of the term frequency might also be interesting (the Lucene library uses sqrt).
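As a rough sketch of what such unnormalized variants could look like (none of these names exist in S-Space; TfWeightingSketch, TfWeighting, idf and weight are hypothetical, and the log variant uses the common 1 + log(count) form as an assumption), the tf part can be made selectable while the idf part keeps the form of the code quoted above:

```java
import java.util.function.DoubleUnaryOperator;

public class TfWeightingSketch {

    /** Hypothetical term-frequency weightings, all without length normalization. */
    public enum TfWeighting {
        RAW(count -> count),                                   // tf = count
        LOG(count -> count > 0 ? 1 + Math.log(count) : 0),     // tf = 1 + ln(count), 0 for absent terms
        SQRT(Math::sqrt);                                      // tf = sqrt(count)

        private final DoubleUnaryOperator fn;

        TfWeighting(DoubleUnaryOperator fn) {
            this.fn = fn;
        }

        public double apply(double count) {
            return fn.applyAsDouble(count);
        }
    }

    /** idf in the same form as the quoted code; doubles avoid any integer division. */
    public static double idf(double totalDocCount, double termDocCount) {
        return Math.log(totalDocCount / (termDocCount + 1));
    }

    /** Combined weight: selected tf variant times idf. */
    public static double weight(TfWeighting tf, double count,
                                double totalDocCount, double termDocCount) {
        return tf.apply(count) * idf(totalDocCount, termDocCount);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a term occurring 9 times, found in 9 of 100 documents.
        for (TfWeighting tf : TfWeighting.values()) {
            System.out.println(tf + ": " + weight(tf, 9, 100, 9));
        }
    }
}
```

Keeping the tf choice separate from the idf computation means a new variant only needs one more enum constant; whether that fits the existing Transform classes is left open here.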

Thanks for SSpace! Luboš