Potential bug in implementation of tfidf

cole-trapnell-lab / garnett

Automated cell type classification

MIT License

107 stars 26 forks source link

Potential bug in implementation of tfidf #31

Closed VPetukhov closed 5 years ago

VPetukhov commented 5 years ago

In the tfidf function you use log(1 + ncol(ncounts) / Matrix::rowSums(ncounts)) as the idf term. Though in classic idf you need to normalize on number of documents, but not total counts over documents: log(1 + ncol(ncounts > 0) / Matrix::rowSums(ncounts)). Does it have internal meaning or is it just a bug?

In the second case, the whole function can be optimized with 2-fold speed up (using "inverse document frequency smooth" method):

tfidf <- function(input_cds) {
  nfreqs <- exprs(input_cds)
  nfreqs@x <- nfreqs@x / rep.int(Matrix::colSums(nfreqs), diff(nfreqs@p))
  nfreqs <- Matrix::t(nfreqs)
  nfreqs@x <- nfreqs@x * rep(log(1 + nrow(nfreqs) / 
      (Matrix::colSums(nfreqs > 0) + 1)), diff(nfreqs@p))
  return(nfreqs)
}

Just in case, the logic of the function above exactly corresponds to the following call of tfidf from quanteda package (only works faster):

quanteda::as.dfm(exprs(input_cds))
quanteda::dfm_tfidf(dfq, scheme_tf="prop", smoothing=1, base=exp(1))

hpliner commented 5 years ago

Because the ncounts object is derived like this:

ncounts <- exprs(input_cds)
ncounts <- ncounts[Matrix::rowSums(ncounts) != 0,]

I think the implementation does count the number of documents... and the ncol should be counting, not summing counts. Does that make sense?

On your pull request and other issues, I'm hoping to get to them early next week.

VPetukhov commented 5 years ago

Yes, ncol is counting, but rowSums(cm) gives you sum count over all documents, but not number of documents containing this word.

hpliner commented 5 years ago

It seems like it's the same... unless I'm missing something

> tfidf_current <- function(input_cds) {
+   ncounts <- counts(input_cds)
+   ncounts <- ncounts[Matrix::rowSums(ncounts) != 0,]
+   nfreqs <- ncounts
+   nfreqs@x <- ncounts@x / rep.int(Matrix::colSums(ncounts), diff(ncounts@p))
+   tf_idf_counts <- nfreqs * log(1 + ncol(ncounts) / Matrix::rowSums(ncounts))
+   Matrix::t(tf_idf_counts)
+ }
> tfidf_prop <- function(input_cds) {
+   ncounts <- counts(input_cds)
+   ncounts <- ncounts[Matrix::rowSums(ncounts) != 0,]
+   nfreqs <- ncounts
+   nfreqs@x <- ncounts@x / rep.int(Matrix::colSums(ncounts), diff(ncounts@p))
+   tf_idf_counts <- nfreqs * log(1 + ncol(ncounts > 0) / Matrix::rowSums(ncounts))
+   Matrix::t(tf_idf_counts)
+ }
> x <- tfidf_current(test_cds)
> y <- tfidf_prop(test_cds)
> identical(x, y)
[1] TRUE

VPetukhov commented 5 years ago

It's not

tf_idf_counts <- nfreqs * log(1 + ncol(ncounts > 0) / Matrix::rowSums(ncounts))

but

tf_idf_counts <- nfreqs * log(1 + ncol(ncounts) / Matrix::rowSums(ncounts > 0))

hpliner commented 5 years ago

Ah, I see what you're saying. Yes, that's a bug - a relic of using binary matrices when it's the same. Fixed now