Closed VPetukhov closed 5 years ago
Because the ncounts object is derived like this:
ncounts <- exprs(input_cds)
ncounts <- ncounts[Matrix::rowSums(ncounts) != 0,]
I think the implementation does count the number of documents... and the ncol should be counting, not summing counts. Does that make sense?
On your pull request and other issues, I'm hoping to get to them early next week.
Yes, ncol is counting, but rowSums(cm) gives you the sum of counts over all documents, not the number of documents containing this word.
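To make the distinction concrete, here is a small sketch on a toy count matrix. The matrix cm and its values are made up for illustration (rows are words, columns are documents); only the Matrix package is assumed.

```r
library(Matrix)

# Hypothetical toy count matrix: 3 words (rows) x 4 documents (columns).
cm <- Matrix(c(5, 0, 0, 0,
               1, 1, 1, 1,
               0, 2, 0, 3),
             nrow = 3, byrow = TRUE, sparse = TRUE,
             dimnames = list(c("w1", "w2", "w3"), paste0("d", 1:4)))

# Total count of each word, summed over all documents.
total_counts <- Matrix::rowSums(cm)

# Number of documents in which each word appears at least once.
doc_freq <- Matrix::rowSums(cm > 0)
```

Here w1 and w3 have the same total count (5), but w1 occurs in one document and w3 in two, so only the second quantity tells them apart in the idf term.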
It seems like it's the same... unless I'm missing something
> tfidf_current <- function(input_cds) {
+ ncounts <- counts(input_cds)
+ ncounts <- ncounts[Matrix::rowSums(ncounts) != 0,]
+ nfreqs <- ncounts
+ nfreqs@x <- ncounts@x / rep.int(Matrix::colSums(ncounts), diff(ncounts@p))
+ tf_idf_counts <- nfreqs * log(1 + ncol(ncounts) / Matrix::rowSums(ncounts))
+ Matrix::t(tf_idf_counts)
+ }
> tfidf_prop <- function(input_cds) {
+ ncounts <- counts(input_cds)
+ ncounts <- ncounts[Matrix::rowSums(ncounts) != 0,]
+ nfreqs <- ncounts
+ nfreqs@x <- ncounts@x / rep.int(Matrix::colSums(ncounts), diff(ncounts@p))
+ tf_idf_counts <- nfreqs * log(1 + ncol(ncounts > 0) / Matrix::rowSums(ncounts))
+ Matrix::t(tf_idf_counts)
+ }
> x <- tfidf_current(test_cds)
> y <- tfidf_prop(test_cds)
> identical(x, y)
[1] TRUE
It's not
tf_idf_counts <- nfreqs * log(1 + ncol(ncounts > 0) / Matrix::rowSums(ncounts))
but
tf_idf_counts <- nfreqs * log(1 + ncol(ncounts) / Matrix::rowSums(ncounts > 0))
Ah, I see what you're saying. Yes, that's a bug - a relic of using binary matrices, where the two are the same. Fixed now.
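For anyone reading later, a minimal sketch of the corrected computation, assuming a sparse features-by-cells count matrix (dgCMatrix) passed in directly. The function name tfidf_docfreq is hypothetical and this is not necessarily the code that was committed.

```r
library(Matrix)

# Hypothetical helper (not the package's actual code): TF-IDF on a sparse
# features-by-cells count matrix, with document frequency in the IDF term.
tfidf_docfreq <- function(ncounts) {
  # Drop features that are zero in every cell.
  ncounts <- ncounts[Matrix::rowSums(ncounts) != 0, , drop = FALSE]

  # Term frequency: divide each nonzero entry by its column total.
  nfreqs <- ncounts
  nfreqs@x <- ncounts@x / rep.int(Matrix::colSums(ncounts), diff(ncounts@p))

  # IDF: number of cells over the number of cells containing the feature.
  idf <- log(1 + ncol(ncounts) / Matrix::rowSums(ncounts > 0))

  # The vector recycles down columns, so row i is scaled by idf[i].
  Matrix::t(nfreqs * idf)
}
```

On a binary matrix rowSums(ncounts) and rowSums(ncounts > 0) coincide, which is why the old version gave the same result there.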
In the tfidf function you use log(1 + ncol(ncounts) / Matrix::rowSums(ncounts)) as the idf term. In classic idf, though, you normalize by the number of documents, not by total counts over documents: log(1 + ncol(ncounts > 0) / Matrix::rowSums(ncounts)). Does this have some intended meaning, or is it just a bug?
In the second case, the whole function can be optimized with a 2-fold speedup (using the "inverse document frequency smooth" method):
Just in case, the logic of the function above exactly corresponds to the following call of tfidf from the quanteda package (only works faster):