Open roland-KA opened 5 months ago
I've tried to extract the "pure" SparseMatrix from X1 using X3 = X1[1:end, 1:end]. But this takes almost 364 sec. Is there a faster way to get it?
The adjoint is just the conjugate-transpose as a view. So applying it twice returns the original, unwrapped matrix (in this case sparse). So, what if you take the adjoint of X1 (adjoint(X1) or X1') and access it with row and column indices reversed?
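A minimal sketch of that suggestion, using a tiny made-up matrix in place of the real data (the name P and its contents are assumptions for illustration only):

```julia
using SparseArrays, LinearAlgebra

# Small stand-in for the real data; P and its entries are hypothetical.
P = sparse([1, 2, 3, 1], [1, 1, 2, 3], [4, 5, 6, 7], 3, 4)  # 3×4 SparseMatrixCSC
X1 = P'                  # lazy Adjoint wrapper, 4×3

Xt = X1'                 # adjoint of an adjoint unwraps to the parent
same_object = Xt === P   # the very same object, so no data was copied

# Element (i, j) of X1 is element (j, i) of Xt
# (the entries are real, so conjugation changes nothing).
i, j = 3, 1
matches = X1[i, j] == Xt[j, i]
```

Any loop written against X1[i, j] would then index Xt[j, i] instead, ideally with the column index of Xt in the outer loop, since SparseMatrixCSC stores its entries column by column.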
Ah thanks, that's a good idea! I've tried it and it is indeed faster. Instead of 245 sec., word_count now finishes within 43 sec. But it is still slower than using the SparseMatrix produced by TextAnalysis (16.7 sec).
What is the reason for producing an adjoint in CountTransformer? Was the CountTransformer easier to implement that way, or are there any applications which prefer this structure for further processing?
The reason for the adjoint is that it is lazy, and we need to observe MLJ's convention that observations are rows. Given that adjoint is lazy, I admit to being puzzled as to why you're still seeing such a slowdown, and I agree it would be good to understand why.
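To illustrate what "lazy" means here, a small sketch. It assumes, purely for illustration, that the counts are first assembled with terms as rows and documents as columns; the matrix below is made up and is not CountTransformer's actual code:

```julia
using SparseArrays, LinearAlgebra

# Hypothetical 3-term × 2-document count matrix (terms are rows, documents are columns).
counts = sparse([1, 3, 2], [1, 1, 2], [2, 1, 4], 3, 2)

X = counts'              # 2×3 wrapper: each row is now one document (observation)

counts[1, 1] = 9         # changing the underlying matrix...
seen = X[1, 1]           # ...is visible through the wrapper, since nothing was copied
```

The wrapper only swaps how indices are interpreted; the stored entries stay in the column-oriented layout of the parent matrix.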
Well, being lazy is perhaps a big part of the explanation. word_count accesses all elements of the matrix. So this is the worst case for lazy evaluation.
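One way to avoid per-element indexing through the wrapper might be to walk only the stored entries of the underlying matrix. The thread does not show word_count, so the function below (column_totals, a hypothetical name) is only a stand-in that sums the counts per column of X1; it uses rowvals, nonzeros and nzrange from SparseArrays:

```julia
using SparseArrays, LinearAlgebra

# Hypothetical stand-in for word_count: total count per column of X1.
function column_totals(X1::Adjoint{<:Real,<:SparseMatrixCSC})
    P = parent(X1)                 # underlying CSC matrix, size(P) == reverse(size(X1))
    rows = rowvals(P)              # row index in P == column index in X1
    vals = nonzeros(P)             # stored values, column by column
    totals = zeros(eltype(P), size(X1, 2))
    for col in 1:size(P, 2)        # column in P == row (observation) in X1
        for k in nzrange(P, col)   # visit only the stored entries of this column
            totals[rows[k]] += vals[k]
        end
    end
    return totals
end

# Tiny made-up example
P = sparse([1, 2, 3, 1], [1, 1, 2, 3], [4, 5, 6, 7], 3, 4)
X1 = P'
totals = column_totals(X1)
```

The work here is proportional to the number of stored entries rather than to the full 33716 × 159093 index space. Whether this explains the timings reported above is not established in the thread.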
I've used the CountTransformer to produce a word frequency matrix as follows:

Then a function word_count has been applied to X1 (it aggregates the numbers in X1 for doing Naive Bayes; i.e. each element of X1 is accessed once).

This takes about 245 seconds (on an M1 iMac); the size of X1 is (33716, 159093).

If I produce the word frequency matrix using TextAnalysis directly as follows:

... then word_count runs in about 16.7 sec on matrix X2. So accessing the elements of X1 is almost 15 times slower than accessing those of X2.

The difference between the two is that X2 is a "pure" SparseMatrix whereas X1 is of type LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}. I didn't find any information on how this data structure is represented in Julia.

Therefore I have a few questions:
- Is there a way to access the elements of X1 faster (or rather: why is that so slow)?
- I've tried to extract the "pure" SparseMatrix from X1 using X3 = X1[1:end, 1:end]. But this takes almost 364 sec. Is there a faster way to get it?

With these findings, it is of course not advisable to use CountTransformer for this purpose ... or did I miss something?
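On the question of extracting a plain sparse matrix: rather than indexing with X1[1:end, 1:end], a sketch of an alternative is to materialize the wrapper with copy. The small matrix is made up for illustration, and no timing is claimed for the real data:

```julia
using SparseArrays, LinearAlgebra

# Tiny made-up stand-in for the real data
P = sparse([1, 2, 3, 1], [1, 1, 2, 3], [4, 5, 6, 7], 3, 4)
X1 = P'                       # Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}

X3 = copy(X1)                 # materializes the adjoint as a plain SparseMatrixCSC
same_values = X3 == X1        # same entries, now stored in X3's own CSC layout
```

If a copy is not needed at all, parent(X1) returns the underlying SparseMatrixCSC directly, with rows and columns swapped relative to X1.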