JuliaAI / MLJText.jl

An MLJ extension for accessing models and tools related to text analysis
MIT License

Accessing elements in data structure produced by `CountTransformer` is quite slow #29

Open roland-KA opened 5 months ago

roland-KA commented 5 months ago

I've used the CountTransformer to produce a word frequency matrix as follows:

```julia
CountTransformer = @load CountTransformer pkg=MLJText
trans_machine = machine(CountTransformer(), doc_list)
fit!(trans_machine)
X1 = transform(trans_machine, doc_list)
```

Then a function `word_count` was applied to `X1` (it aggregates the counts in `X1` for Naive Bayes; i.e. each element of `X1` is accessed exactly once).
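For reference, a minimal sketch of what such an aggregation might look like (the actual `word_count` is not shown in this issue; the class-label argument and loop structure here are assumptions):

```julia
# Hypothetical sketch: sum the count of each term per class, assuming `y`
# is a vector of class labels aligned with the rows (documents) of X.
function word_count(X, y, classes)
    counts = Dict(c => zeros(Int, size(X, 2)) for c in classes)
    for i in axes(X, 1), j in axes(X, 2)   # every element accessed once
        counts[y[i]][j] += X[i, j]
    end
    return counts
end
```

The point is only that every element of the matrix is read via scalar indexing, so indexing overhead dominates the runtime.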

This takes about 245 seconds (on a M1 iMac); the size of X1 is (33716, 159093).

If I produce the word frequency matrix using TextAnalysis directly as follows:

```julia
crps = Corpus(TokenDocument.(doc_list))
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
X2 = dtm(m)
```

... then `word_count` runs in about 16.7 sec on matrix X2. So accessing the elements of X1 is almost 15 times slower than accessing those of X2.

The difference between the two is that X2 is a "pure" `SparseMatrixCSC`, whereas X1 is of type `LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}`. I couldn't find any information on how this data structure is represented in Julia.

Therefore I have a few questions:

With these findings, it is of course not advisable to use `CountTransformer` for this purpose ... or did I miss something?

ablaom commented 4 months ago

> I've tried to extract the "pure" SparseMatrix from X1 using `X3 = X1[1:end, 1:end]`. But this takes almost 364 sec. Is there a faster way to get it?

The adjoint is just the conjugate-transpose, as a view. So applying it twice returns the original, unwrapped matrix (in this case sparse). So, what if you take the adjoint of X1 (`adjoint(X1)` or `X1'`) and access it with row and column indices reversed?
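To illustrate with a toy matrix (a minimal sketch, not the actual X1; `parent` is a Base function that returns the wrapped array directly):

```julia
using LinearAlgebra, SparseArrays

S = sparse([1, 2, 3], [1, 2, 3], [10, 20, 30])  # a plain SparseMatrixCSC
A = S'                                          # lazy Adjoint wrapper, no data copied

A'        === S        # adjoint of the adjoint is the original sparse matrix
parent(A) === S        # `parent` also recovers the wrapped matrix directly
A[2, 3]   == S[3, 2]   # element (i, j) of A is element (j, i) of S
```

So instead of materializing a copy via `X1[1:end, 1:end]`, one can work with `parent(X1)` (equivalently `X1'`) and swap the index order.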

roland-KA commented 4 months ago

Ah thanks, that's a good idea! I've tried it and it is indeed faster: instead of 245 sec, `word_count` now finishes within 43 sec. But it is still slower than using the SparseMatrix produced by TextAnalysis (16.7 sec).

What is the reason for producing an adjoint in `CountTransformer`? Was it easier to implement that way, or are there applications that prefer this structure for further processing?

ablaom commented 3 months ago

The reason for the adjoint is that it is lazy (no copy is made), while still observing MLJ's convention that observations are rows. Given that the adjoint is lazy, I admit to being puzzled as to why you're still seeing such a slowdown, and agree it would be good to understand why.

roland-KA commented 3 months ago

Well, the laziness is perhaps a big part of the explanation: `word_count` accesses every element of the matrix via scalar indexing, which is the worst case for a lazy view, since each access has to go through the wrapper.
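If the aggregation only needs the stored entries, one way to sidestep scalar indexing on the wrapper entirely (a sketch, not the thread's actual `word_count`) is to extract the nonzero structure of the parent matrix with `findnz`:

```julia
using LinearAlgebra, SparseArrays

# Toy stand-in for the CountTransformer output: docs × terms as a lazy adjoint.
S  = sparse([1, 1, 2, 3], [2, 3, 1, 3], [4, 1, 2, 5])  # terms × docs
X1 = S'                                                # docs × terms view

# Aggregate the total count per term by touching only the stored entries
# of the wrapped CSC matrix (rows of `parent(X1)` index terms here):
rows, _, vals = findnz(parent(X1))
term_totals = zeros(Int, size(X1, 2))
for (r, v) in zip(rows, vals)
    term_totals[r] += v
end
term_totals  # == vec(sum(X1, dims=1))
```

For a sparse word-frequency matrix this visits only the nonzeros, so the cost no longer scales with the full `33716 × 159093` element count.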