SLU-TMI / TextMining.jl

Using available Julia ML packages #90

Open wildart opened 9 years ago

wildart commented 9 years ago

Why not use the available Julia ML packages?

mtabor150 commented 9 years ago

These all work with matrices. Since we are working in a space with millions of distinct words (dimensions) but each text may contain only a couple thousand of them, a matrix representation would be extremely sparse. To mitigate this, we are building tools to run these algorithms on a FeatureVector, which represents an effectively infinite-dimensional vector using a hashtable. If a FeatureVector has a non-zero value in a dimension, we keep its key->value pair; for any key not kept in the hash we return 0.
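
For illustration, the idea boils down to something like the following sketch (Julia 0.3-era syntax to match the rest of the thread; SketchFeatureVector and its field are hypothetical names, not the package's actual implementation):

type SketchFeatureVector{K,V<:Number}
    counts::Dict{K,V}   # only the non-zero dimensions are stored
end

# Any key absent from the hash is an implicit zero dimension.
Base.getindex{K,V}(fv::SketchFeatureVector{K,V}, k::K) = get(fv.counts, k, zero(V))

fv = SketchFeatureVector(["thou" => 154, "thy" => 145])
fv["thou"]   # 154
fv["whale"]  # 0: an unseen word costs no storage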

wildart commented 9 years ago

Using dictionaries guarantees poor performance and high memory consumption. You could use sparse matrices to capture word frequencies; look at how TextAnalysis.jl does it. An internal matrix representation lets you use a variety of numerical algorithms for text analysis, e.g. SVD for LSA. Be careful: poor and unconventional design choices could create problems down the road. BTW, do not hesitate to ask questions on the julia-users group. There are plenty of people there who could give you a good hint, point you in the right direction, and even help with development and testing.
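
As a rough sketch of the SVD-for-LSA point (toy counts and hypothetical variable names, not TextAnalysis.jl's actual API):

A = sparse([1,2,2,3], [1,1,2,2], [3.0,1.0,2.0,4.0])  # terms x documents, toy counts
U, S, V = svd(full(A))                   # example is small enough to densify
k = 2                                    # number of latent dimensions to keep
doc_topics = diagm(S[1:k]) * V[:, 1:k]'  # documents projected into LSA space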

mtabor150 commented 9 years ago

TextAnalysis.jl seems to be using dictionaries as well.

type DocumentTermMatrix
    dtm::SparseMatrixCSC{Int, Int}
    terms::Vector{UTF8String}
    column_indices::Dict{UTF8String, Int}
end

Base.getindex(dtm::DocumentTermMatrix, k::String) = dtm.dtm[:, dtm.column_indices[k]]

What is the benefit of storing values in a sparse matrix and using a dictionary to store indices? This seems to just add an extra level of indirection. Also, the way TextAnalysis.jl is built, it appears you must have the whole corpus before performing any analysis, which our approach avoids. It also seems that most packages do not support sparse vectors.

wildart commented 9 years ago

Currently, there are many issues with support for sparse matrices. Until they are resolved, it is quite problematic to work with text datasets. Nevertheless, relying on dictionaries is not a good choice for performance reasons. Distance calculations can be far more efficient with a sparse matrix representation than with a dictionary. Many machine learning algorithms use linear algebra under the hood, which means you'll be forced to convert your FeatureVectors into matrix form anyway. Why not use it from the beginning (and get the performance boost as well)?

As for DocumentTermMatrix, the dictionary is used only for addressing, not for storing feature values. It is a more concise representation, especially for a very large dataset, which any text corpus is.
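
In other words, the Dict maps a term to its column index, while every count stays in the sparse matrix; something like this sketch (toy data and hypothetical names):

dtm = sparse([1,1,2], [1,2,1], [3,2,1])            # documents x terms, toy counts
column_indices = ["whale" => 1, "ship" => 2]       # term -> column index; no counts stored here
term_counts(term) = dtm[:, column_indices[term]]   # one hash lookup, then a sparse column slice
term_counts("whale")                               # counts of "whale" across all documents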

mtabor150 commented 9 years ago

Still, this doesn't address the issue of unseen text. If we keep a sparse vector for each text and want to perform an analysis on a text containing previously unseen words, we would (seemingly) have to modify all existing vectors to make their dimensions match.

julia> x = [0,1,3,0]
4-element Array{Int64,1}:
 0
 1
 3
 0

julia> y = [1,0,0,3,7]
5-element Array{Int64,1}:
 1
 0
 0
 3
 7

julia> z = x+y
ERROR: dimensions must match
 in + at array.jl:719

julia> x = sparse(x)
4x1 sparse matrix with 2 Int64 entries:
        [2, 1]  =  1
        [3, 1]  =  3

julia> y = sparse(y)
5x1 sparse matrix with 3 Int64 entries:
        [1, 1]  =  1
        [4, 1]  =  3
        [5, 1]  =  7

julia> z = x+y
ERROR: DimensionMismatch("")
 in + at sparse/sparsematrix.jl:542

Since we store only the non-zero dimensions of (conceptually) infinite-dimensional vectors, we avoid this issue.

julia> x = FeatureVector([2 => 1, 3 => 3])
FeatureVector{Int64,Int64} with 2 features:
    (3,3)
    (2,1)

julia> y = FeatureVector([1 => 1, 4 => 3, 5 => 7])
FeatureVector{Int64,Int64} with 3 features:
    (5,7)
    (4,3)
    (1,1)

julia> z = x+y
FeatureVector{Int64,Int64} with 5 features:
    (5,7)
    (4,3)
    (3,3)
    (2,1)
    (1,1)

For us, the question then becomes: is it better to rebuild tens of thousands of sparse vectors whenever we encounter new vocabulary? Instead, maybe we should be looking at a more efficient associative structure such as a Judy array or a trie?

wildart commented 9 years ago

Well, you could allocate very large sparse vectors in advance, which costs next to nothing.

julia> @time x = sparsevec([2,3], [1,3],1000000000)
elapsed time: 4.9062e-5 seconds (944 bytes allocated)
1000000000x1 sparse matrix with 2 Int64 entries:
    [2         ,          1]  =  1
    [3         ,          1]  =  3
julia> y = sparsevec([1,4,5],[1,3,7],1000000000)
1000000000x1 sparse matrix with 3 Int64 entries:
    [1         ,          1]  =  1
    [4         ,          1]  =  3
    [5         ,          1]  =  7
julia> x+y
1000000000x1 sparse matrix with 5 Int64 entries:
    [1         ,          1]  =  1
    [2         ,          1]  =  1
    [3         ,          1]  =  3
    [4         ,          1]  =  3
    [5         ,          1]  =  7

Or resize the vector on the fly, even though it is a hack.

julia> x = sparsevec([2,3], [1,3])
3x1 sparse matrix with 2 Int64 entries:
    [2, 1]  =  1
    [3, 1]  =  3
julia> y = sparsevec([1,4,5],[1,3,7])
5x1 sparse matrix with 3 Int64 entries:
    [1, 1]  =  1
    [4, 1]  =  3
    [5, 1]  =  7
julia> x.m = y.m
5
julia> x+y
5x1 sparse matrix with 5 Int64 entries:
    [1, 1]  =  1
    [2, 1]  =  1
    [3, 1]  =  3
    [4, 1]  =  3
    [5, 1]  =  7

mtabor150 commented 9 years ago

Until the functionality of sparse matrices is improved, I don't think using them is feasible for such large datasets.

julia> x
1000000000x1 sparse matrix with 2 Int64 entries:
        [2         ,          1]  =  1
        [3         ,          1]  =  3

julia> y
1000000000x1 sparse matrix with 3 Int64 entries:
        [1         ,          1]  =  1
        [4         ,          1]  =  3
        [5         ,          1]  =  7

julia> euclidean(x,y)
ERROR: `euclidean` has no method matching euclidean(::SparseMatrixCSC{Int64,Int64}, ::SparseMatrixCSC{Int64,Int64})

julia> euclidean(full(x),full(y))
ERROR: MemoryError()
 in full at sparse/sparsematrix.jl:173

julia> norm(x-y)
ERROR: MemoryError()
 in full at sparse/sparsematrix.jl:173
 in norm at linalg/sparse.jl:487
 in norm at linalg/sparse.jl:482

julia> z = x-y
1000000000x1 sparse matrix with 5 Int64 entries:
        [1         ,          1]  =  -1
        [2         ,          1]  =  1
        [3         ,          1]  =  3
        [4         ,          1]  =  -3
        [5         ,          1]  =  -7

julia> norm(z)
ERROR: MemoryError()
 in full at sparse/sparsematrix.jl:173
 in norm at linalg/sparse.jl:487
 in norm at linalg/sparse.jl:482

mtabor150 commented 9 years ago

Even if we use a length much closer to our dataset's actual vocabulary size, the conversion from sparse to dense matrices is prohibitive:

julia> @time fv = load_featurevector("./Machine\ Learning/eebo/XML_with_headers_attached/headed-xml-200111/A00648.headed.xml")
elapsed time: 0.595232567 seconds (14139436 bytes allocated)
FeatureVector{Any,Number} with 1677 features:
  Top 5 Features:
    ("thou",154)
    ("i",152)
    ("thy",145)
    ("the",137)
    ("to",136)

julia> @time fv.length
elapsed time: 4.987e-6 seconds (80 bytes allocated)
464.3242401598276

julia> @time x = sparse(collect(values(fv)))
elapsed time: 0.149458042 seconds (2743704 bytes allocated)
1677x1 sparse matrix with 1677 Number entries:
        [1   ,    1]  =  1
        [2   ,    1]  =  1
        [3   ,    1]  =  2
        [4   ,    1]  =  1
        [5   ,    1]  =  2
        [6   ,    1]  =  1
        [7   ,    1]  =  1
        [8   ,    1]  =  1
        [9   ,    1]  =  1
        [10  ,    1]  =  3
        [11  ,    1]  =  1
        [12  ,    1]  =  1
        [13  ,    1]  =  1
        [14  ,    1]  =  1
        [15  ,    1]  =  1
        [16  ,    1]  =  1
        [17  ,    1]  =  1
        [18  ,    1]  =  1
        [19  ,    1]  =  1
        [20  ,    1]  =  1
        ⋮
        [1657,    1]  =  1
        [1658,    1]  =  1
        [1659,    1]  =  3
        [1660,    1]  =  2
        [1661,    1]  =  1
        [1662,    1]  =  2
        [1663,    1]  =  1
        [1664,    1]  =  3
        [1665,    1]  =  1
        [1666,    1]  =  2
        [1667,    1]  =  1
        [1668,    1]  =  1
        [1669,    1]  =  1
        [1670,    1]  =  2
        [1671,    1]  =  1
        [1672,    1]  =  1
        [1673,    1]  =  1
        [1674,    1]  =  2
        [1675,    1]  =  1
        [1676,    1]  =  21
        [1677,    1]  =  1

julia> @time x.m = 10000000
elapsed time: 7.942e-6 seconds (96 bytes allocated)
10000000

julia> @time norm(x)
elapsed time: 9.08191248 seconds (720000208 bytes allocated, 9.42% gc time)
464.3242401598269

wildart commented 9 years ago

That is not a correct comparison: you create a sparse matrix of the same dimensionality as the dense vector it was built from, which completely defeats the purpose of using a sparse matrix; at that point, just use a dense one. What I was trying to say is: look at DocumentTermMatrix. If your domain is not limited to a single document, then a collection of documents is easy to represent as a sparse matrix where the columns are documents and the rows are words from all documents (the word domain of the whole corpus). In your case this corresponds to a Cluster structure, not a FeatureVector. As soon as all your documents are bundled into one matrix, you gain all the speedups when computing distances between documents: it basically boils down to sparse matrix multiplication, which is more efficient than multiple loops and dictionary lookups.
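
To make that concrete, here is a hedged sketch (toy counts, Julia 0.3-era syntax to match the rest of the thread) of how all pairwise Euclidean distances between documents reduce to one sparse matrix product, using ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'y:

D = sparse([1,2,1,3], [1,1,2,2], [3.0,1.0,2.0,4.0])  # words x documents, toy corpus
G = full(D' * D)             # Gram matrix: every pairwise document dot product
sq = diag(G)                 # squared norm of each document
dist2 = sq .+ sq' .- 2G      # squared distance between every pair of documents
dists = sqrt(max(dist2, 0))  # clamp round-off negatives, then take sqrt

The per-pair dictionary lookups disappear entirely; the bulk of the work is the single sparse product D' * D.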