JuliaStats / Clustering.jl

A Julia package for data clustering
Other
353 stars 117 forks source link

Sparse matrix support in hierarchical clustering #173

Open hecking opened 5 years ago

hecking commented 5 years ago

I ran into an Out of memory error when applying single linkage hierarchical clustering on a large sparse distance matrix.

I figured out that the problem is in the first line of the hclust_minimum function

function hclust_minimum(ds::AbstractMatrix{T}) where T<:Real
     d = Matrix(ds)

This creates a dense matrix from any kind of input distance matrix. If ds is a sparse matrix that still fits into the memory, this line can crash the procedure if the dense counterpart of df becomes to big for the working memory.

This issue could easily be solved using a SparseMatrixCSC matrix from the SparseArray module.

function hclust_minimum(ds::AbstractMatrix{T}) where T<:Real
     d = SparseMatrixCSC(ds)

I think for the other types of hierarchical clustering it will be similar.

alyst commented 5 years ago

hclust_minimum() method really uses the d in the "dense" way updating all elements of one row/col per iteration, so I'm not sure that

Actually, it's somewhat intriguing to have a sparse distance matrix. Does it mean that most of your datapoints are duplicates (zero distances)? Maybe for your clustering problem you can consider removing the duplicates first, and then running hclust() with the reduced distance matrix that only contains the distances between distinct points?

hecking commented 5 years ago

Regarding the updating of the distance matrix there is no difference between sparse or dense storage.

I agree that sparse distance matrices are rare but this can happen with non-metric distances. In my case I have to deal with very large matrices where distances are not known (needed) between many pairs of data points. In these cases the distance is set to 0. The distance has a negative value for more similar data points, e.g. for duplicates the distance would be -1.

alyst commented 5 years ago

Thanks for the clarifications. I think we definitely should not do a hard switch from Matrix to SparseMatrixCSC, because it will kill the performance for most of the use cases.

But you can try introducing the type parameter that defines which matrix format to use. Something like this:

function hclust_minimum(::Type{M}, ds::AbstractMatrix{T}) where {M<:AbstractMatrix, T<:Real}
     d = M(ds)
    ....
end

And sparse keyword argument to hclust() (hclust(..., sparse=false/true)) so that if sparse=true, it calls hclust_minimum(SparseMatrixCSC, ...) and hclust_minimum(Matrix, ...) otherwise. You can submit the PR introducing this change, but I personally consider this approach highly experimental.

hecking commented 5 years ago

Thank you for the suggestions. I think it can be a quite flexible solution.