JuliaGraphs / LightGraphsExtras.jl

Additional functionality for LightGraphs.jl
Other
21 stars 13 forks source link

Kmeans issue #18

Closed ollin18 closed 6 years ago

ollin18 commented 7 years ago

There is a problem using the Clustering.jl's kmeans algorithm for the NBM community detection because we have to cluster in k groups with k-1 vectors, and as you can see that is not possible.

I've been using the R kmeans functions with the RCall package, and works perfectly but adds a dependency.

function kmeans!{T<:AbstractFloat}(X::Matrix{T}, centers::Matrix{T};
                                   weights=nothing,
                                   maxiter::Integer=_kmeans_default_maxiter, 
                                   tol::Real=_kmeans_default_tol,
                                   display::Symbol=_kmeans_default_display)

    m, n = size(X)
    m2, k = size(centers)
    m == m2 || throw(DimensionMismatch("Inconsistent array dimensions."))
    (2 <= k < n) || error("k must have 2 <= k < n.") # <- THIS LINE

    assignments = zeros(Int, n)
    costs = zeros(T, n)
    counts = Array(Int, k)
    cweights = Array(Float64, k)

    _kmeans!(X, conv_weights(T, n, weights), centers, 
             assignments, costs, counts, cweights, 
             round(Int, maxiter), tol, display_level(display))
end
CarloLucibello commented 7 years ago

There is a problem using the Clustering.jl's kmeans algorithm for the NBM community detection because we have to cluster in k groups with k-1 vectors, and as you can see that is not possible.

kmeans in the community detection algorithm is used to cluster into k groups n vectors with dimensionality k-1, why this should be a problem? kmeans allows for arbitrary dimensionality of the vectors, and k-1 is the right value if there are k clear cut communities, since this information is embedded in the in the eigenvectors from 2 to k (ordered by the magnitude of the eigenvalue)

ollin18 commented 7 years ago

Ok, we want to cluster in k groups with k-1 eigenvectors. Each of these eigenvectors have either n (number of vertices), 2n, or m (number of edges) entries. Since the kmeans function has the line that I pointed out, we can't cluster in k groups with k-1 eigenvectors. I haven't studied your script yet but I tested it with the Zachary's network and I don't quite get the real communities, with neither the kmeans or the R_kmeans.

http://nbviewer.jupyter.org/github/ollin18/ML-ITAM/blob/master/Communities_NBM.ipynb

I have witten a couple of scripts for this particular method, and other nonbacktracking ones like Newman's flux matrix and the reluctant operator, they are not as elegant as yours but they give good partitions. They are all here: https://github.com/ollin18/All-InNet if you want to check them out (They are poorly written and some parts in spanish).

I think we can build something good together.

ollin18 commented 7 years ago

Also, the norm of some complex eigenvalues are greater than some of the real ones that we can use. Only the real eigenvalues outside the bulk (Wigner's semicircle law) are useful to find communities.

CarloLucibello commented 7 years ago

Ok, we want to cluster in k groups with k-1 eigenvectors. Each of these eigenvectors have either n (number of vertices), 2n, or m (number of edges) entries. Since the kmeans function has the line that I pointed out, we can't cluster in k groups with k-1 eigenvectors.

Ah ok, maybe I see the point, there is some confusion. I use k-1 eigenvectors of dimension n, but when i give them to kmeans i transpose the matrix, so that I cluster n vectors of size k-1, so we should be good there.

I see from the notebook that the non-bactracking methods perfoms very badly on the zachary net indeed, maybe there is no outside the bulk eigenvalue (except for the first one which is correlated to the degrees). More probably though there is some bug in the function, I'll check that

I just pushed here another spectral method based in the bethe hessian, that has some advantages over the non-bactraking, one of which is that relevant eigenvalues are immediately recognized as the negative ones. Could you try that on zachary net?

It would be nice to port all your methods here, please open a pr if you wish and have the time

ollin18 commented 7 years ago

I've tested the Bethe method and it clusters fine the Zachary's, have some troubles with the dolphins' and seems about right with the football one (gives 10 partitions, it's commonly divided into 10-12). Here's the nb:

http://nbviewer.jupyter.org/github/ollin18/ML-ITAM/blob/master/Bethe%20test.ipynb

Also, I found my first sketches of the NBM community detection algorithms. They're not well written but as they are all together I wanted to show them to you.

http://nbviewer.jupyter.org/github/ollin18/ML-ITAM/blob/master/Con%20deltas.ipynb

I'll put some work on the algorithms and open a pr anytime soon.

sbromberger commented 7 years ago

@ollin18 were you planning on opening a PR for this work? Should I close this issue?

ollin18 commented 7 years ago

Yes, I'm planning on opening a PR just need a couple of day to pulish the code

sbromberger commented 6 years ago

Closing this out since code is now in CommunityDetection.jl.