flashxio / FlashX

FlashX is a collection of big data analytics tools that perform data analytics in the form of graphs and matrices.
http://flashx.io/
Apache License 2.0
231 stars 53 forks source link

implement a (nearly) linear time clustering algorithm #58

Open jovo opened 9 years ago

jovo commented 9 years ago

some potential options:

-- Near linear time algorithm to detect community structures in large-scale networks -- louvain -- Multi-Threaded Modularity Based Graph Clustering using the Multilevel Paradigm -- A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning

@icoming has mentioned some other potentially worth implementing. let's populate a list of potentials, and see whether we want to implement any of them?

note that i could imagine something like the following working very well: (1) find k-core for some reasonable k (eg, remove all 1-degree & 2-degree nodes) (2) augment the adjacency matrix by adding something along the diagonal (3) random projection of resulting "adjacency" matrix (4) k-means on the result

might be competitive. note that (3)+(4) is a kind of randomized spectral clustering. i bet we could prove that (1)+(2)+(3)+(4) is optimal (and robust) with high probability under a (degree-corrected) SBM...

zheng-da commented 9 years ago

If we only remove 1-degree and 2-degree nodes, we probably can't remove many vertices from a graph. For example, there are 1,108,979,537 vertices that has degree smaller than 10 in the page graph. Even if we delete all vertices whose degree is smaller than 10, we can't speed up spectral clustering on the remaining vertices by much.

jovo commented 9 years ago

ok, so maybe that is a bad idea....i imagine some of the others are less bad....

On Wed, Dec 10, 2014 at 7:53 PM, Da Zheng notifications@github.com wrote:

If we only remove 1-degree and 2-degree nodes, we probably can't remove many vertices from a graph. For example, there are 1,108,979,537 vertices that has degree smaller than 10 in the page graph. Even if we delete all vertices whose degree is smaller than 10, we can't speed up spectral clustering on the remaining vertices by much.

— Reply to this email directly or view it on GitHub https://github.com/icoming/FlashGraph/issues/58#issuecomment-66552601.

it's our dream openconnecto.me, jovo.me, my calendar https://www.google.com/calendar/embed?src=joshuav%40gmail.com&ctz=America/New_York

zheng-da commented 9 years ago

According to "Accelerating Community Detection by Using K-core Subgraphs", the method you described is pretty effective, but I guess its effectiveness should vary from graphs to graphs. Nevertheless, we need to use much larger K. For a graph with smaller than 1M vertices, K=10 might be enough. For a graph with billions of vertices, we might need to use K=100. For example, 3,128,386,131 vertices in the page graph has degree smaller than 100. The question is how well the method can work when K is as large as 100.

zheng-da commented 9 years ago

I have assigned Louvain and Nerstrand (the third clustering algorithm) to @dmhembe1 . I hope we can have at least one of them. It might be very difficult to debug and verify the code for these two implementations. I'm not sure how useful the fourth clustering algorithm is. It always partitions a graph into two parts.

jovo commented 9 years ago

sounds good.

zheng-da commented 9 years ago

I was mistaken. The vertex degree is different from k-core. We can't tell how many vertices in 1-core, 2-core, etc from vertex degree. Let's do some experiments before reaching a conclusion.