darrenstrash / quick-cliques

Quickly compute all maximal cliques of a graph
GNU General Public License v3.0

Size of graph #2

Open sjackman opened 6 years ago

sjackman commented 6 years ago

What's the largest graph (number of vertices/edges) for which you've successfully run quick-cliques?

darrenstrash commented 6 years ago

Hi Shaun,

Thanks for your question. The algorithms have been run on graphs of a variety of sizes; however, not all algorithms support all sizes. The 'tomita' algorithm will only run on graphs with 10K to 100K nodes (depending on your system), as it uses an adjacency matrix -- so memory will be an issue.
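To see why the matrix is the bottleneck, here is a back-of-the-envelope sketch (assuming one byte per matrix entry; a bit-packed matrix would be 8x smaller, but the quadratic growth is the same):

```cpp
#include <cstdint>
#include <iostream>

// Rough memory footprint of a dense n x n adjacency matrix,
// assuming one byte per entry.
int main() {
    for (std::uint64_t n : {10'000ULL, 100'000ULL, 1'000'000ULL}) {
        std::uint64_t bytes = n * n;  // one byte per entry
        std::cout << n << " vertices -> "
                  << bytes / (1024.0 * 1024.0 * 1024.0) << " GiB\n";
    }
    return 0;
}
// 10K vertices  -> ~0.09 GiB (fine)
// 100K vertices -> ~9.3 GiB  (already pushing a desktop machine)
// 1M vertices   -> ~931 GiB  (infeasible on most systems)
```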

For the other algorithms, I have run them on sparse graphs with a few million nodes. Others have run the degeneracy algorithm on tens of millions of nodes (50M in "pbitMCE: A bit-based approach for maximal clique enumeration on multicore processors", https://doi.org/10.1109/PADSW.2014.7097844). But this requires a machine with plenty of memory (with only 4GB of RAM, the algorithm runs out of memory: "Fast Algorithms for Maximal Clique Enumeration with Limited Memory", http://wan.poly.edu/KDD2012/docs/p1240.pdf). What size graphs are you considering?

Best, Darren


sjackman commented 6 years ago

Hi, Darren. Thanks for your response. In really rough ballpark numbers: one million vertices and ~1,000 edges per vertex, so on the order of half a billion distinct edges (a billion edge endpoints). I'm interested in enumerating all the k-cliques for small values of k. I'm not sure how small yet; somewhere in the neighbourhood of 3 to 9. Our largest machines have ~2.5 TB of RAM, and our commodity machines have 384 GB. Does that seem feasible?

sjackman commented 6 years ago

@warrenlr Quick Cliques is a tool to "Quickly compute all maximal cliques of a graph".

sjackman commented 6 years ago

@darrenstrash Knowing the size and content of the largest clique may also be interesting. Is that feasible for a graph this large?

darrenstrash commented 6 years ago

@sjackman: The memory should be no problem for your machines; I think it should even work on a commodity machine. However, based on your requirements, quick-cliques may not be the right tool, and whether it (and other tools) work well will depend on whether the graph is uniformly sparse. If it isn't, there are graphs with as few as 500 nodes that won't complete in a reasonable length of time, simply because they contain so many cliques (by the Moon-Moser bound, a graph on n vertices can have up to 3^(n/3) maximal cliques in the worst case).
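As a quick sanity check on the memory claim, here is a minimal sketch, assuming a plain adjacency-list representation with 4-byte vertex ids and ignoring container overhead:

```cpp
#include <cstdint>
#include <iostream>

// Rough adjacency-list memory for the graph described above:
// ~1M vertices with ~1,000 neighbors each. Assumes 4-byte vertex ids;
// each undirected edge is stored once per endpoint.
int main() {
    std::uint64_t vertices = 1'000'000;
    std::uint64_t avgDegree = 1'000;
    std::uint64_t endpoints = vertices * avgDegree;   // ~1e9 stored ids
    std::uint64_t bytes = endpoints * sizeof(std::uint32_t);
    std::cout << bytes / (1024.0 * 1024.0 * 1024.0) << " GiB\n";  // ~3.7 GiB
    return 0;
}
```

Even with generous overhead for the surrounding data structures, that is far below 384 GB, let alone 2.5 TB.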

Maximum clique

Finding a maximum clique in such a large graph is easy as long as the graph is uniformly sparse; low degeneracy is one indicator, though not a guarantee (you can compute the degeneracy with compdegen in quick-cliques). If there are fairly dense subgraphs, it may not be so easy. There is an algorithm by Pablo San Segundo and others, BBMCSP, that will handle such a graph easily if it is uniformly sparse; a Linux executable is linked as reference [33] in the paper, here: http://venus.elai.upm.es/logs/results_sparse/.
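For reference, the degeneracy of a graph is the smallest d such that every subgraph has a vertex of degree at most d, and it can be computed by repeatedly removing a vertex of minimum remaining degree. The sketch below illustrates that peeling idea only; it is not the compdegen implementation from this repository:

```cpp
#include <algorithm>
#include <set>
#include <utility>
#include <vector>

// Degeneracy by minimum-degree peeling: repeatedly remove a vertex of
// minimum remaining degree; the answer is the largest degree seen at
// removal time. Uses std::set as a priority structure (O(m log n));
// a bucket-based version runs in linear time.
int degeneracy(std::vector<std::vector<int>> const &adj) {
    int n = static_cast<int>(adj.size());
    std::vector<int> degree(n);
    std::set<std::pair<int, int>> byDegree;  // (current degree, vertex)
    for (int v = 0; v < n; ++v) {
        degree[v] = static_cast<int>(adj[v].size());
        byDegree.insert({degree[v], v});
    }
    std::vector<bool> removed(n, false);
    int result = 0;
    while (!byDegree.empty()) {
        auto [d, v] = *byDegree.begin();     // vertex of minimum remaining degree
        byDegree.erase(byDegree.begin());
        removed[v] = true;
        result = std::max(result, d);
        for (int w : adj[v]) {               // peel v: each neighbor loses one degree
            if (removed[w]) continue;
            byDegree.erase({degree[w], w});
            byDegree.insert({--degree[w], w});
        }
    }
    return result;
}
```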

Clique enumeration

Now, for listing cliques, let's see if maximal clique listing is what you really want. quick-cliques lists all maximal cliques, and I can quickly evaluate whether it can be easily modified to do what you're looking for. Please answer the following:

  1. Do you want to compute all cliques of a single size k, or for a range of sizes?
  2. Do you only want to compute maximal cliques (cliques not contained in a larger clique)? Or do you just want any cliques of the given size? There are potentially many more cliques than there are maximal cliques.

Again, uniform sparsity will play a role here. If the graph is not uniformly sparse, then listing all cliques (even all maximal ones) may not be feasible.

sjackman commented 6 years ago

Our graph is weighted, with the weight of each edge indicating the similarity of its two vertices. If necessary, we can retain for each vertex only the n edges to its most similar neighbours, reducing the maximum degree to a specified threshold.

> Do you only want to compute maximal cliques (cliques not contained in a larger clique)? Or do you just want any cliques of the given size? There are potentially many more cliques than there are maximal cliques.

Maximal cliques would be better, I think, if only to reduce the size of the output. If, for example, there is a 5-clique, it seems more efficient to report that one 5-clique than the five 4-cliques (and ten 3-cliques) contained within it.

> Do you want to compute all cliques of a single size k, or for a range of sizes?

For the above reason, I think reporting for a range of sizes is better.

darrenstrash commented 6 years ago

OK, then quick-cliques should work for you, but its speed will still depend on the structure of the graph. If you want to list all maximal cliques, that is easiest: just activate the preprocessor definition LIST_CLIQUES_ONE_BY_ONE in the makefile, and it will write each maximal clique, one per line, to standard output. Use the degeneracy algorithm with the --algorithm=degeneracy flag; it should be the fastest.

If you only want to write out maximal cliques within a certain size range, it's easy to add an if statement to the lambda function printClique on line 226 of main.cpp, such as:

if (clique.size() < smallest_clique_size || clique.size() > largest_clique_size) return;

That should give you what you're looking for.
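For context, here is a hedged sketch of how that filter might sit inside a printClique-style callback; the actual lambda in main.cpp may take a different argument type, and the size bounds are names you would define yourself:

```cpp
#include <cstddef>
#include <iostream>
#include <list>

// Hypothetical size bounds; define these to suit your experiment.
std::size_t const smallest_clique_size = 3;
std::size_t const largest_clique_size  = 9;

// Sketch of a printClique-style callback: skip cliques outside the
// desired size range, otherwise print one clique per line.
void printClique(std::list<int> const &clique) {
    if (clique.size() < smallest_clique_size || clique.size() > largest_clique_size) return;
    for (int vertex : clique) std::cout << vertex << " ";
    std::cout << "\n";
}
```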

If it's too slow (more than an hour), try updating the if statement on line 772 of DegeneracyAlgorithm.cpp to:

if (beginP >= beginR || partialClique.size() >= largest_clique_size) return;
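To see what that prune does, here is a minimal, generic Bron-Kerbosch sketch (without pivoting, and deliberately not the code from DegeneracyAlgorithm.cpp): once the partial clique R reaches the size cap, the branch is cut off instead of being extended further.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

std::size_t const largest_clique_size = 9;  // hypothetical cap

// Minimal Bron-Kerbosch sketch (no pivoting). R is the partial clique,
// P the candidate vertices, X the already-excluded vertices.
void listCliques(std::vector<int> &R, std::vector<int> P, std::vector<int> X,
                 std::vector<std::vector<bool>> const &adj) {
    if (P.empty() && X.empty()) {
        for (int v : R) std::cout << v << " ";  // R is a maximal clique
        std::cout << "\n";
        return;
    }
    // The prune: once R has reached the size cap, stop extending this branch.
    if (R.size() >= largest_clique_size) return;
    while (!P.empty()) {
        int v = P.back();
        P.pop_back();
        std::vector<int> newP, newX;
        for (int u : P) if (adj[v][u]) newP.push_back(u);  // P intersect N(v)
        for (int u : X) if (adj[v][u]) newX.push_back(u);  // X intersect N(v)
        R.push_back(v);
        listCliques(R, newP, newX, adj);
        R.pop_back();
        X.push_back(v);  // v can no longer seed a new maximal clique
    }
}
```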

If it's still too slow, then maybe the structure of the graph prevents efficient listing with quick-cliques. If that's the case, please let me know! I'd be happy to look at the data and see what the problem is, and see if other techniques can be applied.

If it uses too much memory, let's talk again and I can help troubleshoot.

sjackman commented 6 years ago

Cool. Thanks for all your help, Darren! I'll try it out.

I ran qc on data/biogrid/biogrid-human, and the largest clique it found has 13 members. I modified it not to output cliques of size 2; it found 26,142 vertices in 7,329 cliques of size 3 or larger. I'm working with a biological data set myself. What does this biogrid-human graph represent?