cwatson / brainGraph

Graph theory analysis of brain MRI data

Parallel processing for calculating efficiency of very large graphs #15

Closed richpauloo closed 3 years ago

richpauloo commented 4 years ago

As discussed in this Stack Overflow question, brainGraph::efficiency() uses igraph::distances() to calculate distance matrices.

When the number of vertices is large, this can easily eat up all of the memory of a local machine.
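To make that concrete, here is a back-of-the-envelope estimate of the memory a dense distance matrix needs (the vertex count below is illustrative only):

# A full distance matrix is n x n doubles at 8 bytes per entry
n <- 50000              # e.g., a voxel-level graph
8 * n^2 / 2^30          # roughly 18.6 GiB, before any intermediate copies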

A solution that @cwatson proposed on SO is to run the calculation in parallel for sufficiently large graphs. I'm unsure of where to draw this line.

Moreover, since even parallel calculations can easily eat up memory and take hours to run, I want to add that the best approach might be to throw an error that prompts the user to either:

cwatson commented 4 years ago

Thanks for posting here. igraph is supposed to work for graphs with 1M+ vertices, but I do not know whether those cases typically involve relatively sparse graphs. brainGraph, on the other hand, was originally "only intended for" graphs with up to ~1,000 vertices (at least, I never really considered the case of very large, dense graphs). It may be best to contact the igraph folks for their thoughts.

It is true that it would be difficult to determine the cutoff. I did some quick testing on my computer (a laptop with a 2.5 GHz i7-6500U, 2 cores/4 threads) and found that at around 650-675 vertices the parallel solution started to become faster (at first, by only a few milliseconds). My first thought would be to simply check the vertex count and switch between a foreach loop across vertices and the function as-is; i.e.,

n <- vcount(g)
if (n > 650) {
  # Parallel: compute one row of the distance matrix per iteration
  D <- foreach(i=seq_len(n), .combine='rbind', .packages='igraph') %dopar% {
    distances(g, v=i, weights=weights)
  }
} else {
  ...   # keep the current single-call implementation
}
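For completeness, here is a minimal end-to-end sketch of that approach, assuming a doParallel backend and a toy random graph; the closing efficiency calculation is the generic mean-of-inverse-distances definition, not necessarily the exact code path inside brainGraph::efficiency().

library(igraph)
library(doParallel)   # also attaches foreach and parallel

# Register the backend once per session (as in the README / User Guide examples)
cl <- makeCluster(2)
registerDoParallel(cl)

g <- sample_gnp(1000, p=0.05)   # toy graph for illustration
n <- vcount(g)

# Shortest-path distances, one row per parallel task
D <- foreach(i=seq_len(n), .combine='rbind', .packages='igraph') %dopar% {
  distances(g, v=i)
}

# Global efficiency: mean of inverse distances over all distinct vertex pairs
E.glob <- mean(1/D[upper.tri(D)])

stopCluster(cl)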

The current use.parallel argument could jump to that automatically. Currently, I expect the user to set the number of cores and register the cluster at the start of the session (using the example code I give in README.md and the User Guide). An alternative would be to have global options a la the boot package (i.e., something like brainGraph.parallel and brainGraph.ncpus) and add code to use either parallel or snow. I would like to avoid this because it would lengthen every function that uses foreach, but it is certainly doable.
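If the option-based route were taken, each function that uses foreach might open with a small guard along these lines; this is only a rough sketch, and setup_parallel() plus the brainGraph.parallel / brainGraph.ncpus option names are the hypothetical names floated above, not an existing API.

# Hypothetical helper mirroring how boot consults boot.parallel/boot.ncpus
setup_parallel <- function() {
  use.par <- isTRUE(getOption('brainGraph.parallel', FALSE))
  ncpus <- getOption('brainGraph.ncpus', 1L)
  if (use.par && ncpus > 1L) {
    cl <- parallel::makeCluster(ncpus)
    doParallel::registerDoParallel(cl)
    return(cl)   # caller is responsible for parallel::stopCluster(cl)
  }
  foreach::registerDoSEQ()   # fall back to sequential foreach
  invisible(NULL)
}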

I will think about this issue more and come up with a solution. I am not sure if it will make it into the next release (v3.0.0), which is coming soon.