cwatson / brainGraph

Graph theory analysis of brain MRI data

Don't use an adjacency matrix for choose.edges. Way, way faster. #26

Open · jmarshallnz opened this pull request 3 years ago

jmarshallnz commented 3 years ago

This makes sim.rand.graph.clust significantly faster and much less memory-hungry.

The adjacency matrix was only really used for pretty standard functions that igraph supplies anyway.
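As an illustration of the kind of change involved (a hedged sketch, not the PR's actual diff): a neighbor lookup that round-trips through a dense matrix can usually be expressed directly with igraph accessors, avoiding the n x n allocation entirely.

```r
library(igraph)

g <- sample_gnp(100, 0.05)

# Matrix route: materializes an n x n object just to read one row
A <- as_adjacency_matrix(g, sparse = FALSE)
nbrs_matrix <- which(A[1, ] == 1)

# igraph-native route: same information, no n x n allocation
nbrs_native <- as.integer(neighbors(g, 1))
```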

For large graphs, this makes it way faster. Comparison on a graph with 10000 vertices:

  expression   min    median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
1 old (matrix) 27s    27s       0.0370    25.4GB    0.407     1    11        27s
2 new (igraph) 290ms  311ms     3.22      36.1MB    4.83      2     3      621ms

So about 100 times faster, and roughly 700 times less memory allocated. Rejoice! :)
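For reference, output like the above comes from the bench package; here is a sketch of how such a comparison might be run. The two function names are placeholders standing in for the old (adjacency-matrix) and new (igraph-native) implementations, not actual brainGraph exports.

```r
library(igraph)
library(bench)

g <- sample_gnp(10000, 5 / 9999)  # sparse 10k-vertex graph, mean degree ~5

# choose_edges_old / choose_edges_new are hypothetical names for the two
# implementations being compared in this PR.
bench::mark(
  old = choose_edges_old(g),
  new = choose_edges_new(g),
  check = FALSE,       # results may legitimately differ (random edge choices)
  min_iterations = 2
)
```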

cwatson commented 3 years ago

Thanks for the contribution! I had never tested on a graph that large before, and these functions were written pretty early in the package's history.

Could you provide some more benchmark info?

It'll take me some time to merge this; I have a few bug fixes that I would like to get to first.

cwatson commented 3 years ago

I did indeed use igraph's native functions when I first wrote those functions; I changed them to use the adjacency matrix about 4 years ago. I recall the graph functions having been much slower at the time, so if you don't mind testing smaller graphs (I think 10K vertices is larger than almost any neuroimaging application) and varying the density and clustering, that would be really helpful.

If it is only beneficial for very large graphs, it might make sense to have different functionality based on the vertex count.
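A minimal sketch of such size-based dispatch; the threshold and the two helper functions are hypothetical, not part of brainGraph's API (the sna package, mentioned in the reply below, uses a similar n < 1000 cutoff):

```r
library(igraph)

# Hypothetical dispatcher: pick the implementation based on graph size.
choose.edges <- function(g, threshold = 1000L) {
  if (vcount(g) < threshold) {
    choose_edges_matrix(g)   # dense adjacency-matrix method (faster for small g)
  } else {
    choose_edges_igraph(g)   # igraph-native method (faster and leaner for large g)
  }
}
```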

jmarshallnz commented 3 years ago

Indeed, the speedup is largest for graphs with a high vertex count and low average degree (which of course is my particular use case!). It is slower for small graphs, but for anything larger than about 1000 vertices it's faster, and memory use is lower in all cases (much lower where the graph is sparse). Our actual use case had 28k vertices, which was starting to stretch memory limits.

Essentially the existing code's runtime is dominated by the graph <-> adjacency matrix conversions. choose.edges() itself is faster with the adjacency-matrix method (on the order of twice as fast), but this reverses once the allocation and filling of the large matrices is accounted for. If smaller (n < 1000) graphs are the dominant use case, then having two methods is probably the way to go; I notice the sna package checks size and uses adjacency matrices for similar computations when n < 1000.

Benchmark: n is the number of vertices and lambda is the mean Poisson vertex degree. No loops or multiple edges.

       n lambda median_old median_new mem_alloc_old mem_alloc_new
   <dbl>  <dbl> <chr>      <chr>      <chr>         <chr>        
 1    50      5 17.86ms    44.5ms     1.7MB         1.25MB       
 2    50     10 18.79ms    48.58ms    1.09MB        1.09MB       
 3   100      5 18.02ms    32.66ms    5.5MB         896.01KB     
 4   100     10 20.18ms    41.05ms    5.47MB        1.97MB       
 5   100     50 68.86ms    149.12ms   13.99MB       6.73MB       
 6  1000      5 172.35ms   40.92ms    238.34MB      3.69MB       
 7  1000     10 244.55ms   54.76ms    262.47MB      8MB          
 8  1000     50 581.44ms   149.1ms    341.03MB      41.72MB      
 9  1000    100 1s         604.83ms   377.79MB      90.76MB      
10  1000    500 30.01s     19.05s     762.95MB      457.1MB      
11 10000      5 23.64s     282.98ms   23.11GB       32.82MB      
12 10000     10 27.02s     367.09ms   23.12GB       56.73MB      
13 10000     50 42.96s     1.78s      34.41GB       269.7MB      
14 10000    100 41.14s     6.59s      45.78GB       529.02MB     
15 10000    500 2.33m      1.32m      44.72GB       4.79GB     

Benchmarks run with igraph 1.2.6 under R 4.0.0.
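The PR doesn't show the generator, but one plausible way to build such test graphs (Poisson degrees, no loops or multiple edges) with igraph is sketched below; gen_test_graph is a hypothetical helper.

```r
library(igraph)

# Random simple graph with n vertices and Poisson(lambda) degrees.
# sample_degseq() requires an even degree sum; the "simple.no.multiple"
# method resamples until the realization has no loops or multiple edges.
gen_test_graph <- function(n, lambda) {
  degs <- rpois(n, lambda)
  if (sum(degs) %% 2 == 1) degs[1] <- degs[1] + 1  # make degree sum even
  sample_degseq(degs, method = "simple.no.multiple")
}

g <- gen_test_graph(1000, 10)
```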