jmarshallnz opened 3 years ago
Thanks for the contribution! I had never tested on a graph that large before, and these functions were written pretty early in the package's history.
Could you provide some more benchmark info? Specifically: which versions of R and igraph are you using? I seem to remember that back when I worked on these functions, using the adjacency matrix was much faster, but igraph has gone through a lot of changes since then. It'll take me some time to merge this; I have a few bug fixes that I would like to get to first.
I did indeed use igraph's native functions when I first wrote those functions; I changed them to use the adjacency matrix about 4 years ago. I recall the graph-based versions having been much slower at the time, so if you don't mind testing smaller graphs (I think 10K is larger than almost any neuroimaging application) and varying the density and clustering, that would be really helpful.
If it is only beneficial for very large graphs, it might make sense to have different functionality based on the vertex count.
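For illustration, a size-based switch could look like the following sketch; the wrapper name and cutoff are hypothetical, not the package's actual code, and the example computation (triangles per vertex) just stands in for whatever the functions really do:

```r
library(igraph)

# Hypothetical dispatch on vertex count: dense adjacency matrix for small
# graphs, igraph-native functions otherwise. Illustrative only.
triangles_per_vertex <- function(g, cutoff = 1000) {
  if (vcount(g) < cutoff) {
    A <- as_adjacency_matrix(g, sparse = FALSE)
    diag(A %*% A %*% A) / 2   # (A^3)_ii counts closed 3-walks; halve for triangles
  } else {
    count_triangles(g)        # igraph-native path: no n x n allocation
  }
}

g <- sample_gnp(200, 0.05)
all(triangles_per_vertex(g) == count_triangles(g))  # TRUE: same result either way
```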
Indeed, the speedup is largest for graphs with a high vertex count and a low average degree (which of course is exactly my use case!). It is slower for small graphs, but for anything larger than 1000 vertices it is faster, and memory use is lower in all cases (much lower when the graph is sparse). Our actual use case had 28k vertices, which was starting to stretch memory limits.
Essentially, the existing code's runtime is dominated by the graph <-> adjacency matrix conversion. `choose.edges()` itself is generally faster with the adjacency-matrix method (on the order of twice as fast), but this reverses once the allocation and population of the large matrices is accounted for. If smaller (n < 1000) graphs are the dominant use case, then having two methods is probably the way to go; I notice the sna package checks the size and uses adjacency matrices for similar computations when n < 1000.
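To make the conversion cost concrete, here is a minimal comparison of the two styles (illustrative only, not the actual package code):

```r
library(igraph)

g <- sample_gnp(1000, 0.01)

# Matrix style: the graph -> matrix conversion allocates an n x n structure
# (~8 MB of doubles here; ~800 MB at n = 10000) before any real work happens
A <- as_adjacency_matrix(g, sparse = FALSE)
deg_mat <- rowSums(A)

# igraph-native style: operates on the graph directly, no conversion
deg_nat <- degree(g)

all(deg_mat == deg_nat)  # TRUE: identical results, very different memory profiles
```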
Benchmark: n is the number of vertices and lambda is the average (Poisson) vertex degree; graphs have no loops or multiple edges.
| n     | lambda | median (old) | median (new) | mem_alloc (old) | mem_alloc (new) |
|------:|-------:|-------------:|-------------:|----------------:|----------------:|
|    50 |      5 |      17.86ms |       44.5ms |           1.7MB |          1.25MB |
|    50 |     10 |      18.79ms |      48.58ms |          1.09MB |          1.09MB |
|   100 |      5 |      18.02ms |      32.66ms |           5.5MB |        896.01KB |
|   100 |     10 |      20.18ms |      41.05ms |          5.47MB |          1.97MB |
|   100 |     50 |      68.86ms |     149.12ms |         13.99MB |          6.73MB |
|  1000 |      5 |     172.35ms |      40.92ms |        238.34MB |          3.69MB |
|  1000 |     10 |     244.55ms |      54.76ms |        262.47MB |             8MB |
|  1000 |     50 |     581.44ms |      149.1ms |        341.03MB |         41.72MB |
|  1000 |    100 |           1s |     604.83ms |        377.79MB |         90.76MB |
|  1000 |    500 |       30.01s |       19.05s |        762.95MB |         457.1MB |
| 10000 |      5 |       23.64s |     282.98ms |         23.11GB |         32.82MB |
| 10000 |     10 |       27.02s |     367.09ms |         23.12GB |         56.73MB |
| 10000 |     50 |       42.96s |        1.78s |         34.41GB |         269.7MB |
| 10000 |    100 |       41.14s |        6.59s |         45.78GB |        529.02MB |
| 10000 |    500 |        2.33m |        1.32m |         44.72GB |          4.79GB |
These benchmarks were run with igraph 1.2.6 and R 4.0.0.
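For completeness, a sketch of how such a harness can be set up; the two benchmarked expressions are simple stand-ins for the old (matrix) and new (native) code paths, not the real functions:

```r
library(igraph)
library(bench)

# sample_gnp(n, lambda / (n - 1)) gives approximately Poisson degrees with
# mean lambda and, by construction, no loops or multiple edges
run_bench <- function(n, lambda) {
  g <- sample_gnp(n, lambda / (n - 1))
  bench::mark(
    old = { A <- as_adjacency_matrix(g, sparse = FALSE); sum(A) / 2 },
    new = ecount(g),       # same quantity, computed without the conversion
    check = FALSE
  )[, c("expression", "median", "mem_alloc")]
}

run_bench(1000, 5)
```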
This makes `sim.rand.graph.clust` significantly faster and much less memory hungry. The adjacency matrix was only really used for fairly standard operations that igraph supplies natively anyway. For large graphs this makes it far faster: on a graph with 10000 vertices it is about 100 times faster, with around 1000 times less memory allocation. Rejoice! :)