UrbanAnalyst / dodgr

Distances on Directed Graphs in R
https://urbananalyst.github.io/dodgr/

[BUG] centrality fails on huge graphs #186

Closed: mpadge closed this issue 2 years ago

mpadge commented 2 years ago

On a contracted graph with > 500,000 edges, the centrality values are completely fragmented when plotted, and most of the network appears to have no centrality scores. The problem appears to lie within the C++ map-reduce algorithms, as demonstrated by the following reprex which uses pre-processed and locally-saved graphs and edge maps:

devtools::load_all ("/<path>/<to>/dodgr", export_all = TRUE)
#> ℹ Loading dodgr
setwd ("/<somewhere>/")
graph <- readRDS ("graph-std.Rds")
dim (graph)
#> [1] 555446      7

vert_map <- readRDS ("vert_map.Rds")
heap <- "BHeap"
dist_threshold <- .Machine$double.xmax
edges <- TRUE
nsamples <- 1:10
len <- vapply (nsamples, function (n) {
    cent <- rcpp_centrality (graph, vert_map, heap, dist_threshold, edges, n)
    length (which (is.finite (cent) & cent > 0))
    }, integer (1L))
plot (nsamples, len, "l", col = "red")
points (nsamples, len, pch = 19, col = "red")

Created on 2022-09-06 with reprex v2.0.2

Numbers of centrality values should increase monotonically with the number of sample points from which centrality is calculated. That graph shows that they only increase up to 4 samples, after which they actually start to decrease. So values are indeed being lost somewhere, and not correctly reduced back onto the vector positions where they are supposed to be aggregated. Visualising results out to nsamples = 4 gives the kinds of results expected, with connected and consistent spatial patterns of flows, but as soon as nsamples is set to 5, the result fragments and is clearly missing a heap of edges in the network.

mpadge commented 2 years ago

This is identical to the above, but the graph is less than half the size, at 200,000 edges instead of 500,000.

setwd ("/<path>/<to>/dodgr")
devtools::load_all (export_all = TRUE)
#> ℹ Loading dodgr
d <- "/<somewhere>/"
graph <- readRDS (file.path (d, "graph-std2.Rds"))
dim (graph)
#> [1] 201831      7
vert_map <- readRDS (file.path (d, "vert_map2.Rds"))
heap <- "BHeap"
dist_threshold <- .Machine$double.xmax
edges <- FALSE
nsamples <- 1:10
len <- vapply (nsamples, function (n) {
    cent <- rcpp_centrality (graph, vert_map, heap, dist_threshold, edges, n)
    length (which (is.finite (cent) & cent > 0))
    }, integer (1L))
plot (nsamples, len, "l", col = "red")
points (nsamples, len, pch = 19, col = "red")

Created on 2022-09-06 with reprex v2.0.2

Everything works perfectly there. That suggests some kind of memory issue might be at fault?

mpadge commented 2 years ago

==2288==
==2288== HEAP SUMMARY:
==2288==     in use at exit: 322,091,904 bytes in 30,616 blocks
==2288==   total heap usage: 2,841,664 allocs, 2,811,048 frees, 961,701,161 bytes allocated
==2288==
==2288== 3,168 bytes in 9 blocks are possibly lost in loss record 360 of 2,290
==2288==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2288==    by 0x40149DA: allocate_dtv (dl-tls.c:286)
==2288==    by 0x40149DA: _dl_allocate_tls (dl-tls.c:532)
==2288==    by 0x70D9322: allocate_stack (allocatestack.c:622)
==2288==    by 0x70D9322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==2288==    by 0x3436EB74: tthread::thread::thread(void (*)(void*), void*) (tinythread.h:901)
==2288==    by 0x34377C42: void RcppParallel::ttParallelReduce<OneCentralityVert>(unsigned long, unsigned long, OneCentralityVert&, unsigned long) (TinyThread.h:149)
==2288==    by 0x34373F8A: void RcppParallel::parallelReduce<OneCentralityVert>(unsigned long, unsigned long, OneCentralityVert&, unsigned long, int) (RcppParallel.h:64)
==2288==    by 0x3436701B: rcpp_centrality(Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>, Rcpp::DataFrame_Impl<Rcpp::PreserveStorage>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double, bool, int) (centrality.cpp:422)
==2288==    by 0x343D378A: _dodgr_rcpp_centrality (RcppExports.cpp:26)
==2288==    by 0x49487B3: R_doDotCall (dotcode.c:617)
==2288==    by 0x4948D74: do_dotcall (dotcode.c:1284)
==2288==    by 0x49A0892: Rf_eval (eval.c:851)
==2288==    by 0x49A3C82: do_begin (eval.c:2539)
==2288==
==2288== LEAK SUMMARY:
==2288==    definitely lost: 0 bytes in 0 blocks
==2288==    indirectly lost: 0 bytes in 0 blocks
==2288==      possibly lost: 3,168 bytes in 9 blocks
==2288==    still reachable: 322,088,736 bytes in 30,607 blocks
==2288==                       of which reachable via heuristic:
==2288==                         newarray           : 4,264 bytes in 1 blocks
==2288==         suppressed: 0 bytes in 0 blocks
==2288== Reachable blocks (those to which a pointer was found) are not shown.
==2288== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==2288==
==2288== For lists of detected and suppressed errors, rerun with: -s
==2288== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

mpadge commented 2 years ago

Update: This suspicion was misguided, but left for posterity anyway. Keep reading below ...

Current suspicion is that it's a numeric instability in the algorithm when trying to analyse centrality on graphs with lots of disconnected components. This instability doesn't really need to be resolved, as the more important issue is to ensure that centrality is only calculated on the largest fully-connected component of a graph. The above commit fixes graph contraction so that reduced versions of contracted graphs can be uncontracted to equivalently reduced full versions. This is important because contraction of compound junctions means that components of contracted graphs will not necessarily reflect those of the uncontracted graphs. Since centrality is always calculated on contracted graphs, this fix was needed so that contracted versions can be reduced to their largest connected component and then uncontracted again.
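
For reference, a minimal sketch of that work-around using the exported functions (the reprexes here call the internal rcpp_centrality directly, and hampi is just the sample data bundled with dodgr):

library (dodgr)
graph <- weight_streetnet (hampi, wt_profile = "foot")
graph <- dodgr_components (graph) # adds a 'component' column; the largest component is numbered 1
graph <- graph [graph$component == 1, ] # keep only the largest fully-connected component
cent <- dodgr_centrality (graph)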

mpadge commented 2 years ago

It was actually that edges were being traversed twice, and so doubly counted. This meant that every time lines like this were executed: https://github.com/ATFutures/dodgr/blob/3af5fb9f5e0de3b78e516d47c6f0ec950c8ae58d/src/centrality.cpp#L247 the sigma value was incremented by 2 instead of 1. The next increment was then 4, then 8, 16, 32, and the whole thing increased exponentially. Because those values are stored as <int>, overflows were reached very quickly on large graphs, leading to NAN and garbage values. Fixed by inserting an unordered_set of edges to ensure each is incremented only once. Results now have the expected form and behaviour:

setwd ("/<path>/<to>/dodgr")
devtools::load_all (export_all = TRUE)
#> ℹ Loading dodgr
d <- "/<somewhere>/"
graph <- readRDS (file.path (d, "graph-std.Rds"))
graph <- dodgr_components (graph)
vert_map <- readRDS (file.path (d, "vert_map.Rds"))

heap <- "BHeap"
dist_threshold <- .Machine$double.xmax
edges <- FALSE
nsamples <- 1:10
len <- vapply (nsamples, function (n) {
    cent <- rcpp_centrality (graph, vert_map, heap, dist_threshold, edges, n)
    length (which (is.finite (cent) & cent > 0))
    }, integer (1L))
plot (nsamples, len, "l", col = "red", lwd = 2)
points (nsamples, len, pch = 19, col = "red")

Created on 2022-09-07 with reprex v2.0.2
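
For scale, a quick check of why the doubling described above so quickly produces garbage once the counter is stored as a 32-bit <int>:

.Machine$integer.max # 2147483647, i.e. 2^31 - 1
log2 (.Machine$integer.max) # ~31, so around 31 doubled increments are enough to overflow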

mpadge commented 2 years ago

And this all happened because the really large graph had duplicated edges. Those are not checked on the C++ graph-construction side, and so end up doubling the counts linked to above. A bit under half of all edges were actually duplicated, causing the exponential scaling of the centrality aggregation. Re-opening to implement a check for duplicated edges to prevent this before passing through to the C++ side.
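
A quick way to quantify that duplication directly in R (the from_id and to_id column names are assumed here, and may need adjusting to match the actual graph):

dup <- duplicated (graph [, c ("from_id", "to_id")]) # TRUE for repeated from-to pairs
sum (dup) / nrow (graph) # proportion of duplicated edges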

mpadge commented 2 years ago

Those commits add a pre-processing function in R which checks for duplicated edges and issues a warning, along with modifying the C++ "dgraph" construction to only add unique edges. The latter renders the unordered_set checks in the current code obsolete: https://github.com/ATFutures/dodgr/blob/7f9800efa7be1bf9bdeffbcbcc303115248ee748/src/centrality.cpp#L227-L236
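
Building on the duplicated() check above, a hypothetical version of such an R-side pre-processing helper might look something like this (not the actual function added in those commits):

check_duplicated_edges <- function (graph, from = "from_id", to = "to_id") {
    # hypothetical helper: warn about, and drop, duplicated from-to pairs
    dup <- duplicated (graph [, c (from, to)])
    if (any (dup)) {
        warning (sum (dup), " duplicated edges found; only the first instance of each will be retained")
    }
    graph [!dup, ]
}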

Just need to benchmark a bit to see how much that slows things down, and then likely remove those unordered_set checks to revert back to the original form.

mpadge commented 2 years ago

They benchmark identically, so the unordered_set can be left for now. I'll close with a commit which adds comments to that code to explain, and links to this issue.

Robinlovelace commented 1 year ago

Impressive debugging and documentation work going on there Mark, great work!

mpadge commented 1 year ago

Thanks Robin! This is all for a newly-started project - exciting times ahead!

Robinlovelace commented 1 year ago

Looking like a solid start, good luck with it and let me know if you want any input... Catch up soon!