UrbanAnalyst / dodgr

Distances on Directed Graphs in R
https://urbananalyst.github.io/dodgr/
128 stars 16 forks source link

dodgr_flows_aggregate aggregates integer flows into non-integers #137

Closed mem48 closed 3 years ago

mem48 commented 4 years ago

If all the flow values input to dodgr_flows_aggregate are whole numbers I would expect all the resulting flows to also be whole numbers, as they are simple sums.

Reprex

graph <- weight_streetnet (hampi)
from <- sample (graph$from_id, size = 10)
to <- sample (graph$to_id, size = 5)
to <- to [!to %in% from]
flows <- matrix (10 * sample(length (from) * length (to)),
                 nrow = length (from))

graph <- dodgr_flows_aggregate (graph, from = from, to = to, flows = flows)
flow <- graph$flow[graph$flow > 0]
summary(flow - floor(flow))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.03492 0.33333 0.53618 0.53750 0.74831 0.96939
mem48 commented 4 years ago

@unbrother you will be intrested in this

mpadge commented 4 years ago

Thanks @mem48, and you are indeed correct, but that is an intentional design decision as explained in #89. Everything is hard-coded as <double> (in C++ terms), and not <int>, because of the way the different heap sort algorithms are exposed within the code. However, if i do decide to pursue #117, which i suspect should be done, then that problem will no longer apply, and i'll be able to properly template the interfaces for both <int> and <double>. I'll keep this open as a parallel issue to that one, and keep you informed.

In the meantime, it would help if you could opine on #117 - have you ever resorted to anything but the default heap parameter? I suspect nobody really has, and it doesn't make much difference except for network structures that are wildly different from typical spatial networks. Even then, having a proper <int> rather than <double> heap is likely to make at least as much difference as different heaps, because <int> sorting is more efficient.

mem48 commented 4 years ago

Ok, but if these are floating point errors then shouldn't they be very small. I could understand getting a flow of 5.000001 but a flow of 5.5 seems like a large error.

I did a few more tests

library(dodgr)
graph <- weight_streetnet (hampi)
from <- sample (graph$from_id, size = 10)
to <- sample (graph$to_id, size = 5)
to <- to [!to %in% from]
flows <- matrix(rep(1, length(from) * length(to)),
                 nrow = length(from))

graph1 <- dodgr_flows_aggregate(graph, from = from, to = to, flows = flows, heap = "FHeap")
graph2 <- dodgr_flows_aggregate(graph, from = from, to = to, flows = flows, heap = "BHeap")
graph3 <- dodgr_flows_aggregate(graph, from = from, to = to, flows = flows, heap = "TriHeap")
#crashes graph4 <- dodgr_flows_aggregate(graph, from = from, to = to, flows = flows, heap = "TriHeapExt")
graph5 <- dodgr_flows_aggregate(graph, from = from, to = to, flows = flows, heap = "Heap23")

identical(graph1, graph2) #TRUE
identical(graph1, graph3) #TRUE
identical(graph1, graph5) #TRUE

summary(graph1$flow[graph1$flow > 0])
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.008475 0.015633 0.026527 0.033479 0.042153 0.090516 

Changing the heap made no difference except for TriHeapExt which crashed R (another issue to open). So from my perspective, they are not very useful. I also don't fully understand why and when to choose another heap, so I expect most users are using the default.

Also, notice the flow values I get. My input was a matrix of ones, so I would expect the minimum non-zero flow to be 1 and all other values to be a multiple of 1. Instead, I get some very small numbers.

As the hampi example is quite complex I tried something even simpler.

# Make a simple Tree in igraph
library(igraph)
graph_i <- make_tree(19,2,"undirected")
plot(graph_i)

# Convert to dodgr and make undirected
graph <- as_data_frame(graph_i)
graph2 <- graph[,2:1]
names(graph2) <- c("from","to")
graph <- rbind(graph, graph2)
graph$d <- 1

# Flow of 1 from all to all (same as edge betweenness)
flows <- matrix(rep(1, 19 * 19), ncol = 19)
graph <- dodgr_flows_aggregate(graph, from = 1:19, to = 1:19, flows = flows)
graph$flow

[1] 19.452381 18.452381 19.619048 12.866667 12.076190 12.076190 12.492857 12.492857  5.283333
[10]  5.283333  5.054762  5.054762  5.054762  5.054762  5.171429  5.171429  5.171429  5.171429
[19] 19.452381 18.452381 19.619048 12.866667 12.076190 12.076190 12.492857 12.492857  5.283333
[28]  5.283333  5.054762  5.054762  5.054762  5.054762  5.171429  5.171429  5.171429  5.171429

# Do edge betweeness in igraph
E(graph_i)$between <- edge.betweenness(graph_i)
E(graph_i)$between
[1] 88 84 84 48 48 48 48 48 18 18 18 18 18 18 18 18 18 18

I would have expected the dodgr results to be half of the igraph results as the dodgr graph if directed but has twice the edges. But these results are very different.

mpadge commented 4 years ago

Aha, i see. It's the norm_sums parameter of the dodgr_flows_... functions, which defaults to norm_sums = TRUE, for reasons demonstrated in #121. As explained in all the help files,

' @note The norm_sums parameter should be used whenever densities at origins

' and destinations are absolute values, and ensures that the sum of resultant

' flow values throughout the entire network equals the sum of densities at all

' origins. For example, with norm_sums = TRUE (the default), a flow from a

' single origin with density one to a single destination along two edges will

' allocate flows of one half to each of those edges, such that the sum of flows

' across the network will equal one, or the sum of densities from all origins.

' The norm_sums = TRUE option is appropriate where densities are relative

' values, and ensures that each edge maintains relative proportions. In the

' above example, flows along each of two edges would equal one, for a network

' sum of two, or greater than the sum of densities.

library (dodgr)
graph <- weight_streetnet (hampi)
from <- sample (graph$from_id, size = 10)
to <- sample (graph$to_id, size = 5)
to <- to [!to %in% from]
flows <- matrix(rep(1, length(from) * length(to)),
                                 nrow = length(from))

graph <- dodgr_flows_aggregate(graph, from = from, to = to, flows = flows)
summary(graph$flow[graph$flow > 0])
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.003401 0.011322 0.013446 0.023629 0.031730 0.079733

graph <- dodgr_flows_aggregate(graph, from = from, to = to, flows = flows,
                               norm_sums = FALSE)
#> Warning in prepare_graph(graph, from, to): graph already has a 'flow' column;
#> this will be overwritten
summary(graph$flow[graph$flow > 0])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   3.000   4.000   4.114   5.000  10.000

# Make a simple Tree in igraph
library(igraph)
graph_i <- make_tree(19,2,"undirected")
#plot(graph_i)

# Convert to dodgr and make undirected
graph <- as_data_frame(graph_i)
graph2 <- graph[,2:1]
names(graph2) <- c("from","to")
graph <- rbind(graph, graph2)
graph$d <- 1

# Flow of 1 from all to all (same as edge betweenness)
flows <- matrix(rep(1, 19 * 19), ncol = 19)
graph <- dodgr_flows_aggregate(graph, from = 1:19, to = 1:19, flows = flows)
graph$flow
#>  [1] 19.452381 18.452381 19.619048 12.866667 12.076190 12.076190 12.492857
#>  [8] 12.492857  5.283333  5.283333  5.054762  5.054762  5.054762  5.054762
#> [15]  5.171429  5.171429  5.171429  5.171429 19.452381 18.452381 19.619048
#> [22] 12.866667 12.076190 12.076190 12.492857 12.492857  5.283333  5.283333
#> [29]  5.054762  5.054762  5.054762  5.054762  5.171429  5.171429  5.171429
#> [36]  5.171429

graph <- dodgr_flows_aggregate(graph, from = 1:19, to = 1:19, flows = flows,
                               norm_sums = FALSE)
#> Warning in prepare_graph(graph, from, to): graph already has a 'flow' column;
#> this will be overwritten
graph$flow
#>  [1] 88 84 84 48 48 48 48 48 18 18 18 18 18 18 18 18 18 18 88 84 84 48 48 48 48
#> [26] 48 18 18 18 18 18 18 18 18 18 18

# Do edge betweeness in igraph
E(graph_i)$between <- edge.betweenness(graph_i)
E(graph_i)$between
#>  [1] 88 84 84 48 48 48 48 48 18 18 18 18 18 18 18 18 18 18

Created on 2020-06-23 by the reprex package (v0.3.0)

... and actually, that was an additional explicit reason why i did not implement strict <int> aggregation - the norm_sums = TRUE option requires <double> throughout, meaning a default of <double> is the only sensible option. This norm_sums business was implemented because the default value of TRUE is really the only sensible option for aggregation of actual flows. Does that resolve this issue @mem48?

mem48 commented 4 years ago

Ok, norm_sums = FALSE gives me the results that I was expecting, so that is a solution.

But I don't understand when norm_sums = TRUE is useful. The way I would use dodgr_flows_aggregate is to have flows representing people traveling, so usually whole numbers. Then I want to count how many people pass over each part of the road network. In this case norm_sums = TRUE seems to give a relative rather than absolute measure? But the help documentation suggests that norm_sums = TRUE gives absolute values.

mpadge commented 4 years ago

Say you've got the following network:

With a flow of 10 from A -> C and a flow of 10 from D -> C. Direct aggregation of flows will then give the following:

from to flow
A B 10
B C 20
D B 10

And the total flow aggregated throughout the entire network is 40, even though only 10 units are flowing from each of A and D. It might rightly be asserted that this total flow should be constrained to the total amounts emanating from each source/origin point, and so be constrained to 10 + 10 = 20. This is what norm_sums does, resulting in flow values of:

from to flow
A B 5
B C 10
D B 5

So you can see what it effectively does is to "spread" the flow from, say A -> C, equally along each of the intervening edges. If there are N edges between any two origin -> destination points with a flow between them of f, then the flow on each of those intervening edges will be f / N if norm_flows = TRUE and the total flow throughout the whole network will equal f, otherwise the flow along each edge will be f, and the total flow throughout the whole network will equal f N.

One context for considering the appropriateness of this whole functionality is OSM. A single way might be represented by a bunch of nodes along it simply because it is highly curved. Aggregating flows along with way using norm_sums = FALSE will lead to an aggregate flow along that way that is simply proportional to the entirely arbitrary number of points used to define it. A more realistic interpretation of what flow aggregation is or should be might suggest that the flow between any two junction points should aggregate along the entire intervening way to the value specified between those two points, and so the flow for each segment should be the f / N values described above. Make sense?

mem48 commented 4 years ago

I suppose it depends on what you then do with the results. I'm plotting them on a map, so a highly curved road is clear. But if you did some statiticaly analysis of the network I can see the normalisation being useful.