UrbanAnalyst / dodgr

Distances on Directed Graphs in R
https://urbananalyst.github.io/dodgr/
127 stars 16 forks source link

pairwise argument for dodgr_dists_categorical() #201

Closed xiaofanliang closed 1 year ago

xiaofanliang commented 1 year ago

Hi @mpadge thanks for developing this awesome package.

I am using dodgr for a simple scenario calculation: calculating the road type breakdown for each OD path for two network scenarios (one with an added trail called Model Miles). In short, for each OD path, I want to associate it with a percentage of distance traveled on this trail.

I understood the output of dodgr_dists_categorical() but I think it did not perform according to pairwise expectations. Maybe the solution here is to add a pairwise argument just like the dodgr_dists() function or there is something deeper that goes wrong with how dodgr_dists_categorical() functions (or how I understand it).

Here is what my data and output look like. The f and t correspond to two pairs of OD points. The first row of f and the first row of t is one set of OD. ScenNx2 is my graph with the trail. dodgr_dists_categorical(ScenNx2, from = f, to = t, proportions_only = FALSE) returns 2 * 2 distance matrix, but what I really need for the pairwise OD are just the diagonal ones (2519.708 and 4155.978). This is also the same output when I run dodgr_dists(ScenNx2, from=f, to=t, pairwise=T).

What confuses me is the output with $Model Miles, $other, and dodgr_dists_categorical(ScenNx2, from = f, to = t, proportions_only = TRUE). First, I thought the $Model Miles is the distance traveled on this road type, but then the distance should be less than the output in $distance (but now it is greater). $Model Miles values and $other should also add up to values in $distances, since these are the only two road types.

Second, dodgr_dists_categorical(ScenNx2, from = f, to = t, proportions_only = TRUE) reports that road type Model Miles is 22% (0.219385) and this number comes from the sum of all matrix values in $Model Miles divided by the sum of all matrix values in $Model Miles and in $other.

Model Miles: 6298.588 + 125.2818 + 4155.978 + 2638.8970 = 13218.7448 Other: 438.4226 + 0 + 1838.2226 + 0 = 2276.6452 Total distance: 13218.7448 + 2276.6452 = 15495.39 Model Miles proportion: 13218.7448 / 15495.39 = 0.219385

This does not work well for my type of pairwise data because it included extra distances from unintended trips in the proportion calculation. I tried simulating the pairwise calculation by giving only one row of data. It can report portion value if proportions_only = TRUE, but not if proportions_only = FALSE.

I am clueless about what I can tweak at this point. Is there a way to do the pairwise calculation for dodgr_dists_categorical() just like dodgr_dists()? Can you explain why the distance by road type is different from the total distance?

Screen Shot 2023-04-14 at 4 29 31 PM
mpadge commented 1 year ago

Thanks @xiaofanliang. Could you please provide a minimal reprex() so I can investigate futher?

mpadge commented 1 year ago

Note in the meantime that there is currently no pairwise parameter in the categorical dists function, but implementing that would be fairly trivial. I'll document progression on that here. Your other question is then the one I'd need a reprex to proceed further with:

Can you explain why the distance by road type is different from the total distance?

I'm not sure exactly what you mean by that?

xiaofanliang commented 1 year ago

@mpadge Thanks for the reply! I am working on a reprex example. Implementation of the pairwise parameter in the categorical dists function will be extremely helpful!! Thanks in advance!!

Can you explain why the distance by road type is different from the total distance?

Let me explain this further. The sums of the distance matrices are like the following. In this case, sum of $Model Miles (13218.7448) + sum of $other (2276.6452) does NOT equal the sum of $distance (22977.07). Why is this the case?

$Model Miles: 6298.588 + 125.2818 + 4155.978 + 2638.8970 = 13218.7448 $other: 438.4226 + 0 + 1838.2226 + 0 = 2276.6452 $distance: 2519.708 + 6298.588 + 10002.796 + 4155.978 = 22977.07

In addition, as mentioned before, the pairwise distances are essentially diagonal values of the $distance matrix. In this case, total distance between row1 in f and row1 in t is 2519.708.

Applying the same logic to the $Model Miles and $other distance matrices, the Model Mile distance between row1 in f and row1 in t is 6298.588. This number is significantly higher than total distance 2519.708. Since Model Mile is just one type of road that constitute the total distance, how can it be so much higher?

mpadge commented 1 year ago

Thanks @xiaofanliang. I suspect it's a problem of interpretation, rather than algorithm, but am still going to need some kind of reprex to investigate further. In the meantime, the current dev branch has a pairwise parameter in dodgr_dists_categorical(). PR #206 if you want to give it a try.

xiaofanliang commented 1 year ago

Hi @mpadge I tested the pairwise argument. It seems to work, but I discovered two errors when using the new pairwise argument. I could not get my data to work in reprex(), so I use your data to demonstrate the two issues below:

  1. when running dodgr_dists_categorical(net, from, to, proportions_only = FALSE, pairwise=T), the distance sum of all road types does not equal the total.
  2. when creating a new edge_type value to merge values across multiple road types into fewer categories, the distance in the new edge_type value SOMETIMES is wrong.

The following codes should be replicable if you copy and paste them on your local R:

# ---- example codes copied from https://github.com/ATFutures/dodgr ------#
hampi <- dodgr_streetnet ("hampi india")
cols <- c ("osm_id", "highway", "oneway", "geometry")
hampi <- hampi [, which (names (hampi) %in% cols)]
net <- weight_streetnet (hampi, wt_profile = "motorcar")
#from_x <- min (net$from_lon) + runif (10) * diff (range (net$from_lon)); values below generated with this code. 
from_x <- c(76.48706,76.48190,76.47498,76.47130,76.39222,76.39313,76.45629,76.45122,76.40285,76.42479)
#from_y <- min (net$from_lat) + runif (10) * diff (range (net$from_lat)); values below generated with this code
from_y <- c(15.34002,15.32931,15.35103,15.34152,15.31357,15.30398,15.32568,15.35853,15.32874,15.34179)
#to_x <- min (net$from_lon) + runif (10) * diff (range (net$from_lon)); values below generated with this code
to_x <- c(76.46619,76.43946,76.48950,76.48310,76.48122,76.38395,76.38631,76.46487,76.40837,76.47299)
#to_y <- min (net$from_lat) + runif (10) * diff (range (net$from_lat)); values below generated with this code
to_y <- c(15.35190,15.32674,15.31100,15.32958,15.33937,15.34499,15.33454,15.33149,15.34025,15.32579)
from = cbind (from_x, from_y)
to = cbind (to_x, to_y)

Below are my experiments. This run is supposed to establish the correct distance values for all road types in edge_type before I made any changes. However, as you can see, even in the second row, the distance by road types, 1477.868 + 2795.662 + 1023.983 is NOT equal to the total distance, 8078.789.

net$edge_type = net$highway
dodgr_dists_categorical(net, from, to, proportions_only = FALSE, pairwise=T)

#                           total living_street  primary residential secondary service tertiary unclassified
# 313796408-2195425013      0.000             0    0.000           0     0.000       0    0.000            0
# 2398957799-5351820917  8078.789             0 1477.868           0  2795.662       0 1023.983            0
# 571423436-245631169       0.000             0    0.000           0     0.000       0    0.000            0
# 571423782-2398957800      0.000             0    0.000           0     0.000       0    0.000            0
# 2588119056-313796408  11169.871             0 1477.868           0  2795.662       0 1721.951            0
# 2588119056-1388481924     0.000             0    0.000           0     0.000       0    0.000            0
# 286632889-1388482509      0.000             0    0.000           0     0.000       0    0.000            0
# 2195424991-977478668      0.000             0    0.000           0     0.000       0    0.000            0
# 1204772772-1204772780  2420.725             0 2420.725           0     0.000       0    0.000            0
# 1388482473-339574837      0.000             0    0.000           0     0.000       0    0.000            0

This runs shows that values in other road types (e.g., service, tertiary, etc.) can be successfully added up and merged under the new road type 'other'. However, under this run, the sum of distance by road types adds up to the total now (in comparison with the not-able-to-add-up experiment above).

net$edge_type = net$highway
net$edge_type[grep("^(service|tertiary|unclassified|secondary|living_street)", net$edge_type)] <- "other"
dodgr_dists_categorical(net, from, to, proportions_only = FALSE, pairwise=T)

#                           total    other  primary residential
# 313796408-2195425013      0.000    0.000    0.000           0
# 2398957799-5351820917  8078.789 6600.921 1477.868           0
# 571423436-245631169       0.000    0.000    0.000           0
# 571423782-2398957800      0.000    0.000    0.000           0
# 2588119056-313796408  11169.871 9692.003 1477.868           0
# 2588119056-1388481924     0.000    0.000    0.000           0
# 286632889-1388482509      0.000    0.000    0.000           0
# 2195424991-977478668      0.000    0.000    0.000           0
# 1204772772-1204772780  2420.725    0.000 2420.725           0
# 1388482473-339574837      0.000    0.000    0.000           0

AND, the merged values start to get wrong when I try to merge residential into other. primary road type losts all values.

net$edge_type = net$highway
net$edge_type[grep("^(service|tertiary|unclassified|secondary|living_street|residential)", net$edge_type)] <- "other"
dodgr_dists_categorical(net, from, to, proportions_only = FALSE, pairwise=T)

#                           total    other primary
# 313796408-2195425013      0.000    0.000       0
# 2398957799-5351820917  8078.789 6600.921       0
# 571423436-245631169       0.000    0.000       0
# 571423782-2398957800      0.000    0.000       0
# 2588119056-313796408  11169.871 9692.003       0
# 2588119056-1388481924     0.000    0.000       0
# 286632889-1388482509      0.000    0.000       0
# 2195424991-977478668      0.000    0.000       0
# 1204772772-1204772780  2420.725    0.000       0
# 1388482473-339574837      0.000    0.000       0

This error is not specific to the residential road type, because when I switched up the order of which road types to merge into other first, primary road type lost all values when merging service to other instead.

net$edge_type = net$highway
net$edge_type[grep("^(residential|living_street|secondary|unclassified|tertiary)", net$edge_type)] <- "other"
dodgr_dists_categorical(net, from, to, proportions_only = FALSE, pairwise=T)

#                           total    other  primary service
# 313796408-2195425013      0.000    0.000    0.000       0
# 2398957799-5351820917  8078.789 6600.921 1477.868       0
# 571423436-245631169       0.000    0.000    0.000       0
# 571423782-2398957800      0.000    0.000    0.000       0
# 2588119056-313796408  11169.871 9692.003 1477.868       0
# 2588119056-1388481924     0.000    0.000    0.000       0
# 286632889-1388482509      0.000    0.000    0.000       0
# 2195424991-977478668      0.000    0.000    0.000       0
# 1204772772-1204772780  2420.725    0.000 2420.725       0
# 1388482473-339574837      0.000    0.000    0.000       0

net$edge_type = net$highway
net$edge_type[grep("^(residential|living_street|secondary|unclassified|tertiary|service)", net$edge_type)] <- "other"
dodgr_dists_categorical(net, from, to, proportions_only = FALSE, pairwise=T)

#                           total    other primary
# 313796408-2195425013      0.000    0.000       0
# 2398957799-5351820917  8078.789 6600.921       0
# 571423436-245631169       0.000    0.000       0
# 571423782-2398957800      0.000    0.000       0
# 2588119056-313796408  11169.871 9692.003       0
# 2588119056-1388481924     0.000    0.000       0
# 286632889-1388482509      0.000    0.000       0
# 2195424991-977478668      0.000    0.000       0
# 1204772772-1204772780  2420.725    0.000       0
# 1388482473-339574837      0.000    0.000       0
mpadge commented 1 year ago

Yep, thanks @xiaofanliang, here it is in reprex form confirming what you report:

library (dodgr)
packageVersion ("dodgr")
#> [1] '0.2.20.12'
net <- weight_streetnet (hampi, wt_profile = "foot")
v <- dodgr_vertices (net)
set.seed (1L)
from <- sample (v$id, 20)
to <- sample (v$id, 20)

net$edge_type = net$highway
d <- dodgr_dists_categorical(net, from, to, proportions_only = FALSE, pairwise=T)
index_rows <- which (d [, 1] > 0)
index_cols <- which (colnames (d) != "total")
res <- data.frame (
    "total" = d [index_rows, 1],
    "sum" = rowSums (d [index_rows, index_cols]),
    "equal" = d [index_rows, 1] == rowSums (d [index_rows, index_cols])
)
print (res)
#>                           total       sum equal
#> 339318471-2419196948  5172.9294 4955.5964 FALSE
#> 2398957876-2627476997  836.8328  818.8737 FALSE
#> 2195424940-2627428647 1795.0054 1673.2638 FALSE
#> 338905103-1128374495  2227.7895 2227.7895 FALSE
#> 339574319-6161853534  2788.8161 2656.2922 FALSE
#> 7799711031-2398957543 3780.5121 3780.5121 FALSE
#> 5525252565-3921523004 2748.0184 2748.0184 FALSE
#> 7799710973-6161853492 1914.3623 1914.3623  TRUE
#> 2627444769-3921514722 1140.0290 1140.0290 FALSE

Created on 2023-05-08 with reprex v2.0.2

The "equal" column should all be TRUE. I'll work out what's going wrong and fix asap.


Edit: Code should have "sum" = colSums, and not "sum" = rowSums.

mpadge commented 1 year ago

Fixed. Note that values calculated different ways are not exactly equal, so are tested within some very small rounding error (1e-6 here).

library (dodgr)
packageVersion ("dodgr")
#> [1] '0.2.20.18'
net <- weight_streetnet (hampi, wt_profile = "foot")
v <- dodgr_vertices (net)
set.seed (1L)
from <- sample (v$id, 20)
to <- sample (v$id, 20)

net$edge_type <- net$highway
d0 <- dodgr_dists_categorical(net, from, to, proportions_only = FALSE, pairwise = FALSE)
types <- which (names (d0) != "distances")
d_types <- lapply (types, function (i) colSums (d0 [[i]]))
d_types <- colSums (do.call (rbind, d_types))
d_total <- colSums (d0$distances)
data.frame (
    total = d_total,
    sum = d_types,
    equal = abs (d_total - d_types) < 1e-6
)
#>                total       sum equal
#> 286632883  38899.682 38899.682  TRUE
#> 2588155749 66999.189 66999.189  TRUE
#> 2419196948 54469.840 54469.840  TRUE
#> 2627476997 33217.929 33217.929  TRUE
#> 2627428647  7607.069  7607.069  TRUE
#> 1398747978 11663.242 11663.242  TRUE
#> 1128374495 68063.683 68063.683  TRUE
#> 6161853534 40645.116 40645.116  TRUE
#> 2398957543 50427.253 50427.253  TRUE
#> 2398957532 51522.167 51522.167  TRUE
#> 1204772726 17737.670 17737.670  TRUE
#> 1398747999 22056.134 22056.134  TRUE
#> 571423780  11508.429 11508.429  TRUE
#> 2608076439  9260.622  9260.622  TRUE
#> 3921523004 22183.273 22183.273  TRUE
#> 2195424935 10494.464 10494.464  TRUE
#> 7794286132 18434.969 18434.969  TRUE
#> 6161853492 42632.915 42632.915  TRUE
#> 1427080646 38541.099 38541.099  TRUE
#> 3921514722 15349.371 15349.371  TRUE

d1 <- dodgr_dists_categorical (net, from, to, proportions_only = FALSE, pairwise = TRUE)
index_rows <- which (d1 [, 1] > 0)
index_cols <- which (colnames (d1) != "total")
data.frame (
    "total" = d1 [index_rows, 1],
    "sum" = rowSums (d1 [index_rows, index_cols]),
    "equal" = abs (d1 [index_rows, 1] - rowSums (d1 [index_rows, index_cols])) < 1e-6
)
#>                           total       sum equal
#> 339318471-2419196948  5172.9294 5172.9294  TRUE
#> 2398957876-2627476997  836.8328  836.8328  TRUE
#> 2195424940-2627428647 1795.0054 1795.0054  TRUE
#> 338905103-1128374495  2227.7895 2227.7895  TRUE
#> 339574319-6161853534  2788.8161 2788.8161  TRUE
#> 7799711031-2398957543 3780.5121 3780.5121  TRUE
#> 5525252565-3921523004 2748.0184 2748.0184  TRUE
#> 7799710973-6161853492 1914.3623 1914.3623  TRUE
#> 2627444769-3921514722 1140.0290 1140.0290  TRUE

# Confirm that proportions are also correct:
d2 <- dodgr_dists_categorical (net, from, to, proportions_only = TRUE)
d <- vapply (d0, function (i) sum (colSums (i)), numeric (1L))
data.frame (
    prop_only = d2,
    post_cal = d [-1] / d [1],
    equal = abs (d2 - (d [-1] / d [1])) < 1e-6
)
#>                  prop_only     post_cal equal
#> living_street 0.0015113115 0.0015113115  TRUE
#> path          0.4387094473 0.4387094473  TRUE
#> primary       0.1049314313 0.1049314313  TRUE
#> residential   0.0457802161 0.0457802161  TRUE
#> secondary     0.1179474264 0.1179474264  TRUE
#> service       0.0505999550 0.0505999550  TRUE
#> steps         0.0003041547 0.0003041547  TRUE
#> track         0.1379512328 0.1379512328  TRUE
#> unclassified  0.1022648249 0.1022648249  TRUE

Created on 2023-05-08 with reprex v2.0.2

Thank you very much @xiaofanliang for uncovering this bug. Let me know whether everything is also fixed for you now. I hope so :rocket: :+1:

xiaofanliang commented 1 year ago

Perfect! Thanks for the speedy updates @mpadge. It works great on both the default dataset and my dataset.