Testing network joining functions

Robin Lovelace

Introduction

Joining data is key to data science, allowing value to be added to disparate datasets by combining them.

There are various types of join, including based on shared ‘key’ values and shared space for spatial joins. However, neither of these join types works for joining network data of the type shown below, which represents 2 separate networks with different but related geometries (source: ATIP browse tool).

Imagine you want to know what kind of cycle infrastructure is associated with each segment of the MRN. That’s the kind of problem that network joins can tackle.

This guide outlines the challenges of network joining and demonstrates implementation-agnostic solutions.

It’s based on previous work:

The networkmerge project (and related parenx Python package available on pip): https://nptscot.github.io/networkmerge/
An approach in JavaScript at https://github.com/acteng/amat/tree/main/pct_lcwip_join, described at https://github.com/acteng/amat/blob/main/js/model.md#pct-join
The rnetmatch approach, which has been implemented in Rust with a nascent R wrapper (there are plans for a Python wrapper)
- See text for paper at https://github.com/nptscot/rnetmatch/blob/main/paper.qmd
- See description of the algorithm implemented in Rust here: https://github.com/nptscot/rnetmatch/blob/main/rust/README.md

We’ll use data from the Propensity to Cycle Tool and the OpenRoads dataset as an example.

Example dataset

Datasets were take from a few case study areas.

We’ll focus on Thornbury, West Yorkshire for now.

The following commands will load the datar (available in the data) folder in the repo into R (the equivalent function in Python would be geopandas.read_file):

net_x = sf::read_sf("data/open_roads_thornbury.geojson")
net_y = sf::read_sf("data/pct_thornbury.geojson")

For the purposes of this demo we’ll use projected data, although the functions should work with unprojected data too:

net_x = sf::st_transform(net_x, "EPSG:27700")
net_y = sf::st_transform(net_y, "EPSG:27700")

Subsetting the target ‘x’ network (optional)

A first step, to speed-up the join and reduce the size of the data, can be to keep only the records in the target ‘x’ dataset that are relevant. After this filtering step, the datasets look like this:

Of the four options, the third (with a distance of 20) looks like the best compromise between omitting unwanted links while retaining the majority of the network.

The most appropriate distance depends on your data and use case, it may be worth keeping more of the ‘x’ network than you need and using the join to filter out unwanted links (the subsetting stage is not essential).

Basic spatial join

A simple approach to joining the two networks is with a simple spatial join, using one of the available ‘binary predicates’, such as st_intersects, st_within (relevant for buffers), st_contains or st_touches.

The results when the flow attributes are summed are shown below:

Unfortunately the results are way out: the total flow in the joined network using this basic join approach is less than half (48 %) the total flow in the original network.

The rnet_merge approach

The rnet_join and rnet_merge() functions in {stplanr}were developed to address this issue. Code to undertake the join are shown below.

rnet_joined_1 = stplanr::rnet_join(net_x_subset_20, net_y, dist = 15)
rnet_joined_2 = stplanr::rnet_join(net_x_subset_20, net_y, dist = 15, segment_length = 10)

The results are based on ‘flat headed’ buffers around the x geometry, with results kept in this form in the output, as shown in the table and figure below.

id	flow	length_y
0306E7A4-C705-4776-A357-CA58B64396FA	0	9.832566
0306E7A4-C705-4776-A357-CA58B64396FA	0	9.832566
030E661F-4DAE-470F-807C-52B69BE295B3	3	9.464192

The plot shows big differences in the results of the rnet_join() function depending on the segment_length parameter, which splits y links into segments of the specified length, before doing the join to the buffer.

Let’s aggregate the results and re-join to the geometries of x, this time using the lengths of x and y to ensure more important values are given more weight, and see if the results are more accurate.

The total flow on the networks are 8 % and 97 % of the total flow in the original network, respectively. This shows the importance of the segment_length parameter in the rnet_join() function, which splits the ‘y’ network segments with attributes into segments.

An alternative approach, not yet implemented, would be to split y not at regular intervals but at intersections with x, which would be more accurate but more computationally intensive.

The results above show that the rnet_join() function works well, capturing the majority of the flow in the y network, with the segment_length parameter allowing the user to control the level of detail in the output. Users have full control over the aggregating functions used, which can be useful for different use cases, e.g. to classify the highway type, as shown in the code snippet below, which demonstrates the function’s ability to work in either direction (x and y can be swapped).

net_y = sf::read_sf("data/open_roads_thornbury.geojson") |>
  transmute(road_function = road_function) |>
  sf::st_transform(27700)
net_x = sf::read_sf("data/pct_thornbury.geojson") |>
  transmute(id = 1:n()) |>
  sf::st_transform(27700)
net_x = stplanr::rnet_subset(net_x, net_y, dist = 20)
rnet_joined = stplanr::rnet_join(net_x, net_y, segment_length = 10, dist = 15)
rnet_joined_values = rnet_joined |>
  sf::st_drop_geometry() |>
  group_by(id) |>
  summarise(
    # Most frequent road function:
    road_function = most_common_value(road_function)
    )
rnet_joined_x = left_join(net_x, rnet_joined_values, by = "id")
rnet_joined_x |>
  ggplot() +
  geom_sf(aes(colour = road_function)) +
  theme_void()

acteng / network-join-demos