Closed: Robinlovelace closed this issue 2 years ago.
Proposed input data for destinations below, open data from OSM obtained using reproducible script.
Any comments at this early stage welcome @dabreegster.
See #8
Based on reproducible example shown below
I understand, thanks! (Earlier on Slack, I was confused, but you're basically trying to specify different subpoints and weights for origins and destinations.) The CLI, example data, and format all look reasonable -- I'll see if I can implement this tomorrow.
A few questions based on the new dataset...
1) In your first command, I assume not specifying origin subpoints means to randomly pick points?
2) `od_schools.csv` has per-mode columns, but not an `all` column like we've been doing. I can make that field optional and just sum everything up, at the cost of a little more complexity in odjitter. I'm not a huge fan of accepting a variety of input formats, because then you have to clearly document all of them, but this seems like a small case to support. Any thoughts? If people providing the input can easily sum the column on their end, I'd prefer that. (I could run your repro and add the column if so, as some R practice!)
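Summing the per-mode columns caller-side is a one-liner in any data language. A minimal Python sketch (the header names here are assumptions; only the row values come from the example dataset quoted later in this thread):

```python
import csv
import io

# Assumed header names -- the real od_schools.csv may differ.
raw = """geo_code1,geo_code2,walk,bike,other,car
S02001616,S02001616,232,8,70,0
"""

modes = ["walk", "bike", "other", "car"]
rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    # Sum the per-mode counts into an explicit 'all' column.
    row["all"] = str(sum(int(row[m]) for m in modes))

print(rows[0]["all"])  # 310
```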
A naive approach that I think should at least produce an output (but errors when I try it) is as follows:
That's because of a `TODO` in the code -- we only scrape subpoints from line-string input. I'll add support for geojson containing points. (And we should support all geojson types.)
Started https://github.com/dabreegster/odjitter/tree/weighted_subp. A few things left to do. First I want to agree on the interface, though. I think I have a strong preference for having a simple API / surface area, instead of taking a bunch of variations of input. Especially if the caller is using Python or R or some other language where it's trivial to do data transformation, I think doing the "reshaping" work there is simpler.
So specifically:
1) Separate `--subpoints-origins-path` and `--subpoints-destinations-path` flags. If the caller wants to use the same subpoints for both, they repeat the path -- no extra flags or handling for that. There could be cases where scraping and storing the subpoints once for both groups could be advantageous performance-wise, but I'll add that optimization only when it's really needed.
2) The CSV input requires some column for "all trips." For the test dataset you added, I propose we add the column there and always require it as input.
Thanks for the questions, on the case, replies coming up...
First, re the later replies:

> Separate `--subpoints-origins-path` and `--subpoints-destinations-path` flags. If the caller wants to use the same subpoints for both, they repeat the path -- no extra flags or handling for that. There could be cases where scraping and storing the subpoints once for both groups could be advantageous performance-wise, but I'll add that optimization only when it's really needed.
Fully agreed, more specific is better and bindings can do the shortcuts.
> 2. The CSV input requires some column for "all trips." For the test dataset you added, I propose we add the column there and always require it as input.
Yes, that was deliberate, but instead of changing the data I suggest setting the 'all' one to the number of cars, for example. I will change the description of the CLI call above, setting `all-key`. Many OD datasets lack an `all` column, so it's good to have test data that covers those cases in the wild.
> I suggest setting the 'all' one to number of cars

The first row is `S02001616,S02001616,232,8,70,0` -- cars is 0, walk 232, bike 8, other 70. Would that make sense there?
I'm sure many datasets don't have an `all` column, but I'm proposing we require it for input into this tool. I don't think odjitter should be robust to a bunch of different types of input, when that data cleanup / transformation is going to happen anyway in a language better suited for it.
> I'm sure many datasets don't have an `all` column, but I'm proposing we require it for input into this tool. I don't think odjitter should be robust to a bunch of different types of input, when that data cleanup / transformation is going to happen anyway in a language better suited for it.
Fine by me. In cases where the user is interested in bike trips, for example, and `--all-key` is set to `bike`, there would be no need to have an `all` column in the CSV dataset, right?
I guess not. But all the numeric columns will be scaled by `bike / max_per_od`, and that might be nonsensical.
So for the unit test of the school data, shall we use the car or bike column? Or add an `all` column?
> I guess not. But all the numeric columns will be scaled by `bike / max_per_od`, and that might be nonsensical.
I don't think so. I think each numeric output would be scaled by the number of disaggregated OD pairs per original OD pair, and that number would be determined by the bike count.
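That scaling could be sketched like this (assumed semantics based on the discussion; `disaggregate` is a hypothetical helper, not odjitter's API):

```python
import math

def disaggregate(row, key="bike", max_per_od=5):
    # The number of sub-OD-pairs is driven by the thresholded column only.
    n_pairs = max(1, math.ceil(row[key] / max_per_od))
    # Every numeric column is divided evenly across the sub-pairs,
    # so totals are preserved rather than scaled by bike / max_per_od.
    return [{k: v / n_pairs for k, v in row.items()} for _ in range(n_pairs)]

row = {"walk": 232, "bike": 8, "other": 70, "car": 0}
sub = disaggregate(row, key="bike", max_per_od=5)
print(len(sub), sub[0]["walk"])  # 2 116.0
```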
> So for the unit test of the school data, shall we use the car or bike column? Or add all?
I vote car, but any should work. It depends which one is the focus. For current work looking at baseline cycle networks, bike would make sense, but alas the flows are vanishingly small in many areas, including in this synthetic test dataset.
So say we set all-key to "bike", and for some desire line, bike has 0, but car and walk have some trips. What should we do -- zero out car and walk too? (This is not hypothetical, it's happening in the test I'm setting up :) )
> So say we set all-key to "bike", and for some desire line, bike has 0, but car and walk have some trips. What should we do -- zero out car and walk too? (This is not hypothetical, it's happening in the test I'm setting up :) )
It's great that the example data has this edge case. In that case there will be no disaggregation: the `max-per-od` argument only kicks in to split up lines into more than 1 sub-line when there are more than that value in the `all-key` column. Maybe that argument should be `threshold-key`, as it's the key that, when values go above a certain threshold value, triggers disaggregation.
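The threshold behaviour described here could be sketched as follows (assumed semantics, not odjitter's implementation): when the thresholded value is at or below `max-per-od`, the row stays whole, so the other modes' counts survive intact rather than being zeroed.

```python
import math

def split_count(value, max_per_od):
    # At or below the threshold: no disaggregation, keep the row whole.
    if value <= max_per_od:
        return 1
    return math.ceil(value / max_per_od)

# bike == 0 for this desire line: one sub-line, car/walk untouched.
print(split_count(0, 10))   # 1
print(split_count(25, 10))  # 3
```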
Done, and with weights, in #14
Another sanity check just implemented:
Thanks @dabreegster, I go to bed happy : )
In many OD datasets the locations of destinations (e.g. workplaces, shops, schools) are different from the locations of the origins (e.g. residential buildings). Some destinations attract more trips than others, so weighting values are probably also needed.
Based on input data in #8, I imagine this could work something like this:
A naive approach that I think should at least produce an output (but errors when I try it) is as follows:
Illustration of what the output could look like (with `--max-per-od 1000` in this case):