For this model, I suggest using the new development OAs we have identified to extract the proportion of people who walk/cycle, but using the LSOAs they sit within for the travel to work routing. Does that sound workable?
That sounds like a good plan, although there is OA-WPZ data available: https://www.nomisweb.co.uk/census/2011/wf02ew
I think the amount of routing that would require is too much. Is there a simple way of looking up which LSOA each OA is in?
Good point, and with the LSOA - LSOA level analysis we can use the PCT routes as a quick proxy. I suggest:
sf::st_join(oa, lsoa)
To add LSOA variables to OA data.
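A minimal sketch of that join (assuming `oa` and `lsoa` are sf objects already read in, and that the LSOA code column is called `LSOA11CD` - hypothetical names):

```r
library(sf)
# Join on OA centroids rather than polygons to avoid ambiguous matches where an
# OA boundary touches more than one LSOA
oa_centroids = sf::st_centroid(oa)
oa_lsoa = sf::st_join(oa_centroids, lsoa["LSOA11CD"])
```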
There are OA to LSOA lookup tables on the Open Geography Portal.
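A sketch of that non-spatial alternative, assuming the lookup CSV has been downloaded from the portal with the usual `OA11CD` and `LSOA11CD` columns and that the OA layer also has an `OA11CD` column (file name hypothetical):

```r
library(dplyr)
# Attach the LSOA code to each OA via the downloaded lookup table
lookup = readr::read_csv("oa_lsoa_lookup.csv")
oa_lsoa = left_join(oa, lookup[c("OA11CD", "LSOA11CD")], by = "OA11CD")
```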
Another dataset we can use: https://data.london.gov.uk/dataset/property-build-period-lsoa
This dataset looks amazing! @Robinlovelace there really are data on age of housing stock for LSOAs across the country, not just in London. For London itself there's also data comparing housing stock age to council tax band, and data on dwelling type and number of bedrooms.
Great data exploring @joeytalbot. I plan to start a script that will read them in; who knows, it could eventually become an ACTON function!
It certainly seems like it would make an appropriate function!
Heads-up @joeytalbot I've added a script that gets the data and deals with the pesky commas. Please take a look and work from that. Downing tools on this project for the day :hammer: .
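For anyone following along, the comma-cleaning step looks roughly like this (a sketch, not the script itself; `dwellings_raw` and the column positions are assumptions):

```r
# Counts in the raw CSV come through as character strings like "1,234";
# strip the thousands separators and convert to numeric, leaving the LSOA
# code column (assumed to be column 1) untouched
clean_commas = function(x) as.numeric(gsub(",", "", x))
dwellings = dwellings_raw
dwellings[-1] = lapply(dwellings_raw[-1], clean_commas)
```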
Ok, so with that data there are 84 LSOAs across England and Wales containing >90% of homes built in the years 2000-09. These could go straight into a model (excluding 20% of them for validation purposes), or I could look at a wider selection of LSOAs (e.g. there are 326 with >60% of homes built 2000-09) and then pick out the OAs within them that relate to the new developments - although if the remaining homes were mostly built during the 1990s this is probably still fine.
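A sketch of that selection step, assuming the cleaned table has (hypothetical) columns `total_dwellings` and `built_2000_09`:

```r
library(dplyr)
lsoa_new = dwellings %>%
  mutate(prop_2000_09 = built_2000_09 / total_dwellings) %>%
  filter(prop_2000_09 > 0.9)  # 84 LSOAs; relax to > 0.6 for the wider set of 326
# Hold back ~20% of these for validation
set.seed(2020)
lsoa_validation = sample_frac(lsoa_new, 0.2)
```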
The next step is to get OD data - what's the best source for nationwide OD data, without having to download each region separately?
MSOA data:
od_data = pct::get_od()
#> No region provided. Returning national OD data.
#> Parsed with column specification:
#> cols(
#> `Area of residence` = col_character(),
#> `Area of workplace` = col_character(),
#> `All categories: Method of travel to work` = col_double(),
#> `Work mainly at or from home` = col_double(),
#> `Underground, metro, light rail, tram` = col_double(),
#> Train = col_double(),
#> `Bus, minibus or coach` = col_double(),
#> Taxi = col_double(),
#> `Motorcycle, scooter or moped` = col_double(),
#> `Driving a car or van` = col_double(),
#> `Passenger in a car or van` = col_double(),
#> Bicycle = col_double(),
#> `On foot` = col_double(),
#> `Other method of travel to work` = col_double()
#> )
#> Parsed with column specification:
#> cols(
#> MSOA11CD = col_character(),
#> MSOA11NM = col_character(),
#> BNGEAST = col_double(),
#> BNGNORTH = col_double(),
#> LONGITUDE = col_double(),
#> LATITUDE = col_double()
#> )
od_data
#> # A tibble: 2,402,201 x 18
#> geo_code1 geo_code2 all from_home light_rail train bus taxi motorbike
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 E02000001 E02000001 1506 0 73 41 32 9 1
#> 2 E02000001 E02000014 2 0 2 0 0 0 0
#> 3 E02000001 E02000016 3 0 1 0 2 0 0
#> 4 E02000001 E02000025 1 0 0 1 0 0 0
#> 5 E02000001 E02000028 1 0 0 0 0 0 0
#> 6 E02000001 E02000051 1 0 1 0 0 0 0
#> 7 E02000001 E02000053 2 0 2 0 0 0 0
#> 8 E02000001 E02000057 1 0 1 0 0 0 0
#> 9 E02000001 E02000058 1 0 0 0 0 0 0
#> 10 E02000001 E02000059 1 0 0 0 0 1 0
#> # … with 2,402,191 more rows, and 9 more variables: car_driver <dbl>,
#> # car_passenger <dbl>, bicycle <dbl>, foot <dbl>, other <dbl>,
#> # geo_name1 <chr>, geo_name2 <chr>, la_1 <chr>, la_2 <chr>
Created on 2020-04-03 by the reprex package (v0.3.0)
LSOA data:
library(sf)
u = "https://github.com/npct/pct-outputs-national/raw/master/commute/lsoa/l_all.Rds"
desire_lines_lsoa_national_sp = readRDS(url(u))
l = sf::st_as_sf(desire_lines_lsoa_national_sp)
plot(l[sample(nrow(l), size = 10000), ])
Output:
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
u = "https://github.com/npct/pct-outputs-national/raw/master/commute/lsoa/l_all.Rds"
desire_lines_lsoa_national_sp = readRDS(url(u))
l = sf::st_as_sf(desire_lines_lsoa_national_sp)
#> Loading required package: sp
#> Warning in CPL_crs_from_input(x): GDAL Message 1: +init=epsg:XXXX syntax is
#> deprecated. It might return a CRS with a non-EPSG compliant axis order.
plot(l[sample(nrow(l), size = 10000), ])
#> Warning: plotting the first 10 out of 141 attributes; use max.plot = 141 to plot
#> all
Created on 2020-04-03 by the reprex package (v0.3.0)
FYI that process uses around 4 GB of RAM.
There seem to be 120 `geocode2` values from `od_lsoas_short` that didn't make it into `od_lsoas_short_routes`. I'm trying to work out why - these are normal geocodes within the UK. Given that all `geocode1` values from `od_lsoas_short` appear in `od_lsoas_short_routes`, it doesn't seem like the problem could be centroids that aren't on the road network.
> missing = od_lsoas_short[! od_lsoas_short$geocode2 %in% od_lsoas_short_routes$geocode2,]
There were 34 warnings (use warnings() to see them)
> dim(missing) # there are 1066 rows in od_lsoas_short where the geocode2 didn't make it into od_lsoas_short_routes. These are normal geocodes, within the UK
[1] 1066 16
> unique(od_lsoas_short$geocode2[! od_lsoas_short$geocode2 %in% od_lsoas_short_routes$geocode2])
[1] "E01026997" "E01026998" "E01026999" "E01027000" "E01027002" "E01027003" "E01027004" "E01027005" "E01027006" "E01027007" "E01027009"
[12] "E01027010" "E01027011" "E01027013" "E01027014" "E01027016" "E01027017" "E01027018" "E01027019" "E01027021" "E01027022" "E01027024"
[23] "E01027025" "E01027026" "E01027027" "E01027028" "E01027029" "E01027030" "E01027031" "E01027032" "E01027033" "E01027034" "E01027038"
[34] "E01027039" "E01027041" "E01027043" "E01027044" "E01027046" "E01027048" "E01027049" "E01027050" "E01027051" "E01027052" "E01027053"
[45] "E01027054" "E01027055" "E01027056" "E01027057" "E01027059" "E01027060" "E01027062" "E01027063" "E01027064" "E01027065" "E01027067"
[56] "E01027069" "E01027073" "E01027074" "E01027075" "E01027076" "E01027077" "E01027078" "E01027079" "E01027082" "E01027084" "E01027085"
[67] "E01027087" "E01027089" "E01027090" "E01027091" "E01027092" "E01027094" "E01027097" "E01027098" "E01027099" "E01027100" "E01027101"
[78] "E01031363" "E01031364" "E01031365" "E01031366" "E01031372" "E01031375" "E01031377" "E01031379" "E01031381" "E01031385" "E01031386"
[89] "E01031388" "E01031389" "E01031390" "E01031391" "E01031392" "E01031393" "E01031394" "E01031395" "E01031396" "E01031397" "E01031398"
[100] "E01031400" "E01031402" "E01031403" "E01031409" "E01031411" "E01031413" "E01031415" "E01031417" "E01031418" "E01031419" "E01031421"
[111] "E01031424" "E01031425" "E01031426" "E01031427" "E01031429" "E01031430" "E01031431" "E01031432" "E01031434" "E01031435"
> unique(od_lsoas_short$geocode1[! od_lsoas_short$geocode1 %in% od_lsoas_short_routes$geocode1])
character(0)
Interesting finding. Have you tried plotting the geocode1 locations associated with the missing codes?
Definitely not a random distribution!
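A sketch of how such a map could be reproduced, assuming `lsoa_centroids` is an sf layer of LSOA centroids with an `LSOA11CD` column (names assumed), plus `missing` from the console output above:

```r
library(sf)
# Highlight the LSOA centroids whose codes are missing from the routed output
missing_centroids = lsoa_centroids[lsoa_centroids$LSOA11CD %in% missing$geocode2, ]
plot(sf::st_geometry(lsoa_centroids), col = "grey", pch = ".")
plot(sf::st_geometry(missing_centroids), col = "red", pch = 19, add = TRUE)
```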
The missing routes are just in two areas (Northamptonshire and West Sussex). They also seem to have (two sets of) consecutive row numbers.
> summary(missing$rownum) # are they particular rows ie a section of od_lsoas_short?
Min. 1st Qu. Median Mean 3rd Qu. Max.
218266 218532 218798 232867 257128 257394
Maybe worth looking at the associated chunks here: https://github.com/cyipt/acton/releases/download/0.0.1/od_lsoa_routes_20020-04-07-chunks-of-1000-rows.zip
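To narrow down which of those 1000-row chunks to look at, something like this should work (assuming the chunking follows the row order of `od_lsoas_short`):

```r
# Which 1000-row chunks do the missing rows fall into?
failed_chunks = unique((missing$rownum - 1) %/% 1000 + 1)
failed_chunks
```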
Heads-up @joeytalbot I've been thinking about gradients and I'm not sure our calculations were correct. It's worth doing some sense checking on them I think. To help understand the issue, and to encode the hopefully-correct way of calculating gradients, with different levels of smoothing to account for the impact of anomalous short segments, I've written some functions into `stplanr` to support this and ensure reproducibility and ease of use in the future.
Please take a look at these results for a single sample route. It may be worth checking the results of `route_rolling_gradient()` with different values of `n`. I suggest 5 is good and will smooth out the vast majority of anomalies, but we could try other values too; the default is 2, meaning it takes the average of each segment and the next one, which doesn't seem to smooth things out that much:
remotes::install_github("ropensci/stplanr")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'stplanr' from a github remote, the SHA1 (46e69e22) has not changed since last install.
#> Use `force = TRUE` to force installation
library(stplanr)
r1 <- od_data_routes[od_data_routes$route_number == 2, ]
y <- r1$elevations
distances <- r1$distances
route_rolling_gradient(y, distances)
#> [1] 0.009112333 0.029220779 0.106132075 0.036435786 0.047012302 0.048840049
#> [7] 0.017539244 NA
route_rolling_gradient(y, distances, abs = FALSE)
#> [1] -0.009112333 -0.029220779 -0.106132075 0.036435786 0.047012302
#> [6] -0.048840049 -0.017539244 NA
route_rolling_gradient(y, distances, n = 3)
#> [1] NA 0.02191558 0.14966741 0.03888604 0.05635534 0.01285072 0.02380952
#> [8] NA
route_rolling_gradient(y, distances, n = 4)
#> [1] NA 0.02371542 0.08991009 0.05085599 0.06604938 0.01523810 NA
#> [8] NA
r1$elevations_diff_1 <- route_rolling_diff(y, lag = 1)
r1$rolling_gradient <- route_rolling_gradient(y, distances, n = 2)
r1$rolling_gradient3 <- route_rolling_gradient(y, distances, n = 3)
r1$rolling_gradient4 <- route_rolling_gradient(y, distances, n = 4)
d <- cumsum(r1$distances) - r1$distances / 2
diff_above_mean <- r1$elevations_diff_1 + mean(y)
par(mfrow = c(2, 1))
plot(c(0, cumsum(r1$distances)), c(y, y[length(y)]), ylim = c(80, 130))
lines(c(0, cumsum(r1$distances)), c(y, y[length(y)]))
points(d, diff_above_mean )
abline(h = mean(y))
rg = r1$rolling_gradient
rg[is.na(rg)] <- 0
plot(c(0, d), c(0, rg), ylim = c(0, 0.2))
points(c(0, d), c(0, r1$rolling_gradient3), col = "blue")
points(c(0, d), c(0, r1$rolling_gradient4), col = "grey")
par(mfrow = c(1, 1))
y = od_data_routes$elevations[od_data_routes$route_number == 2]
y
#> [1] 112.5263 111.0000 106.5000 84.0000 96.6250 110.0000 107.1429 105.5556
route_rolling_average(y)
#> [1] NA 110.00877 100.50000 95.70833 96.87500 104.58929 107.56614
#> [8] NA
route_rolling_average(y, n = 1)
#> [1] 112.5263 111.0000 106.5000 84.0000 96.6250 110.0000 107.1429 105.5556
route_rolling_average(y, n = 2)
#> [1] 111.7632 108.7500 95.2500 90.3125 103.3125 108.5714 106.3492 NA
route_rolling_average(y, n = 3)
#> [1] NA 110.00877 100.50000 95.70833 96.87500 104.58929 107.56614
#> [8] NA
r1 <- od_data_routes[od_data_routes$route_number == 2, ]
elevations <- r1$elevations
distances <- r1$distances
route_average_gradient(elevations, distances) # an average gradient of roughly 4%
#> [1] 0.03907936
Created on 2020-04-16 by the reprex package (v0.3.0)
Heads-up @temospena I think this will be useful for the Lisbon work!
Heads-up @joeytalbot this is the reprex I was trying to do before:
library(stplanr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
l = pct::get_pct_lines("west-yorkshire")
l_top = l %>%
top_n(n = 10, bicycle)
r = route(l = l_top, route_fun = cyclestreets::journey)
#> Most common output is sf
#> Loading required namespace: data.table
names(r)
#> [1] "id" "geo_code1"
#> [3] "geo_code2" "geo_name1"
#> [5] "geo_name2" "lad11cd1"
#> [7] "lad11cd2" "lad_name1"
#> [9] "lad_name2" "all"
#> [11] "bicycle" "foot"
#> [13] "car_driver" "car_passenger"
#> [15] "motorbike" "train_tube"
#> [17] "bus" "taxi_other"
#> [19] "govtarget_slc" "govtarget_sic"
#> [21] "govtarget_slw" "govtarget_siw"
#> [23] "govtarget_sld" "govtarget_sid"
#> [25] "govtarget_slp" "govtarget_sip"
#> [27] "govtarget_slm" "govtarget_sim"
#> [29] "govtarget_slpt" "govtarget_sipt"
#> [31] "govnearmkt_slc" "govnearmkt_sic"
#> [33] "govnearmkt_slw" "govnearmkt_siw"
#> [35] "govnearmkt_sld" "govnearmkt_sid"
#> [37] "govnearmkt_slp" "govnearmkt_sip"
#> [39] "govnearmkt_slm" "govnearmkt_sim"
#> [41] "govnearmkt_slpt" "govnearmkt_sipt"
#> [43] "gendereq_slc" "gendereq_sic"
#> [45] "gendereq_slw" "gendereq_siw"
#> [47] "gendereq_sld" "gendereq_sid"
#> [49] "gendereq_slp" "gendereq_sip"
#> [51] "gendereq_slm" "gendereq_sim"
#> [53] "gendereq_slpt" "gendereq_sipt"
#> [55] "dutch_slc" "dutch_sic"
#> [57] "dutch_slw" "dutch_siw"
#> [59] "dutch_sld" "dutch_sid"
#> [61] "dutch_slp" "dutch_sip"
#> [63] "dutch_slm" "dutch_sim"
#> [65] "dutch_slpt" "dutch_sipt"
#> [67] "ebike_slc" "ebike_sic"
#> [69] "ebike_slw" "ebike_siw"
#> [71] "ebike_sld" "ebike_sid"
#> [73] "ebike_slp" "ebike_sip"
#> [75] "ebike_slm" "ebike_sim"
#> [77] "ebike_slpt" "ebike_sipt"
#> [79] "base_slcyclehours" "govtarget_sicyclehours"
#> [81] "govnearmkt_sicyclehours" "gendereq_sicyclehours"
#> [83] "dutch_sicyclehours" "ebike_sicyclehours"
#> [85] "base_sldeath" "base_slyll"
#> [87] "base_slvalueyll" "base_slsickdays"
#> [89] "base_slvaluesick" "base_slvaluecomb"
#> [91] "govtarget_sideath" "govtarget_siyll"
#> [93] "govtarget_sivalueyll" "govtarget_sisickdays"
#> [95] "govtarget_sivaluesick" "govtarget_sivaluecomb"
#> [97] "govnearmkt_sideath" "govnearmkt_siyll"
#> [99] "govnearmkt_sivalueyll" "govnearmkt_sisickdays"
#> [101] "govnearmkt_sivaluesick" "govnearmkt_sivaluecomb"
#> [103] "gendereq_sideath" "gendereq_siyll"
#> [105] "gendereq_sivalueyll" "gendereq_sisickdays"
#> [107] "gendereq_sivaluesick" "gendereq_sivaluecomb"
#> [109] "dutch_sideath" "dutch_siyll"
#> [111] "dutch_sivalueyll" "dutch_sisickdays"
#> [113] "dutch_sivaluesick" "dutch_sivaluecomb"
#> [115] "ebike_sideath" "ebike_siyll"
#> [117] "ebike_sivalueyll" "ebike_sisickdays"
#> [119] "ebike_sivaluesick" "ebike_sivaluecomb"
#> [121] "base_slcarkm" "base_slco2"
#> [123] "govtarget_sicarkm" "govtarget_sico2"
#> [125] "govnearmkt_sicarkm" "govnearmkt_sico2"
#> [127] "gendereq_sicarkm" "gendereq_sico2"
#> [129] "dutch_sicarkm" "dutch_sico2"
#> [131] "ebike_sicarkm" "ebike_sico2"
#> [133] "e_dist_km" "rf_dist_km"
#> [135] "rq_dist_km" "dist_rf_e"
#> [137] "dist_rq_rf" "rf_avslope_perc"
#> [139] "rq_avslope_perc" "rf_time_min"
#> [141] "rq_time_min" "route_number"
#> [143] "name" "distances"
#> [145] "time" "busynance"
#> [147] "elevations" "start_longitude"
#> [149] "start_latitude" "finish_longitude"
#> [151] "finish_latitude" "gradient_segment"
#> [153] "provisionName" "quietness"
#> [155] "geometry"
Created on 2020-04-21 by the reprex package (v0.3.0)
With something like: