Closed mem48 closed 7 months ago
The whole function I do not fully understand is
constrain_production = function(od, output_col, constraint_production) {
# todo: should the grouping var (the first column, 1) be an argument?
od_grouped = dplyr::group_by_at(od, 1)
od_grouped = dplyr::mutate(
od_grouped,
{{output_col}} := .data[[output_col]] /
sum(.data[[output_col]]) * mean( {{constraint_production}} )
)
od = dplyr::ungroup(od_grouped)
od
}
But is it simply updating a single column of od output_col
based on three variables col1
output_col
and constraint_production
?
In which case a function of the form
od[[output_col]] <- newfunc(od[[1]], od[[output_col]], od[[constraint_production]])
Should be much more efficient
Also when I run the example code in the README od_res
gains two new columns interaction
and output_col
The documentation suggests that interaction
is the expected result but output_col
seems to be the result of constrain_production ()
For example, if I modify si_calculate
to output the results before and after calling constrain_production
si_calculate = function(
od,
fun,
constraint_production,
constraint_attraction,
constraint_total,
output_col = "interaction",
...
) {
dots = rlang::enquos(...)
od = dplyr::mutate(od, {{output_col}} := fun(!!!dots))
if (!missing(constraint_production)) {
od2 = constrain_production(od, output_col, {{constraint_production}})
}
if (!missing(constraint_attraction)) {
od2 = constrain_attraction(od, output_col, {{constraint_attraction}})
}
if (!missing(constraint_total)) {
od2 = constrain_total(od, output_col, constraint_total)
}
return(list(od, od2))
}
And run
od_res2 = si_calculate(
od,
fun = gravity_model,
d = distance_euclidean,
m = origin_all,
n = destination_all,
constraint_production = origin_all,
beta = 0.9
)
I get
head(od_res2[[1]]$interaction)
[1] 7890481.0 267413.9 267413.9 5697769.0 124036.4 6105841.0
head(od_res2[[2]]$interaction)
[1] 7890481.0 267413.9 267413.9 5697769.0 124036.4 6105841.0
head(od_res2[[2]]$output_col)
[1] 2716.92163 92.07837 104.82740 2233.54982 48.62278 1715.26255
identical(od_res2[[1]]$interaction, od_res2[[2]]$interaction)
[1] TRUE
I tried this
constrain_production2 = function(grp, out, conts) {
dt = data.table::data.table(grp = grp, out = out, conts = conts)
dt = dt[,out := out / sum(out) * mean(conts), grp]
return(dt$out)
}
And got
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 r1 13.01ms 13.61ms 70.0 966KB 4.38 32 2 457ms <NULL> <Rprofmem [784 × 3]> <bench_tm> <tibble>
2 r2 4.25ms 4.71ms 169. 751KB 2.09 81 1 478ms <NULL> <Rprofmem [285 × 3]> <bench_tm> <tibble>
where r1
is the current method and r2
is the new function.
That's only a 2x speedup right @mem48 ? The main bottleneck is that we don't need to calculate the full matrix. Think this is a fairly quick fix that will improve the package though... May need work upstream in od
.
Heads up @Robinlovelace I'm looking at porting the NPT shopping code to the PBCC for whole GB.I remember the @joeytalbot said
simodles
can't handle data zones and so had to use intermediate zones for NTP (See https://github.com/nptscot/npt/blob/main/R/utility_trips.R)I'm not an expert on SI models, but a quick look at https://github.com/Robinlovelace/simodels/blob/main/R/si_model.R make it look like the whole sf data frame is being pushed through a
group_by_at %>% mutate %>% ungroup
pipeline just to perform a relatively simple calculation?Moving all that extra data around would add a lot of overhead and I would assume that could be avoided with a simple select?