cyclestreets / cyclestreets-r

An R interface to cyclestreets.net APIs
https://rpackage.cyclestreets.net/
GNU General Public License v3.0

Speed-up json2sf_cs #69

Closed Robinlovelace closed 1 year ago

Robinlovelace commented 1 year ago

As outlined in https://github.com/cyclestreets/cyclestreets-r/issues/65 this function is the major perf bottleneck at this stage.

Robinlovelace commented 1 year ago

Not much in it for txt2coords():

txt2coords = function(txt) {
  # helper function to document...
  coords_split = stringr::str_split(txt, pattern = " |,")[[1]]
  matrix(as.numeric(coords_split),
         ncol = 2,
         byrow = TRUE)
}
txt2coords2 = function(txt) {
  if(is.na(txt)){
    return(NULL)
  }
  coords_split = stringr::str_split(txt, pattern = " |,")[[1]]
  coords_split = matrix(as.numeric(coords_split),
                        ncol = 2,
                        byrow = TRUE)
  sf::st_linestring(coords_split)
}
txt2coords3 = function(txt) {
  # helper function to document...
  coords_split = stringi::stri_split(txt, regex = " |,")
  matrix(as.numeric(coords_split[[1]]), ncol = 2, byrow = TRUE)
}

f = system.file(package = "cyclestreets", "extdata/journey.json")
obj = jsonlite::read_json(f, simplifyVector = TRUE)
txt = obj$marker$`@attributes`$points[2]
c1 = txt2coords(txt)
c2 = txt2coords2(txt)
c3 = txt2coords3(txt)
waldo::compare(c1, c2)
#> `old` is a double vector (-1.54408, -1.54399, -1.54336, -1.54331, -1.54329, ...)
#> `new` is an S3 object of class <XY/LINESTRING/sfg>, a double vector
waldo::compare(c1, c3)
#> ✔ No differences
bench::mark(check = FALSE,
  c1 = txt2coords(txt),
  c2 = txt2coords2(txt),
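  c3 = txt2coords2(txt) # NB: calls txt2coords2(), not txt2coords3(), so the c3 row below does not time the stringi version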
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 c1           11.2µs   12.3µs    75749.      42KB     37.9
#> 2 c2           17.2µs     19µs    49033.    29.1KB     39.3
#> 3 c3           17.2µs   19.2µs    48149.      264B     43.4

Created on 2023-07-08 with reprex v2.0.2
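
For anyone following along without the package data, here is a minimal base R sketch of the same parsing step. The input string is hypothetical (made-up coordinates in what appears to be a "lon,lat lon,lat" layout), and `txt2coords_base()` is an illustrative name, not a function in the package:

```r
# Hypothetical input, assuming space-separated "lon,lat" pairs
txt = "-1.54408,53.79360 -1.54399,53.79354 -1.54336,53.79314"

txt2coords_base = function(txt) {
  # Split on spaces and commas, as txt2coords() does, but with base R
  coords_split = strsplit(txt, split = " |,")[[1]]
  # Fill row by row so that each row holds one coordinate pair
  matrix(as.numeric(coords_split), ncol = 2, byrow = TRUE)
}

m = txt2coords_base(txt)
dim(m)
```

Whether this is any faster than the stringr version would need benchmarking on real routes; it only removes the dependency.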

Robinlovelace commented 1 year ago

These are the culprits it seems:

image

f = system.file(package = "cyclestreets", "extdata/journey.json")
obj = jsonlite::read_json(f, simplifyVector = TRUE)
rsf = json2sf_cs(obj, cols = c("distances"))
bench::mark(
  test = json2sf_cs(obj, cols = c("distances"))
)
# 90 itr/sec # Around 30 itr/sec for typical commute routes
f = function() {
  f = system.file(package = "cyclestreets", "extdata/journey.json")
  obj = jsonlite::read_json(f, simplifyVector = TRUE)
  json2sf_cs(obj, cols = c("distances"))
}
profvis::profvis(f())

Robinlovelace commented 1 year ago

The relevant bit:

bench::mark(data.frame(vals_constant)[rep(1, n_segs),])
# A tibble: 1 × 13
  expression                             min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
  <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
1 data.frame(vals_constant)[rep(1, n_… 1.5ms 1.54ms      619.    5.81KB     22.2   251     9      405ms <df>   <Rprofmem> <bench_tm> <tibble>
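
One possible way to avoid the row-indexing cost, sketched under assumptions: `vals_constant` is taken here to be a named list of length-1 values and `n_segs` the number of route segments (both are hypothetical stand-ins below). This is not necessarily what the package ended up doing:

```r
# Hypothetical stand-ins for the objects used inside json2sf_cs()
vals_constant = list(plan = "balanced", length = 1234, time = 567)
n_segs = 5

# Approach benchmarked above: build a 1-row data frame, then repeat that row by index
df_indexed = data.frame(vals_constant)[rep(1, n_segs), ]

# Alternative: repeat each value first, then build the data frame once
df_repeated = as.data.frame(lapply(vals_constant, rep, times = n_segs))

# Row indexing leaves row names like "1", "1.1", ...; reset them before comparing
rownames(df_indexed) = NULL
all.equal(df_indexed, df_repeated)
```

Any gain should be checked with bench::mark() on real route data before relying on it.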

Robinlovelace commented 1 year ago

Now around 1.6x faster (147 itr/sec), was getting 90 itr/sec before:


# A tibble: 1 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory              time            gc      
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>       <list>              <list>          <list>  
1 test         5.14ms   5.39ms      147.    28.5KB     4.21    70     2      475ms <sf [5 × 6]> <Rprofmem [52 × 3]> <bench_tm [72]> <tibble>

Robinlovelace commented 1 year ago

bench::mark(check = FALSE,
  original = cyclestreets::json2sf_cs(obj, cols = c("distances")),
  new = json2sf_cs2(obj, cols = c("distances"))
)
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                 time            gc      
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                 <list>          <list>  
1 original     9.05ms    9.4ms      106.    1.06MB     9.04    47     4      443ms <NULL> <Rprofmem [689 × 3]>   <bench_tm [51]> <tibble>
2 new           4.9ms   5.09ms      197.    1.19MB     6.34    93     3      473ms <NULL> <Rprofmem [2,503 × 3]> <bench_tm [96]> <tibble>

Robinlovelace commented 1 year ago

It's 3x faster after the commit above.

Robinlovelace commented 1 year ago

Demo:

f = system.file(package = "cyclestreets", "extdata/journey.json")
obj = jsonlite::read_json(f, simplifyVector = TRUE)
rsf = json2sf_cs(obj, cols = c("distances"))
bench::mark(
  test = json2sf_cs(obj, cols = c("distances"))
)
# 90 itr/sec # Around 30 itr/sec for typical commute routes
f = function() {
  f = system.file(package = "cyclestreets", "extdata/journey.json")
  obj = jsonlite::read_json(f, simplifyVector = TRUE)
  json2sf_cs(obj, cols = c("distances"))
}
profvis::profvis(f())

Robinlovelace commented 1 year ago

# A tibble: 1 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory              time             gc      
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>       <list>              <list>           <list>  
1 test         3.33ms   3.62ms      259.    28.2KB     6.52   119     3      460ms <sf [5 × 4]> <Rprofmem [53 × 3]> <bench_tm [122]> <tibble>
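
For reference, the speed-ups implied by the itr/sec figures reported in this thread (90 at the start, 147 after the first change, 259 in the run above) can be worked out directly. A small sketch using only those reported numbers; the vector names are just labels chosen for illustration:

```r
# itr/sec values as reported in the comments in this thread
itr_per_sec = c(before = 90, first_change = 147, latest = 259)

# Speed-up of each run relative to the starting point
speedup = round(itr_per_sec / itr_per_sec[["before"]], 2)
speedup
```

The last value is what the "3x faster" claim rounds from.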

Robinlovelace commented 1 year ago

Confirmed on 7 MB file:

profvis::profvis(batch_read("test-data-7mb.csv"))
Reading in the following file:
test-data-7mb.csv
Rows: 334 Columns: 7
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): start_id, end_id, strategy, json
dbl (3): distance, time_seconds, calories

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Reading route data
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Converting json values to linestrings
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s