JosiahParry / rsgeo

R bindings to the geo Rust crate
https://rsgeo.josiahparry.com/
Other
45 stars 5 forks source link

Cool! #23

Open paleolimbot opened 1 year ago

paleolimbot commented 1 year ago

Feel free to close, just sharing some of my experiments. This is very cool! I imagine you don't have much control over the benchmarks since you're mostly just connecting the wires for rust functions, plus, I find that benchmarking GEOS geometries is pretty difficult because putting that many external pointers into a list() makes the garbage collector go a little haywire, which may affect some things here.

Any reason why you didn't prefix the functions? I mostly just typed rsgeo:: to get the autocomplete (but maybe that's what you were aiming for).

It might be nice (and considerably faster) if your rectangle-producing functions produced a wk_rct()? This would get them plot(), st_as_sf(), etc. for free (plus lets you propagate the CRS to whatever consumes the result).

Is the geometry type as part of the class member (e.g., rs_GEOMETRYCOLLECTION) necessary? It seems like it may be more efficient to only resolve the unique geometry types when required rather than loop over the result every time to build up this information?

# https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-water_line-wkb.parquet
lines_tbl_wkb <- arrow::read_parquet(
  "~/Desktop/rscratch/geoarrow-data/ns-water/ns-water-water_line-wkb.parquet"
)

vec_wkb <- wk::wkb(lines_tbl_wkb$geometry)
vec_sf <- sf::st_as_sfc(vec_wkb)
vec_geos <- geos::as_geos_geometry(vec_wkb)
vec_s2 <- s2::s2_geog_from_wkb(vec_wkb)
vec_rsgeo <- rsgeo::as_rsgeo(vec_sf)

bench::mark(
  rsgeo::minimum_rotated_rect(vec_rsgeo),
  geos::geos_minimum_rotated_rectangle(vec_geos),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::minimum_rotated_rect(vec_rsg… 3.08s  3.08s     0.324    3.69MB        0
#> 2 geos::geos_minimum_rotated_rectangl… 5.57s  5.57s     0.180   11.07MB        0

# ...probably because rust is using ellipsoidal and s2 is using spherical (faster)
bench::mark(
  rsgeo::length_geodesic(vec_rsgeo),
  s2::s2_length(vec_s2),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::length_geodesic(vec_rsg…    8.69s    8.69s     0.115    3.69MB        0
#> 2 s2::s2_length(vec_s2)           359.53ms 362.05ms     2.76    25.85MB        0

bench::mark(
  rsgeo::length_euclidean(vec_rsgeo),
  geos::geos_length(vec_geos)
)
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::length_euclidean(vec_rsge…   251ms 265.5ms      3.77    3.69MB        0
#> 2 geos::geos_length(vec_geos)        31.7ms  32.2ms     29.9     3.69MB        0

bench::mark(
  rsgeo::bounding_box(vec_rsgeo),
  wk::wk_bbox(vec_wkb),
  wk::wk_bbox(vec_geos),
  check = FALSE
)
#> # A tibble: 3 × 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::bounding_box(vec_rsgeo)    256ms    256ms      3.91     2.1KB        0
#> 2 wk::wk_bbox(vec_wkb)              145ms    145ms      6.88   16.12KB        0
#> 3 wk::wk_bbox(vec_geos)             223ms    223ms      4.48    2.46KB        0

bench::mark(
  rsgeo::centroids(vec_rsgeo),
  s2::s2_centroid(vec_s2),
  geos::geos_centroid(vec_geos),
  check = FALSE,
  iterations = 3
)
#> # A tibble: 3 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::centroids(vec_rsgeo)      1.11s    1.11s     0.899    3.69MB     2.70
#> 2 s2::s2_centroid(vec_s2)       365.99ms 388.47ms     2.48    11.07MB     0   
#> 3 geos::geos_centroid(vec_geos) 203.58ms 206.49ms     3.54    11.07MB     0

Created on 2023-08-27 with reprex v2.0.2

JosiahParry commented 1 year ago

Thank you for checking this out!! I very much appreciate your feedback :)) Some thoughts as I read along.

lines_tbl_wkb <- arrow::read_parquet(
  "/Users/josiahparry/Downloads/ns-water-water_line-wkb.parquet"
)

vec_wkb <- wk::wkb(lines_tbl_wkb$geometry)
vec_sf <- sf::st_as_sfc(vec_wkb)
vec_geos <- geos::as_geos_geometry(vec_wkb)
vec_s2 <- s2::s2_geog_from_wkb(vec_wkb)
vec_rsgeo <- rsgeo::as_rsgeo(vec_sf)

bench::mark(
  rsgeo::minimum_rotated_rect(vec_rsgeo),
  geos::geos_minimum_rotated_rectangle(vec_geos),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                                          min   median `itr/sec`
#>   <bch:expr>                                     <bch:tm> <bch:tm>     <dbl>
#> 1 rsgeo::minimum_rotated_rect(vec_rsgeo)            1.99s    1.99s     0.501
#> 2 geos::geos_minimum_rotated_rectangle(vec_geos)    6.18s    6.18s     0.162
#> # ℹ 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

bench::mark(
  rsgeo::length_geodesic(vec_rsgeo),
  s2::s2_length(vec_s2),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                             min   median `itr/sec` mem_alloc
#>   <bch:expr>                        <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 rsgeo::length_geodesic(vec_rsgeo)    1.87s    1.87s     0.535    3.69MB
#> 2 s2::s2_length(vec_s2)             346.59ms 350.58ms     2.85    25.85MB
#> # ℹ 1 more variable: `gc/sec` <dbl>
paleolimbot commented 1 year ago

I think there could be a big improvement if list ALTREP is used (probably here and in geos) that way there could be just 1 pointer instead of n.

..and s2. I'd like to build some abstractions around ALTREP soon because I've implemented it a few times and it's very difficult to get right and it's totally undocumented. It is a lot of work to do (I'm sure you know this) which is why I haven't gotten to it. It is also mostly not a problem for the sizes of data that are typically encountered in R (although that's a chicken-and-egg problem...people might work with bigger data if it wasn't so awful).

It's also quite nice because having the class attribute actually allows me to not loop over each geometry to figure out its type.

...except for all the operations that don't require a specific geometry (notably, subsetting: vec_rsgeo[stuff]). You're making a bet that the unique geometry type will be accessed more than once, so pre-computing and caching it will pay off. Be very sure that this is the case before baking it in to the class attribute! It would be safer would be to implement geometry_class_name()...if in the future you want to implement the caching you can always update it to inspect the class name first.

I added parallelization to a few of these functions to pretend that they're faster now.

Cool! I've always wanted to do that with geos and s2...I imagine it's easier to code in rust. C++ doesn't have a built-in concept of a "thread pool" and I haven't kept up with the best ways to do this that don't involve vendoring huge swaths of unmaintainable C++.