dcooley / sfheaders

Build sf objects from R and Rcpp
https://dcooley.github.io/sfheaders/
Other
74 stars 5 forks source link

performance for very many POLYGON #62

Closed mdsumner closed 4 years ago

mdsumner commented 4 years ago

I've found I can get faster results than sf_polygon just using R, by templating the first sfg and simply replicating that out for the remaining geometrys. I'm sorry about the example, it's just that my main VM is playing up and the proper code is not available atm. This example creates polygons from raster pixels, the code in spex::polygonize uses the template idea - but I get the same kind of results when I'm generating POLYGON from triangles.

The code I use is not generalized to other geometry types or to XYZ, XYM, XYZM, and it's only done for POLYGON without holes - but I think it's worth exploring if sfheaders can be faster?

(It would be tricky to generalize the approach here to polygons with holes and multipolygons, I don't want to replace sfheaders with R but find out if sfheaders can be faster - perhaps this is a special case and it's the nesting that requires time, I'm not sure).

Is it possible that the sfg creation is the slow part, is that worth doing just once and replacing the coordinates in a copy at the C++ level? (I'll try exploring there)

library(raster)
r <- raster::raster(matrix(0, 256, 256))
## centres
xy <- raster::coordinates(r)
## trick to generate polygons from raster pixels
corners <- do.call(rbind, lapply(split(t(xy), rep(seq_len(nrow(xy)), each = 2)), 
       function(x)  { ## x is a single centre point
         matrix(x, ncol = 1)[c(1, 1, 1, 1, 1), ] + matrix((raster::res(r)/2) * c(-1, -1, 
                                                                                 -1, 1, 
                                                                                 1, 1, 
                                                                                 1, -1, 
                                                                                 -1, -1), 
                                                          ncol = 2, byrow = TRUE)
         }))
df <- as.data.frame(corners)

df$polygon_id <- df$linestring_id <- rep(seq_len(ncell(r)), each = 5)
system.time({
  sfh <- sfheaders::sf_polygon(df, linestring_id = "linestring_id", polygon_id = "polygon_id")
})
# user  system elapsed 
# 31.933   0.012  31.932 

## spex uses the templating idea
system.time({
  spx <- spex::polygonize(r)
})
# user  system elapsed 
# 1.223   0.000   1.222 
mdsumner commented 4 years ago

Ugh, I mucked up a bit up there - having massive VM problems today, please ignore until I can sort it out - thanks!

dcooley commented 4 years ago

Ugh, I mucked up a bit up there - having massive VM problems today, please ignore until I can sort it out - thanks!

sure, but you bring up a good point

Note to me: this subset happens every iteration for every sfg - can I move it outside and call it once, similar to how I use get_sfc_coordinates()

mdsumner commented 4 years ago

Here is the grand reprex I was working on when my computer troubles forced the less satisfactory report above

#remotes::install_github("hypertidy/anglr")

## 1. worker function to convert mesh to df 
# (a long tortured path, just so we have nice triangles )
to_sfh_df.mesh3d <- function(x, ...) {
  idx <- if (!is.null(x$it)) x$it else x$ib
  each <- nrow(idx) + 1 ## to close polygon for sf
  ## we need to close loops for sf
  idx <- rbind(idx, idx[1, ])
  nc <- dim(idx)[2L]
  idx <- as.vector(idx)
  out <- tibble::as_tibble(
    cbind(x = x$vb[1L, idx],
          y = x$vb[2L, idx],
          z = x$vb[3L, idx], 
          linestring_id = rep(seq_len(nc), each = each), 
          polygon_id = rep(seq_len(nc), each = each)))
}

## 2.  a worker function to demonstrate blazing vectorized sf-polygon creation
## for a simple case of very many features
mk_direct <- function(df) {
  ## we're assume only POLYGON, XY
  ## template the sfg and populate copies of it is much faster
  template <- sfheaders::sf_polygon(data.frame(x = c(0, 0, 1, 1, 0), y = c(0, 1, 1, 0, 0), 
                                               linestring_id = 0, polygon_id = 0), 
                                    x = "x", y = "y", 
                                    linestring_id = "linestring_id", 
                                    polygon_id = "polygon_id")$geometry[[1]]

  ## this only works for POLYGON, and when they have no holes
  ## split the coordinates and populate copies of the template
  sflist <- lapply(split(t(as.matrix(df[c("x", "y")])), 
                         rep(df$linestring_id, each = 2)),  ## because we have 2-columns x, y
                   function(ixy) {
                     template[[1L]] <- matrix(ixy, ncol = 2, byrow = TRUE)

                     template
                   })

  ## generate the final sf df
  structure(list(geometry = structure(sflist, class = c("sfc_POLYGON", "sfc"), 
                                      precision = 0.0, bbox = structure(c(xmin = min(d2$x), ymin = min(df$y), xmax = max(df$x), ymax = max(df$y)), class = "bbox"), 
                                      crs = structure(list(epsg = NA_integer_, 
                                                           proj4string = NA_character_), 
                                                      class = "crs"), n_empty = 0L)), 
            class = c("sf", "data.frame"), sf_column = "geometry", agr = character(0L), row.names = seq_along(sflist))
}

## EXAMPLE DATA 
exdata <- silicate::inlandwaters

## very few, complex polygons
d1 <- sfheaders::sf_to_df(exdata)

## very many, simple polygons (triangles), using 
## obscure constrained triangulation to get a dense, high-quality mesh
d2 <- to_sfh_df.mesh3d(anglr::as.mesh3d(anglr::DEL(exdata, max_area = 1e10)))

## the usual situation, sfheaders is fast (very few features, many verts per feature)
system.time({sfheaders::sf_multipolygon(d1, 
                                      x = "x", y = "y", 
                                      linestring_id = "linestring_id", 
                                      polygon_id = "polygon_id", 
                                      multipolygon_id = "multipolygon_id")})

#  user  system elapsed 
#  0.150   0.004   0.155 

## a very specific situation, sfheaders is slow (very many features, all are triangles)
system.time({sfx <- sfheaders::sf_polygon(d2, 
                                      x = "x", y = "y", 
                                      linestring_id = "linestring_id", 
                                      polygon_id = "polygon_id")})
#>    user  system elapsed 
#>  24.745   0.000  24.744

## we can beat sfheaders with R code 
system.time({
  res <- mk_direct(d2)
})
#>    user  system elapsed 
#>   0.854   0.000   0.854

Created on 2020-03-14 by the reprex package (v0.3.0)

I'm pretty sure the key is templating the sfg and only creating it once, and while this example is only for triangles (or quads) the same copy the raw matrix into the template approach should work generally. (Though I still haven't explored the underlying code).

dcooley commented 4 years ago

This is now working (and quick) in geometries

system.time({
  gmx <- geometries::gm_make_geometries(
    df = d2
    , id_cols = c("polygon_id", "linestring_id")
    , geometry_cols = c("x","y")
    , class_attributes = c("XY","POLYGON","sfg")
  )
})
user  system elapsed 
0.015   0.000   0.016

(This is only for making geometries, not handling the data attributes, so is actually comparible to sfheaders::sfc_polygon )

mdsumner commented 4 years ago

Great!

dcooley commented 4 years ago

@mdsumner I've now got this update almost ready for release. Would you mind checking out this branch before I merge it to master and testing it still works on some of your existing codes?

remotes::install_github("dcooley/geometries")
remotes::install_github("dcooley/sfheaders", ref = "issue82")

(no rush - will take a few days for me to get geometries on CRAN)

mdsumner commented 4 years ago

well, it's FAST sheesh - no waiting whatsover, ramped it up so now the meshing is the slow part and it's less than a second to generate 325000 triangles from a df

That actually entirely shifts perspective it's amazing, thanks!

dcooley commented 4 years ago

cool. If you don't need the sf stuff you can also use {geometries} directly

mdsumner commented 4 years ago

indeed, well the old quadmesh stuff is now the slow part - time for me to level up again ;0