hypertidy support - Githubissues

dcooley commented 4 years ago

@mdsumner - I figured it would be good to sketch out designs / requirements for supporting scilicate / hypertidy structures and the like.

I've started porting all the utility functions from sfheaders into here. I've added a few examples to the readme, and I think some of these will be applicable to your requirements (like the coordinate_indices() and id_positions()).

so, taking your split.data.frame requirement, what object would this operate on, and what does it give as an output?

TODO

[ ] unique coordinates
[ ] index map of coordinates to the unique set of coordinates

mdsumner commented 4 years ago

The split.data.frame is a small part of what PATH0 (paths), TRI0 (triangles), and SC0 (line segments) do:

identify unique vertices (in x-y)
collect indexes of (row numbers) of the vertex table that trace each geometry part
embed those indexes as a split data frame in object$path_ or object$topology_

library(silicate)
#> 
#> Attaching package: 'silicate'
#> The following object is masked from 'package:stats':
#> 
#>     filter

x <- minimal_mesh

## what PATH0 does
p <- PATH0(x)
## every path_ element is the index-trace through p$vertex, plus records which path we are
lapply(p$object$path_, function(x) x[c("vertex_", "path_")])
#> $`1`
#> # A tibble: 14 x 2
#>    vertex_ path_
#>      <int> <int>
#>  1       1     1
#>  2       2     1
#>  3      10     1
#>  4      12     1
#>  5       8     1
#>  6      11     1
#>  7       9     1
#>  8       1     1
#>  9       3     2
#> 10       6     2
#> 11       7     2
#> 12       5     2
#> 13       4     2
#> 14       3     2
#> 
#> $`2`
#> # A tibble: 5 x 2
#>   vertex_ path_
#>     <int> <int>
#> 1       9     3
#> 2      11     3
#> 3      13     3
#> 4      14     3
#> 5       9     3

## what SC0 does
sc <- SC0(x)
## every topology_ element is the segment-start-end-trace through sc$vertex, plus what path we are
## .vx0, .vx1 are segment end pairs
lapply(sc$object$topology_, function(x) x)
#> $`1`
#> # A tibble: 12 x 3
#>     .vx0  .vx1 path_
#>    <int> <int> <int>
#>  1     1     2     1
#>  2     2    10     1
#>  3    10    12     1
#>  4    12     8     1
#>  5     8    11     1
#>  6    11     9     1
#>  7     9     1     1
#>  8     3     6     2
#>  9     6     7     2
#> 10     7     5     2
#> 11     5     4     2
#> 12     4     3     2
#> 
#> $`2`
#> # A tibble: 4 x 3
#>    .vx0  .vx1 path_
#>   <int> <int> <int>
#> 1     9    11     3
#> 2    11    13     3
#> 3    13    14     3
#> 4    14     9     3

## what TRI0 does
tri <- TRI0(x)
## every topology_ element is the triangle-corner-trace through sc$vertex, plus what path we are
## .vx0, .vx1, .vx2 are triangle corner triplets
lapply(tri$object$topology_, function(x) x)
#> $`1`
#> # A tibble: 12 x 4
#>     .vx0  .vx1  .vx2 path_
#>    <int> <int> <int> <int>
#>  1     1     3     4     1
#>  2     6     3     1     1
#>  3     9    11     8     1
#>  4     8    12    10     1
#>  5     2     1     4     1
#>  6     6     1     9     1
#>  7     8    10     2     1
#>  8     2     4     5     1
#>  9     7     6     9     1
#> 10     8     2     5     1
#> 11     7     9     8     1
#> 12     8     5     7     1
#> 
#> $`2`
#> # A tibble: 2 x 4
#>    .vx0  .vx1  .vx2 path_
#>   <int> <int> <int> <int>
#> 1    11     9    14     3
#> 2    14    13    11     3

^{Created on 2020-04-29 by the reprex package (v0.3.0)}

There are only 14 vertices, but they occur a total of 19 times (3 to close each path, 2 are on a shared segment).

So, one step back from object$topology_ and object$path_ is a split(instances, instances$object) step which is much like this (though missing a nesting level):

polygon(do.call(rbind, sc$object$topology_), c(".vx0", ".vx1"), "path_")
[[1]]
     [,1] [,2]
[1,]    1    2
[2,]    2   10
[3,]   10   12
[4,]   12    8
[5,]    8   11
[6,]   11    9
[7,]    9    1

[[2]]
     [,1] [,2]
[1,]    3    6
[2,]    6    7
[3,]    7    5
[4,]    5    4
[5,]    4    3

[[3]]
     [,1] [,2]
[1,]    9   11
[2,]   11   13
[3,]   13   14
[4,]   14    9

So, I'm wondering if I should start to rejig things towards using matrices instead - but by far the messiest stuff in silicate is the de-duplication, which is a step prior to the split() part so I'm looking into that.

mdsumner commented 4 years ago

(oop also clearly there's a bug in TRI0() not recording the path properly, but fwiw it's not used anywhere yet)

mdsumner commented 4 years ago

And I went and tried the listList approach with great success, better sense than above with polygon()

cppFunction(
  depends = "geometries"
  , includes = '#include "geometries/shapes/shapes.hpp"'
  , code = '
    SEXP mpolygon( SEXP df, SEXP geometry_cols, SEXP line_id , SEXP poly_id) {
      return geometries::shapes::get_listListMat( df, geometry_cols, line_id, poly_id );
    }
  '
)
df <- dplyr::bind_rows(sc$object$topology_, .id = "object_")
mpolygon(df, c(".vx0", ".vx1"), "object_", "path_")

[[1]]
[[1]][[1]]
     [,1] [,2]
[1,]    1    2
[2,]    2   10
[3,]   10   12
[4,]   12    8
[5,]    8   11
[6,]   11    9
[7,]    9    1

[[1]][[2]]
     [,1] [,2]
[1,]    3    6
[2,]    6    7
[3,]    7    5
[4,]    5    4
[5,]    4    3

[[2]]
[[2]][[1]]
     [,1] [,2]
[1,]    9   11
[2,]   11   13
[3,]   13   14
[4,]   14    9

dcooley commented 4 years ago

but by far the messiest stuff in silicate is the de-duplication

So true. I think a "unique_coordinates()" would be a really useful feature. But I haven't found an easy way to impement this yet.

mdsumner commented 4 years ago

I got a bit interested in rgl, in as.mesh3d() it explains doing it for 3D vis, and optionally nominally specifying that vertices are not-equal even if they are - but it's all done in an R loop ...

What I've used in the past is

duplicated()
as.integer(factor(do.call(paste0, df))) - by far the fastest
dplyr::distinct()
dplyr::group_indices (the engine for distinct)
unjoin

I've used unjoin mostly because it's quite general and very easy to work with, but dm::decompose_tables is a bit faster.

And with all that, I was a bit scared of the whole thing but it's time to revisit ;)

mdsumner commented 4 years ago

Here is a re-imagined PATH0(). I had a few scary moments, but this is much clearer than the current source. It's not faster, and the major thing making it as fast as it is is dplyr::group_indices() in place of as.integer(factor(paste(...)))

(De-duplication was always planned to be general, but I think non x-y is now better treated specially, for various reasons - but it seems unjoin still wins on that front).

So, a non-dplyr group_indices() would be a powerful tool.

new_meta <- function(crs) {
  tibble::tibble(crs = crs, ctime = Sys.time())
}
new_PATH0 <- function(vertex, object, index, crs = NA_character_) {
  object[["path_"]] <- index
  structure(list(object = object, 
            vertex = vertex, 
            meta = new_meta(crs)), 
            class = c("path0", "PATH0", "sc"))
}
library(silicate)
#> 
#> Attaching package: 'silicate'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(sfheaders)

path0 <- function(x) {
  df <- sf_to_df(x)
  ## vertex_ is "unique coord"
  #df[["vertex_"]] <-  as.integer(factor(paste(df[["y"]], df[["x"]], sep = "|")))
  df[["vertex_"]] <-  dplyr::group_indices(df, .data$x, .data$y)

  ## the vertex table, separated out (we need vertex_ to remap)
  v <- df[!duplicated(df[["vertex_"]]), c("x", "y", "vertex_")]

  ## now remap (can this be done better?) 
  ## (it's the single most confusing thing I think ... and
  ## I ended up relying on unjoin)
  df[["vertex_"]] <- match(df$vertex_, v$vertex_)

  ## cleanup
  v[["vertex_"]] <- NULL
  df[["x"]] <- NULL
  df[["y"]] <- NULL

  ## feature id, object_, whatever the top level "keep" table has
  o <- df[!duplicated(df["sfg_id"]), "sfg_id", drop = FALSE]

  ## path_ just for compatibility with current silicate
  df[["path_"]] <- df[["linestring_id"]]
  ## only multipolygon
  if ("multipolygon_id" %in% names(df)) {
    df[["subobject"]] <- df[["polygon_id"]]
  } else {
    df[["subobject"]] <- 1
  }

  ## the embeded indexes of the sequential paths, with split.data.frame
  idx <- split(df[c("vertex_", "path_", "subobject")], df[["sfg_id"]])

  names(v) <- c("x_", "y_")  ## and these, compat with current silicate
  df$sfg_id <- NULL
  crs <- NA_character_
  new_PATH0(v, o, idx, crs = crs)
}

plot(path0(inlandwaters))

# plot(path0(sf::st_cast(inlandwaters, "POLYGON")))
# plot(path0(sf::st_cast(inlandwaters, "MULTILINESTRING")))

plot(path0(minimal_mesh))

#plot(path0(sf::st_sf(g = sf::st_sfc(sfzoo$multipolygon))))
#plot(TRI0(path0(minimal_mesh)), col = c("grey", "firebrick"), border = "black")
## etc. 
rbenchmark::benchmark(path0(inlandwaters), 
                      PATH0(inlandwaters), replications = 20)
#>                  test replications elapsed relative user.self sys.self
#> 1 path0(inlandwaters)           20    1.99    1.137      1.83     0.15
#> 2 PATH0(inlandwaters)           20    1.75    1.000      1.61     0.15
#>   user.child sys.child
#> 1         NA        NA
#> 2         NA        NA

^{Created on 2020-04-29 by the reprex package (v0.3.0)}

(Trust you don't mind me using this issue for reams of examples, it's helpful for me to air it in this way).

dcooley commented 4 years ago

(Trust you don't mind me using this issue for reams of examples, it's helpful for me to air it in this way).

Exactly what I need too.

mdsumner commented 4 years ago

One key part of the design of silicate that is missing above in path0() is the use of the "worker verbs", where sf_to_df() is doing the job of sc_coord() (all coordinates, in order) and sc_path() (a map of the paths, number of coordinates). So generic workers can be defined then packages with special requirements could add methods. A harmless "spatial generics" package would be imported and then an export of sc_coord.myspecialclass would provide the right functionality in presence of special package.

Then, models like PATH0(), TRI0() etc are composed only of generic code doing standard stuff, and can be defined on the fly if needed. If a model stores unique vertices (i.e. mesh3d()) then sc_coord() returns the expanded set of instances, if it stores all instances then sc_vertex() returns the denser set. It did sort of work, but needs community adoption to really fly. Your efforts to have libraries doing every form of manipulation are a strong basis for revisiting that. (I always loved coordinates() in sp, it works perfectly for SpatialPoints and Raster, but it gives different forms for lines and polygons - it's that predictability for automating that set me down this road from sp days, and in its way spbabel provided the first leg-up towards silicate).

mdsumner commented 4 years ago

A kind of progress report: I'm in the middle of a TRI0 cleanup using sf_to_df, and it's trivial to run decido over that but then I get to the "normalize vertices" part, do that before triangulating and grab each coordinate set via index, or after? It's compelling to do it after, or even just leave it but - you can't do edge-based triangulation (what anglr::DEL0() does) without normalizing the vertices first.

I'm hoping now to at least cleanup the way this is done, with all the index juggling that's grown organically - into something easier for us to discuss.

dcooley commented 4 years ago

with all the index juggling that's grown organically

If it helps, I'm working on an update to decido::Earcut .hpp so it returns coordinates directly, not indices, and will return either a data.frame with a trinagle_id, a list[[ list[[ matrix ]] ]] structure (i.e., a multipolygon), or an interleaved vector [x, y, x, y, x, y, ... ]

mdsumner commented 4 years ago

This has been so worthwhile, faster, easier:

new_TRI0 <- function(vertex, object, index, crs = NA_character_, meta = NULL) {
  meta1 <- tibble::tibble(proj = crs, ctime = Sys.time())
  if (!is.null(meta)) {
    meta <- rbind(meta1, meta)
  }
  object[["topology_"]] <- index
  structure(list(object = object, vertex = vertex,
                 meta = meta), class = c("TRI0", "sc"))
}

## build TRI0 with sfheaders

tri0 <- function(x, ...) {
  df <- sfheaders::sf_to_df(x)
  crs <- crsmeta::crs_proj(x)
  if (length(grep("POLYGON", class(x[[attr(x, "sf_column")]]) )) < 1) {
    stop("only polygon")
  }
  x[[attr(x, "sf_column")]] <- NULL
  object <- tibble::as_tibble(x)
  object$object_ <- 1:nrow(object)

  ## deduplicate in xy
  df[["vertex_"]] <-  dplyr::group_indices(df, .data$x, .data$y)
  ## the vertex table, separated out (we need vertex_ to remap)
  v <- df[!duplicated(df[["vertex_"]]), c("x", "y", "vertex_")]
  ## now remap (can this be done better?) 
  ## (alt. is unjoin())
  df[["vertex_"]] <- match(df$vertex_, v$vertex_)
  df[["path_"]] <- as.integer(factor(df[["linestring_id"]]))
  ## cleanup
  v[["vertex_"]] <- NULL
  df[["x"]] <- NULL  ## not really necessary to remove but highights the point
  df[["y"]] <- NULL  ## that these are now indexed in 'v'
  featurelist <- split(df, df$sfg_id)
  feature_triangles <- vector("list", length(featurelist))
  for (i in seq_along(featurelist)) {
    polylist <- split(featurelist[[i]], 
                      featurelist[[i]]$polygon_id)
    polytriangles <- vector("list", length(polylist))
    for (j in seq_along(polylist)) {
      ## j == 1 if POLYGON
      polygon <- polylist[[j]]
      ## in the past it was less clear to me how this
      ## coordinate table relates to the vertex table (much simpler now)
      holes <- which(c(0, abs(diff(polygon$linestring_id))) > 0)
      if (length(holes) < 1) {
        holes <- 0
      }
      xy <- cbind(v[["x"]][polygon$vertex_], 
                  v[["y"]][polygon$vertex_])
      tris <- matrix(decido::earcut(xy, holes), ncol = 3L, byrow = TRUE)
      polytriangles[[j]] <- tibble::tibble(.vx0 = polygon$vertex_[tris[,1L]], 
                                           .vx1 = polygon$vertex_[tris[,2L]],
                                           .vx2 = polygon$vertex_[tris[,3L]], 
                                           path_ = polygon$path_[1L])
      #decido::plot_ears(xy, trindex)
      #Sys.sleep(0.1)
    }
    feature_triangles[[i]] <- do.call(rbind, polytriangles)
  }
  names(v) <- c("x_", "y_")
  new_TRI0(v, object, feature_triangles, crs = crs)
}
library(silicate)
#> 
#> Attaching package: 'silicate'
#> The following object is masked from 'package:stats':
#> 
#>     filter
plot(tri0(inlandwaters))

## reasonable speed up
rbenchmark::benchmark(tri0(inlandwaters), TRI0(inlandwaters), 
                      replications = 7)
#>                 test replications elapsed relative user.self sys.self
#> 1 tri0(inlandwaters)            7    3.22    1.000      3.17     0.03
#> 2 TRI0(inlandwaters)            7   10.79    3.351     10.50     0.28
#>   user.child sys.child
#> 1         NA        NA
#> 2         NA        NA

nc <- sf::read_sf(system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE))
plot(tri0(nc))

^{Created on 2020-05-04 by the reprex package (v0.3.0)}

mdsumner commented 4 years ago

with all the index juggling that's grown organically

If it helps, I'm working on an update to decido::Earcut .hpp so it returns coordinates directly, not indices, and will return either a data.frame with a trinagle_id, a list[[ list[[ matrix ]] ]] structure (i.e., a multipolygon), or an interleaved vector [x, y, x, y, x, y, ... ]

It doesn't help here because having an index into unique coordinates is really key to the current structure of the silicate models - but it definitely helps in general for flexibility! (The edge-based triangulation was where I started this whole thing - and the automatic topology you get from shared vertices was key - so it's in there as a core even though it doesn't always need to be. Ear-cutting was a step back from RTriangle's license requirement, so that silicate could triangulate on its own, and it's also very different - path vs. edge triangulating for polys).

mdsumner commented 4 years ago

Here's SC0, no speed up but comparable. The key to speed in this was using group_indices() (not factor/paste) and collating the indexes of segments as matrices in the inner loop (in .path2seg), and only converting to tibble for that at the feature level.

The group_indices stuff maintains path ID (i.e. linestring_id) but it doesn't record the parent polygon_id or multilinestring id - I think it otherwise works for any line or polygon input.

  new_SC0 <- function(vertex, object, index, crs = NA_character_, meta = NULL) {
  meta1 <- tibble::tibble(proj = crs, ctime = Sys.time())
  if (!is.null(meta)) {
    meta <- rbind(meta1, meta)
  }
  object[["topology_"]] <- index
  structure(list(object = object, vertex = vertex,
                 meta = meta), class = c("SC0", "sc"))
}

## build SC0 with sfheaders

sc0 <- function(x, ...) {
  df <- sfheaders::sf_to_df(x)
  crs <- crsmeta::crs_proj(x)

  x[[attr(x, "sf_column")]] <- NULL
  object <- tibble::as_tibble(x)
  object$object_ <- 1:nrow(object)

  ## deduplicate in xy
  df[["vertex_"]] <-  dplyr::group_indices(df, .data$x, .data$y)
  ## the vertex table, separated out (we need vertex_ to remap)
  v <- df[!duplicated(df[["vertex_"]]), c("x", "y", "vertex_")]
  ## now remap (can this be done better?) 
  ## (alt. is unjoin())
  df[["vertex_"]] <- match(df$vertex_, v$vertex_)

  ## cleanup
  v[["vertex_"]] <- NULL
  df[["x"]] <- NULL  ## not really necessary to remove but highlights the point
  df[["y"]] <- NULL  ## that these are now indexed in 'v'

  ## a global linestring_id 
  if ("multipolygon_id" %in% names(df)) {
    df[["path_"]] <- dplyr::group_indices(df, .data$sfg_id, 
                                              .data$polygon_id, 
                                              .data$linestring_id)

  } else {
    df[["path_"]] <- dplyr::group_indices(df, .data$sfg_id, 
                                          .data$linestring_id)

  }
  featurelist <- split(df, df$sfg_id)
  feature_segments <- vector("list", length(featurelist))
  .path2seg <- function(x, pathid = NULL) {
    cbind(.vx0 = x[-length(x)], .vx1 = x[-1L], path_ = pathid)
  }

  for (i in seq_along(featurelist)) {
    segments <- lapply(split(featurelist[[i]][c("vertex_", "path_")], 
                      featurelist[[i]]$path_), 
                      function(lstring) .path2seg(lstring[["vertex_"]], 
                                                  pathid = lstring[["path_"]][1L]))
    feature_segments[[i]] <- tibble::as_tibble(do.call(rbind, segments))
  }
  names(v) <- c("x_", "y_")
  new_SC0(v, object, feature_segments, crs = crs)
}
library(silicate)
#> 
#> Attaching package: 'silicate'
#> The following object is masked from 'package:stats':
#> 
#>     filter
plot(sc0(inlandwaters))
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
suppressWarnings(plot(sc0(st_cast(inlandwaters, "POLYGON"))))

suppressWarnings(xx <- sc0(st_cast(st_cast(inlandwaters, "POLYGON"), "LINESTRING")))
print(xx)
#> class       : SC0
#> type        : Structural
#> vertices    : 30835 (2-space)
#> primitives  : 33455 (2-space)
#> crs         : NA
xx$object$color_ <- viridis::viridis(nrow(xx$object))
plot(xx)


rbenchmark::benchmark(sc0(inlandwaters), SC0(inlandwaters), 
                      replications = 7)
#>                test replications elapsed relative user.self sys.self user.child
#> 1 sc0(inlandwaters)            7    0.96    1.032      0.88     0.08         NA
#> 2 SC0(inlandwaters)            7    0.93    1.000      0.89     0.02         NA
#>   sys.child
#> 1        NA
#> 2        NA

nc <- sf::read_sf(system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE))
nc$color_ <- rainbow(nrow(nc), alpha = 0.6)
plot(sc0(nc))

^{Created on 2020-05-04 by the reprex package (v0.3.0)}

mdsumner commented 4 years ago

Just a final one to show that these are segments, we can address them individually

sc0(nc[1, ])
class       : SC0
type        : Structural
vertices    : 26 (2-space)
primitives  : 26 (2-space)
crs         : NA
nc$color_ <- NULL
plot(sc0(nc[1, ]), col = sample(viridis::viridis(26)), lwd = c(4, 1))

mdsumner commented 4 years ago

The non-dplyr group_indices is vctrs::vec_group_id hooray

mdsumner commented 4 years ago

Here I try direct from file so we can control GDAL directly to return WKB, then unpack with @paleolimbot's wk!

(we need a bit of a group chat about column names ?)

  #remotes::install_github("paleolimbot/wk")

library(wk)
library(vapour)
file <- "list_locality_postcode_meander_valley.tab"
layer <- "list_locality_postcode_meander_valley"
## A MapInfo TAB file with polygons
f <- system.file(file.path("extdata/tab", file), package="vapour")
g <- tibble::tibble(g = vapour_read_geometry(f))
dim(wk::wkb_coords(g$g))
#> [1] 55387     9

new_TRI0 <- function(vertex, object, index, crs = NA_character_, meta = NULL) {
  meta1 <- tibble::tibble(proj = crs, ctime = Sys.time())
  if (!is.null(meta)) {
    meta <- rbind(meta1, meta)
  }
  object[["topology_"]] <- index
  structure(list(object = object, vertex = vertex,
                 meta = meta), class = c("TRI0", "sc"))
}

## build TRI0 with wk and vapour (along the lines of `tri0()` above but direct via GDAL)

tri_file <- function(x, ...) {
  df <- wk::wkb_coords(vapour::vapour_read_geometry(x, ...))
  df$linestring_id <- df$ring_id
  df$polygon_id <- df$part_id
  df$sfg_id <- df$feature_id
  if (anyNA( df$polygon_id)) {
    df$polygon_id[is.na(df$polygon_id)] <- 0
  }

  #crs <- crsmeta::crs_proj(x)
  object <- tibble::tibble(object_ = seq_len(length(unique(df$feature_id))))

  ## deduplicate in xy
  df[["vertex_"]] <-  dplyr::group_indices(df, .data$x, .data$y)
  ## the vertex table, separated out (we need vertex_ to remap)
  v <- df[!duplicated(df[["vertex_"]]), c("x", "y", "vertex_")]
  ## now remap (can this be done better?) 
  ## (alt. is unjoin())
  df[["vertex_"]] <- match(df$vertex_, v$vertex_)
  df[["path_"]] <- as.integer(factor(df[["linestring_id"]]))
  ## cleanup
  v[["vertex_"]] <- NULL
  df[["x"]] <- NULL  ## not really necessary to remove but highights the point
  df[["y"]] <- NULL  ## that these are now indexed in 'v'
  featurelist <- split(df, df$sfg_id)
  feature_triangles <- vector("list", length(featurelist))
  for (i in seq_along(featurelist)) {
    polylist <- split(featurelist[[i]], 
                      featurelist[[i]]$polygon_id)
    polytriangles <- vector("list", length(polylist))
    for (j in seq_along(polylist)) {
      ## j == 1 if POLYGON
      polygon <- polylist[[j]]
      ## in the past it was less clear to me how this
      ## coordinate table relates to the vertex table (much simpler now)
      holes <- which(c(0, abs(diff(polygon$linestring_id))) > 0)
      if (length(holes) < 1) {
        holes <- 0
      }
      xy <- cbind(v[["x"]][polygon$vertex_], 
                  v[["y"]][polygon$vertex_])
      tris <- matrix(decido::earcut(xy, holes), ncol = 3L, byrow = TRUE)
      polytriangles[[j]] <- tibble::tibble(.vx0 = polygon$vertex_[tris[,1L]], 
                                           .vx1 = polygon$vertex_[tris[,2L]],
                                           .vx2 = polygon$vertex_[tris[,3L]], 
                                           path_ = polygon$path_[1L])
      #decido::plot_ears(xy, trindex)
      #Sys.sleep(0.1)
    }
    feature_triangles[[i]] <- do.call(rbind, polytriangles)
  }
  names(v) <- c("x_", "y_")

  new_TRI0(v, object, feature_triangles, crs = NA_character_)
}

library(silicate)
#> 
#> Attaching package: 'silicate'
#> The following object is masked from 'package:stats':
#> 
#>     filter
## now we can leverage vapour skip_n/limit_n/sql etc. 
plot(tri_file(f, limit_n = 10))


plot(tri_file(f, sql = glue::glue("SELECT * FROM {layer} WHERE FID < 10")))


xx <- tri_file(f)
plot(xx, border = "grey")

xx$object$color_ <- viridis::viridis(nrow(xx$object))
anglr::mesh_plot(xx)

#anglr::plot3d(xx)  ## etc

^{Created on 2020-05-06 by the reprex package (v0.3.0)}

paleolimbot commented 4 years ago

Cool!

My topology game is not great, but I think you could define a C++ WKGeometryHandler that creates your vertex tables (hash table?) on the fly from a WKB coordinate stream (if there were a WKBytesProvider that operated directly on the GDAL level, the WKB would never even have to be loaded into R!)

Happy to change the column names to make the coord function play nicer with this!

mdsumner commented 4 years ago

There's definitely no "one right way", and your nest column reminds me that we haven't covered enough bases.

I don't mind changing names at all but the NA fill seems unnecessary?

Thinking aloud, many functions that do a variety of encodings is important - but what about the two table approach as a core? One table is just geometry (xy or xyz etc) the other the run length of each linestring. Then you can have many classifiers at low storage cost, or eventually as virtual ALTREP columns like the vctrs rle stuff.

dcooley commented 4 years ago

but what about the two table approach as a core? One table is just geometry (xy or xyz etc) the other the run length of each linestring

Is that similar in concept to the prototype here?

where $data is the table of data, the other list element describe the geometry

mdsumner commented 4 years ago

Yes exactly!

 silicate::sc_path(sf)
# A tibble: 18,286 x 6
    ncol type       object_ subobject path_  ncoords_
   <int> <chr>      <chr>       <dbl> <chr>     <int>
 1     2 LINESTRING zXpe9i          1 owja2l       20
 2     2 LINESTRING KXeft1          1 saFXj6       26
 3     2 LINESTRING aghWBa          1 5OJWj4       16
 4     2 LINESTRING S58Z1Q          1 RUORz7        8
...
# ... with 18,276 more rows

silicate::sc_coord(sf)
# A tibble: 57,757 x 2
      x_    y_
   <dbl> <dbl>
 1  145. -37.8
...

(the IDs in sc_path are globally unique because reasons, but that leverages gibble to get the raw mapping).

mdsumner commented 4 years ago

object_ is feature_id

subobject is "polygon within feature" id (and easier to always be present, 1 for non-mpoly)

path_ is linestring_id

mdsumner commented 4 years ago

I chose those classifiers because they are "interpretation independent", only "path" has a special meaning analogous to "linestring". subobject is required for multipolygons, and only increments if there's another POLYGON. Then there's no need to classify "hole" or otherwise, it's just that "path" within object/subobject are by convention holes for polygons - otherwise they are just linestrings.

The idea then was that the "type" nominates how you interpret things, for multilinestring you forget subobject and you just have all path_ within feature, with multipolygon the subobject controls the extra level. Global IDs allow to subset arbitrarily but I think that's less important than this more fundamental common stuff.

dcooley commented 4 years ago

then I think now my understanding is finally where yours was at a few years ago! This is also similar to the discussions around Arrow and GeoPandas. The big difference with silicate, as you point out, is the unique-ness.

mdsumner commented 4 years ago

Yes that is probably overdone in hindsight because of where I started. All good, exciting times!

paleolimbot commented 4 years ago

My understanding is still way behind...I did make a wk(b|t)_meta() function that is sort of like sc_path() but still doesn't return quite enough information to roundtrip (roundtripping isn't really the point of those functions though).

When writing my coords thinger I was wondering if it would be more useful to have wkb_coords_point(), wkb_coords_linestring()...etc. That would solve the NA col problem and make the columns returned by each function stable. Is it really useful to represent arbitrary collections in flat format? My use is mostly plotting, where the answer is probs not.

mdsumner commented 4 years ago

You're right it's absolutely not. I'm just really deep into it as a workaround for two in-memory formats. With all these new approaches it'll fade away because we have flexibility to cut to the chase 🤗

mdsumner commented 4 years ago

And yes to those functions that's precisely the right flexibility!

mdsumner commented 4 years ago

I'm leaning more and more towards a simpler model at the base of silicate, this is idxTRI0 - it's two tables, object (features) with topology_ list of triangle index matrices, and coord - all the coordinates in order, no dedupe.

new_idxTRI0 <- function(coord, object, index, crs = NA_character_, meta = NULL) {
  meta1 <- tibble::tibble(proj = crs, ctime = Sys.time())
  if (!is.null(meta)) {
    meta <- rbind(meta1, meta)
  }
  object[["topology_"]] <- index

  structure(list(object = object, coord = coord,
                 meta = meta), class = c("idxTRI0", "sc"))
}

# pure metal index
idxTRI0 <- function(x, ...) {
 df <- sfheaders::sf_to_df(x)
 df$coord_ <- seq_len(dim(df)[1L])
    crs <- crsmeta::crs_proj(x)
    if (length(grep("POLYGON", class(x[[attr(x, "sf_column")]]) )) < 1) {
      stop("only polygon")
    }
    x[[attr(x, "sf_column")]] <- NULL
    object <- tibble::as_tibble(x)
    object$object_ <- 1:nrow(object)

    featurelist <- split(df, df$sfg_id)
    feature_triangles <- vector("list", length(featurelist))
    for (i in seq_along(featurelist)) {
      polylist <- split(featurelist[[i]], 
                        featurelist[[i]]$polygon_id)
      polytriangles <- vector("list", length(polylist))
      for (j in seq_along(polylist)) {
        polygon <- polylist[[j]]
        holes <- which(c(0, abs(diff(polygon$linestring_id))) > 0)
        if (length(holes) < 1) {
          holes <- 0
        }
        xy <- polygon[c("x", "y")]
        tris <- matrix(decido::earcut(xy, holes), ncol = 3L, byrow = TRUE)
        polytriangles[[j]] <- cbind(.vx0 = polygon$coord_[tris[,1L]], 
                                             .vx1 = polygon$coord_[tris[,2L]],
                                             .vx2 = polygon$coord_[tris[,3L]])
      }
      feature_triangles[[i]] <- do.call(rbind, polytriangles)
    }
    coords <- df[c("x", "y")]
    names(coords) <- c("x_", "y_")
    new_idxTRI0(coords, object, feature_triangles, crs = crs)
}

There's no print, plot, or any support in silicate yet (the 'coord' - not 'vertex' - table is a break from every other model, so sc_coord and sc_vertex will need some attention).

But it's compelling:

system.time(TRI0(inlandwaters))
   user  system elapsed 
   1.58    0.09    1.67 
system.time(idxTRI0(inlandwaters))
   user  system elapsed 
    0.1     0.0     0.1

Given everything I've learnt recently, the de-duplication has been unnecessary a lot of the time (it's required for constrained triangulation with RTriangle where I started, and it'll always be necessary for high quality triangulation - and all the topology stuff requires it, but do when necessary is I think better).

dcooley commented 4 years ago

Leaving this code here while I experiment

library(Rcpp)

cppFunction(
  depends = c("decido", "sfheaders")
  , include = c(
    '#include "decido/decido.hpp"'
    , '#include "sfheaders/df/sf.hpp"'
  )
  , code = '
    SEXP tris( Rcpp::DataFrame sf ) { 

      std::string geom_col = sf.attr("sf_column");
      Rcpp::List sfc = sf[ geom_col ];

      Rcpp::DataFrame df = sfheaders::df::sf_to_df( sf, false );

      Rcpp::NumericVector x = df["x"];
      Rcpp::NumericVector y = df["y"];

      R_xlen_t i;
      R_xlen_t n = sfc.length();
      Rcpp::List res( n );
      R_xlen_t total_rows = 0; // for keeping track of final number of coordinates 

      for( i = 0; i < n; ++i ) {
        SEXP polygon = sfc[ i ];
        Rcpp::IntegerVector idx = decido::api::earcut( polygon );
        Rcpp::NumericVector xx = x[ idx ];
        Rcpp::NumericVector yy = y[ idx ];
        total_rows = total_rows + xx.length();
        res[ i ] = Rcpp::List::create(
          Rcpp::_["x"] = xx,
          Rcpp::_["y"] = yy
        );
      }
      // TODO - collapse_list is moving to geometrires::
      Rcpp::List coords = sfheaders::df::collapse_list( res, total_rows );

      return Rcpp::List::create(
        Rcpp::_["data"] = sf, // TODO: remove geometry
        Rcpp::_["coord"] = coords
      );
    }
  '
)

sf <- sfheaders::sf_cast( silicate::inlandwaters, "POLYGON" )

tri1 <- idxTRI0(sf)
tri2 <- tris( sf )

microbenchmark::microbenchmark(
  sc = { idxTRI0(sf) },
  cpp = { tris( sf ) },
  times = 25
)

# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
#  sc 82.55239 84.01620 85.60736 84.72283 86.78654 91.75369    25
# cpp 15.44566 16.09095 17.82840 17.75326 19.33051 20.44533    25

dcooley commented 4 years ago

still experimenting - this time without sfheaders and using geometries in stead. Performance is the same (as expected; the code is almost identical).

library(Rcpp)

cppFunction(
  depends = c("decido", "geometries")
  , include = c(
    '#include "decido/decido.hpp"'
    , '#include "geometries/coordinates/coordinates_impl.hpp"'
  )
  , code = '
    SEXP tris( Rcpp::DataFrame sf ) { 

      std::string geom_col = sf.attr("sf_column");
      Rcpp::List sfc = sf[ geom_col ];

      Rcpp::DataFrame df = geometries::coordinates::coordinates_impl( sfc );

      Rcpp::NumericVector x = df["c1"];
      Rcpp::NumericVector y = df["c2"];

      R_xlen_t i;
      R_xlen_t n = sfc.length();
      Rcpp::List res( n );
      R_xlen_t total_rows = 0; // for keeping track of final number of coordinates 

      for( i = 0; i < n; ++i ) {
        SEXP polygon = sfc[ i ];
        Rcpp::IntegerVector idx = decido::api::earcut( polygon );
        Rcpp::NumericVector xx = x[ idx ];
        Rcpp::NumericVector yy = y[ idx ];
        total_rows = total_rows + xx.length();
        res[ i ] = Rcpp::List::create(
          Rcpp::_["x"] = xx,
          Rcpp::_["y"] = yy
        );
      }

      Rcpp::List coords = geometries::utils::collapse_list( res, total_rows );

      return Rcpp::List::create(
        Rcpp::_["data"] = sf, // TODO: remove geometry
        Rcpp::_["coord"] = coords
      );
    }
  '
)

sf <- sfheaders::sf_cast( silicate::inlandwaters, "POLYGON" )

tri1 <- idxTRI0(sf)
tri2 <- tris( sf )

microbenchmark::microbenchmark(
  sc = { idxTRI0(sf) },
  cpp = { tris( sf ) },
  times = 25
)

# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
#  sc 82.92683 84.99317 88.46958 89.41726 91.01554 93.91182    25
# cpp 15.29395 15.54418 17.06758 15.93848 16.80718 22.00044    25

mdsumner commented 4 years ago

@dcooley just FYI, this is related as the "packed SpatVector" in the terra package uses a run-length encode for vector objects (much like what gibble::gibble() does, and used by silicate::PATH0:

https://github.com/rspatial/terra/issues/50#issuecomment-646287329

i.e. see "index" in str(pack(v))

library(terra)
f <- system.file("ex/lux.shp", package="terra")
v <- vect(f)

str(pack(v))
Formal class 'PackedSpatVector' [package "terra"] with 5 slots
  ..@ type       : chr "polygons"
  ..@ crs        : chr "GEOGCRS[\"WGS 84\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.257223"| __truncated__
  ..@ coordinates: num [1:3995, 1:2] 6.03 6.03 6.04 6.04 6.04 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:2] "x" "y"
  ..@ index      : num [1:12, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:4] "id" "part" "hole" "start"
  ..@ attributes :'data.frame': 12 obs. of  5 variables:
  .. ..$ ID_1  : num [1:12] 1 1 1 1 1 2 2 2 3 3 ...
  .. ..$ NAME_1: chr [1:12] "Diekirch" "Diekirch" "Diekirch" "Diekirch" ...
  .. ..$ ID_2  : num [1:12] 1 2 3 4 5 6 7 12 8 9 ...
  .. ..$ NAME_2: chr [1:12] "Clervaux" "Diekirch" "Redange" "Vianden" ...
  .. ..$ AREA  : num [1:12] 312 218 259 76 263 188 129 210 185 251 ...

I find it a terrible shame that there's no community-consistency around use of this kind of representation, but obviously several folks have settled on something that could be pretty standard so there's enough to tell a story here, I'll try to do that.

paleolimbot commented 4 years ago

To support some stuff in S2 I needed a coordinates -> WK converter, and it turns out it's really fast! I haven't looked at the source code of sfheaders or geometries closely enough to know what the differences are (or maybe my benchmark is measuring something different). This also doesn't do multi* geometries yet.

library(wk) # remotes::install_github("paleolimbot/wk")
states_df <- ggplot2::map_data("state")
states_df$region <- factor(states_df$region)

bench::mark(
  wk = coords_polygon_translate_wkb(
    states_df$long,
    states_df$lat,
    feature_id = states_df$region, 
    ring_id = states_df$group
  ),
  wk_sxp = coords_polygon_translate_wksxp(
    states_df$long,
    states_df$lat,
    feature_id = states_df$region, 
    ring_id = states_df$group
  ),
  sfheaders = sfheaders::sfc_polygon(
    states_df,
    x = "long",
    y = "lat", 
    polygon_id = "region",
    linestring_id = "group"
  ),
  check = FALSE
)[1:5]
#> # A tibble: 3 x 5
#>   expression      min   median `itr/sec` mem_alloc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 wk            970µs   1.25ms     800.     1.26MB
#> 2 wk_sxp      749.1µs   1.01ms     987.     1.15MB
#> 3 sfheaders    12.3ms  12.78ms      77.2 1018.67KB

^{Created on 2020-06-21 by the reprex package (v0.3.0)}

dcooley commented 4 years ago

yeah the polygon performance is a known limitation, and I've also been working on improving it, whilst also keeping the function generic so it will work on any geometry / level of nesting in the result.

The splitting of the data is the easy bit I think, and I've added it as a comparison to your benchmark

microbenchmark::microbenchmark(
  wk = { coords_polygon_translate_wkb(
      states_df$long,
      states_df$lat,
      feature_id = states_df$region, 
      ring_id = states_df$group
    )
  },

  wk_sxp = { coords_polygon_translate_wksxp(
      states_df$long,
      states_df$lat,
      feature_id = states_df$region, 
      ring_id = states_df$group
    )
  }, 

  geom = {
    geometries:::rcpp_nested_rleid(
      states_df
      , c( 4, 5 )
    )
  }
)

# Unit: microseconds
#   expr     min       lq      mean   median       uq       max neval
#     wk 645.516 695.6870  968.0927 712.8785 1011.287  8657.403   100
# wk_sxp 437.039 479.7635  736.2450 496.5695  569.168 12404.027   100
#   geom 714.097 748.2310 1047.1301 772.6895  920.459  9118.749   100

It still needs a bit of work to package the output, but hopefully I'll end up with one function which can create any geometry shape.

paleolimbot commented 4 years ago

Cool! Creating any geometry shape sounds really cool. When you've got your geometry format down I'd love to write wk readers and writers to support them!

dcooley commented 4 years ago

Another update - I've got a general "make geometry" function, which will put matrices inside lists based on the number of id columns. So there's no need to switch on geometry type (POINT, LINE, POLYGON). And it will work on any number of geometry columns, so no need to specify x, y, z, m, a, b, ...

I'm going to add in a nest argument so you can also control how many levels of nesting there are

Updated benchmark

library(wk)

states_df <- ggplot2::map_data("state")
states_df$region <- factor(states_df$region)

microbenchmark::microbenchmark(
  wk = {
    wk <- coords_polygon_translate_wkb(
      states_df$long,
      states_df$lat,
      feature_id = states_df$region,
      ring_id = states_df$group
    )
  },

  wk_sxp = {
    wk_sxp <- coords_polygon_translate_wksxp(
      states_df$long,
      states_df$lat,
      feature_id = states_df$region,
      ring_id = states_df$group
    )
  },

  geom = {
    g <- geometries:::rcpp_nested_rleid(
      states_df
      , c( 4, 5 )
      , c( 0, 1 )
    )
  }
)

# Unit: microseconds
#   expr     min      lq     mean   median       uq       max neval
#     wk 682.953 702.714 915.5185 714.7985 761.6645  8644.427   100
# wk_sxp 463.638 487.296 627.3311 496.5690 520.1725 10669.165   100
#   geom 527.690 584.324 915.1112 593.5455 623.5320 11102.187   100

> g[[8]]
[[1]]
[,1]     [,2]
[1,] -77.13731 38.94394
[2,] -77.06283 38.99551
[3,] -77.01699 38.97259
[4,] -76.93105 38.89238
[5,] -77.05136 38.80643
[6,] -77.05136 38.82935
[7,] -77.06283 38.86373
[8,] -77.07428 38.88664
[9,] -77.09720 38.90956
[10,] -77.13731 38.94394

> wk_sxp[[8]]
[[1]]
[,1]     [,2]
[1,] -77.13731 38.94394
[2,] -77.06283 38.99551
[3,] -77.01699 38.97259
[4,] -76.93105 38.89238
[5,] -77.05136 38.80643
[6,] -77.05136 38.82935
[7,] -77.06283 38.86373
[8,] -77.07428 38.88664
[9,] -77.09720 38.90956
[10,] -77.13731 38.94394

attr(,"class")
[1] "wk_polygon"

paleolimbot commented 4 years ago

That's genius! Absolutely the right way to parameterize this on the C++ end. Switching on geometry type is a pain, which is why there are no multi* parsers in wk.

dcooley / geometries

hypertidy support #4