geoarrow / geoarrow-r

Extension types for geospatial data for use with 'Arrow'
http://geoarrow.org/geoarrow-r/
Apache License 2.0

cannot write geoarrow ipc file with interleaved coordinates #52

Closed: tim-salabim closed this issue 1 month ago

tim-salabim commented 2 months ago

Is it possible to write geoarrow data with interleaved coordinates? The following reprex fails for me, whereas it works with separate coordinates.

library(geoarrow)
library(nanoarrow)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE

n = 100

dat = data.frame(
  id = 1:n
  , lon = rnorm(n, 19, 3)
  , lat = rnorm(n, 50, 3)
)

pts = st_as_sf(dat, coords = c("lon", "lat"), crs = 4326)

schema = na_extension_geoarrow(
  "POINT"
  , crs = geoarrow_schema_parse(infer_geoarrow_schema(pts))$crs
  # , coord_type = "SEPARATE"
  , coord_type = "INTERLEAVED"
)

schema
#> <nanoarrow_schema geoarrow.point{fixed_size_list(2)}>
#>  $ format    : chr "+w:2"
#>  $ name      : NULL
#>  $ metadata  :List of 2
#>   ..$ ARROW:extension:name    : chr "geoarrow.point"
#>   ..$ ARROW:extension:metadata: chr "{\"crs\":{\n  \"$schema\": \"https://proj.org/schemas/v0.7/projjson.schema.json\",\n  \"type\": \"GeographicCRS"| __truncated__
#>  $ flags     : int 2
#>  $ children  :List of 1
#>   ..$ xy:<nanoarrow_schema double>
#>   .. ..$ format    : chr "g"
#>   .. ..$ name      : chr "xy"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 0
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>  $ dictionary: NULL

as_geoarrow_array(pts, schema = schema)
#> <nanoarrow_array geoarrow.point{fixed_size_list(2)}[100]>
#>  $ length    : int 100
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ xy:<nanoarrow_array double[200]>
#>   .. ..$ length    : int 200
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<double>[200][1600 b]> `19.1 46.6 17.1 51...`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL

write_nanoarrow(
  as_geoarrow_array(pts, schema = schema)
  , "/home/tim/Downloads/test.arrow"
)
#> Error in write_nanoarrow.connection(data, con): ArrowIpcWriterWriteArrayStream() failed: Cannot encode schema with format '+w:2'; top level schema must have struct type

arrow::write_ipc_stream(
  as_geoarrow_array(pts, schema = schema)
  , "/home/tim/Downloads/test.arrow"
)
#> Error: Invalid: Cannot import schema: ArrowSchema describes non-struct type geoarrow.point <CRS: {
#>   "$schema": "https://pro...

arrow::write_parquet(
  as_geoarrow_array(pts, schema = schema)
  , "/home/tim/Downloads/test.parquet"
)
#> Error: Invalid: Cannot import schema: ArrowSchema describes non-struct type geoarrow.point <CRS: {
#>   "$schema": "https://pro...

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Berlin
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] sf_1.0-17            nanoarrow_0.5.0.9000 geoarrow_0.2.1.9000 
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5          compiler_4.4.1     tidyselect_1.2.1   reprex_2.1.0      
#>  [5] Rcpp_1.0.13        assertthat_0.2.1   arrow_15.0.1       yaml_2.3.10       
#>  [9] fastmap_1.2.0      R6_2.5.1           classInt_0.4-10    knitr_1.48        
#> [13] units_0.8-5        R.cache_0.16.0     DBI_1.2.3          R.utils_2.12.3    
#> [17] rlang_1.1.4        xfun_0.47          fs_1.6.4           bit64_4.0.5       
#> [21] cli_3.6.3          withr_3.0.0        magrittr_2.0.3     class_7.3-22      
#> [25] wk_0.9.3           digest_0.6.37      grid_4.4.1         rstudioapi_0.16.0 
#> [29] lifecycle_1.0.4    R.methodsS3_1.8.2  R.oo_1.26.0        vctrs_0.6.5       
#> [33] KernSmooth_2.23-22 proxy_0.4-27       evaluate_0.24.0    glue_1.7.0        
#> [37] styler_1.10.3      e1071_1.7-14       rmarkdown_2.28     purrr_1.0.2       
#> [41] tools_4.4.1        htmltools_0.5.8.1

Created on 2024-09-23 with reprex v2.1.0

Note, this is triggered by https://github.com/geoarrow/deck.gl-layers/issues/126 as deck.gl-layers currently only accepts interleaved coordinates.

paleolimbot commented 2 months ago

IPC can only handle tables, so this will fail for line strings too! I think you will have to wrap it in a one-column struct/data.frame.
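Roughly something like this (untested sketch, reusing schema and pts from your reprex; na_struct() is only there so that the top-level type is a struct):

# untested sketch: put the interleaved geometry into a one-column struct so the
# top-level schema is data frame-like, which is what the IPC writer requires
wrapped_schema <- na_struct(list(geometry = schema))

pts["geometry"] |>
  as_nanoarrow_array_stream(schema = wrapped_schema) |>
  write_nanoarrow("/home/tim/Downloads/test.arrow")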

tim-salabim commented 2 months ago

I don't think I understand, sorry. Can you provide an example of how to write an sf object to arrow with interleaved coordinates?

paleolimbot commented 2 months ago

Sorry I lost track of this! There are some rough edges on geoarrow's part here, but what I was getting at is that write_nanoarrow() (or Arrow's write_parquet()) only handles data frame-like things. In other words, it's not the struct-ness of the coordinates that is the problem here; it's that the geometry needs to be its own column in a struct (like an sf object).

library(geoarrow)
library(nanoarrow)
library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE

n = 100

dat = data.frame(
  id = 1:n
  , lon = rnorm(n, 19, 3)
  , lat = rnorm(n, 50, 3)
)

pts = st_as_sf(dat, coords = c("lon", "lat"), crs = 4326)

# The ability to pass geoarrow_coord_type = ... through
# infer_nanoarrow_schema.sfc/sf and/or as_nanoarrow_array_stream()
# would eliminate the need for this part
geom_col_name <- attr(pts, "sf_column")
geom_type <- infer_geoarrow_schema(pts, coord_type = "INTERLEAVED")
pts_schema <- infer_nanoarrow_schema(pts)
pts_schema$children[[geom_col_name]] <- geom_type

pts_schema
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 0
#>  $ children  :List of 2
#>   ..$ id      :<nanoarrow_schema int32>
#>   .. ..$ format    : chr "i"
#>   .. ..$ name      : chr "id"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ geometry:<nanoarrow_schema geoarrow.point{fixed_size_list(2)}>
#>   .. ..$ format    : chr "+w:2"
#>   .. ..$ name      : chr "geometry"
#>   .. ..$ metadata  :List of 2
#>   .. .. ..$ ARROW:extension:name    : chr "geoarrow.point"
#>   .. .. ..$ ARROW:extension:metadata: chr "{\"crs\":{\n  \"$schema\": \"https://proj.org/schemas/v0.4/projjson.schema.json\",\n  \"type\": \"GeographicCRS"| __truncated__
#>   .. ..$ flags     : int 2
#>   .. ..$ children  :List of 1
#>   .. .. ..$ xy:<nanoarrow_schema double>
#>   .. .. .. ..$ format    : chr "g"
#>   .. .. .. ..$ name      : chr "xy"
#>   .. .. .. ..$ metadata  : list()
#>   .. .. .. ..$ flags     : int 0
#>   .. .. .. ..$ children  : list()
#>   .. .. .. ..$ dictionary: NULL
#>   .. ..$ dictionary: NULL
#>  $ dictionary: NULL

out <- tempfile()
pts |>
  as_nanoarrow_array_stream(schema = pts_schema) |>
  write_nanoarrow(out)

# Implementing st_as_sf.nanoarrow_array_stream in geoarrow would make this
# much simpler!
tbl <- read_nanoarrow(out) |>
  tibble::as_tibble()

tbl[[geom_col_name]] <- st_as_sfc(tbl[[geom_col_name]])
sf::st_as_sf(tbl)
#> Simple feature collection with 100 features and 1 field
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 12.02193 ymin: 43.40038 xmax: 24.99115 ymax: 55.39772
#> Geodetic CRS:  WGS 84
#> # A tibble: 100 × 2
#>       id            geometry
#>    <int>         <POINT [°]>
#>  1     1 (23.18507 49.99223)
#>  2     2 (16.00192 45.08137)
#>  3     3  (20.83954 52.3167)
#>  4     4 (17.30396 46.12193)
#>  5     5 (14.45836 53.15084)
#>  6     6 (20.09797 50.69008)
#>  7     7 (15.51338 51.11113)
#>  8     8 (17.65451 46.20986)
#>  9     9 (20.65463 51.50423)
#> 10    10 (17.04951 51.80477)
#> # ℹ 90 more rows

I am not sure I ever completed the thought in wk, but the chunking logic by coordinate might be useful to you if you're streaming large things:

library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE

n = 100

dat = data.frame(
  id = 1:n
  , lon = rnorm(n, 19, 3)
  , lat = rnorm(n, 50, 3)
)

pts = st_as_sf(dat, coords = c("lon", "lat"), crs = 4326)

chunker <- wk::wk_chunk_strategy_coordinates(chunk_size = 15)
chunker(list(pts), nrow(pts))
#>   from  to
#> 1    1  15
#> 2   16  30
#> 3   31  45
#> 4   46  60
#> 5   61  75
#> 6   76  90
#> 7   91 100

Created on 2024-09-25 with reprex v2.1.1
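For instance (untested sketch, reusing pts_schema and out from the example above, and assuming as_nanoarrow_array() accepts an sf subset with that struct schema the same way as_nanoarrow_array_stream() does), the chunk ranges could drive a batched write:

# untested sketch: build one array per coordinate chunk and write them
# as a single stream of record batches
chunks <- chunker(list(pts), nrow(pts))

batches <- lapply(seq_len(nrow(chunks)), function(i) {
  rows <- chunks$from[i]:chunks$to[i]
  # assumes data.frame-style conversion of the sf subset with the struct schema
  as_nanoarrow_array(pts[rows, ], schema = pts_schema)
})

basic_array_stream(batches, schema = pts_schema) |>
  write_nanoarrow(out)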

tim-salabim commented 1 month ago

Thanks @paleolimbot! The workaround (which I can live with) works like a charm! We can now render massive amounts of data in a few seconds! I'll post an example on mastodon once I have rounded off all the edges. This is awesome!