geoarrow / geoarrow-r

Extension types for geospatial data for use with 'Arrow'
http://geoarrow.org/geoarrow-r/
Apache License 2.0
148 stars 6 forks source link

`st_as_sf()` only works after `open_dataset()` but not after `read_parquet()` #51

Open PaulC91 opened 1 month ago

PaulC91 commented 1 month ago

Hey Dewey,

Thanks so much for your work on this, really excited to start using it!

I noticed that reading and converting a parquet file written with geoarrow to an sf object only works if you open the file with open_dataset() first, but not if you read it with read_parquet() (reprex below).

Maybe this is by design? But I thought I'd flag in case it's not!

Thanks, Paul

library(geoarrow)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> The repository you retrieved Arrow from did not include all of Arrow's features.
#> You can install a fully-featured version by running:
#> `install.packages('arrow', repos = 'https://apache.r-universe.dev')`.
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.9.0, PROJ 9.4.0; sf_use_s2() is TRUE

nc <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
tf <- tempfile(fileext = ".parquet")

nc |> 
  tibble::as_tibble() |> 
  write_parquet(tf)

# this works
open_dataset(tf) |> st_as_sf()
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> First 10 features:
#>     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
#> 2  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
#> 3  0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
#> 4  0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
#> 5  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
#> 6  0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7
#> 7  0.062     1.547  1834    1834      Camden 37029  37029       15   286     0
#> 8  0.091     1.284  1835    1835       Gates 37073  37073       37   420     0
#> 9  0.118     1.421  1836    1836      Warren 37185  37185       93   968     4
#> 10 0.124     1.428  1837    1837      Stokes 37169  37169       85  1612     1
#>    NWBIR74 BIR79 SID79 NWBIR79                           geom
#> 1       10  1364     0      19 MULTIPOLYGON (((-81.47276 3...
#> 2       10   542     3      12 MULTIPOLYGON (((-81.23989 3...
#> 3      208  3616     6     260 MULTIPOLYGON (((-80.45634 3...
#> 4      123   830     2     145 MULTIPOLYGON (((-76.00897 3...
#> 5     1066  1606     3    1197 MULTIPOLYGON (((-77.21767 3...
#> 6      954  1838     5    1237 MULTIPOLYGON (((-76.74506 3...
#> 7      115   350     2     139 MULTIPOLYGON (((-76.00897 3...
#> 8      254   594     2     371 MULTIPOLYGON (((-76.56251 3...
#> 9      748  1190     2     844 MULTIPOLYGON (((-78.30876 3...
#> 10     160  2038     5     176 MULTIPOLYGON (((-80.02567 3...

# this doesn't
read_parquet(tf) |> st_as_sf()
#> Error in st_sf(x, ..., agr = agr, sf_column_name = sf_column_name): no simple features geometry column present

# but it does if you convert the geom col first
read_parquet(tf) |> dplyr::mutate(geom = st_as_sfc(geom)) |> st_as_sf()
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> # A tibble: 100 × 15
#>     AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74 NWBIR74
#>    <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>   <dbl>
#>  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
#>  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
#>  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
#>  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
#>  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
#>  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
#>  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
#>  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
#>  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
#> 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
#> # ℹ 90 more rows
#> # ℹ 4 more variables: BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>,
#> #   geom <MULTIPOLYGON [°]>

Created on 2024-07-15 with reprex v2.0.2

paleolimbot commented 1 month ago

Thanks for opening!

This is not by design, per se, but I think it happens because st_as_sf() will only recognize columns that inherit from sfc. In order for this to work, it would have to apply some heuristics to recognize other geometry-like columns within an existing data.frame. For things that haven't yet been converted to a data.frame, we own the st_as_sf implementation, which is why the conversion works 🙂 .

I should probably open an issue in sf for this!

-dewey

PaulC91 commented 1 month ago

Makes sense! Thanks for clarifying 👍