geoarrow / geoarrow-r

Extension types for geospatial data for use with 'Arrow'
http://geoarrow.org/geoarrow-r/
Apache License 2.0

Handling of geoparquet when not loading `geoarrow` #28

Open kylebutts opened 11 months ago

kylebutts commented 11 months ago

First of all, thanks for this awesome work. It's been great to see the progress on all this :-)

In the README example, you load a .parquet file that contains a geometry column. Since there is no separate naming convention (e.g. .geo.parquet or .geoparquet), I might not know that there is a geometry column in there, so I would just load arrow and open the dataset as normal. Looking at the geometry column would then be confusing to me. The behavior differs depending on whether I have the geoarrow package loaded.

library(tidyverse)
library(arrow)

open_dataset("~/Desktop/nc.parquet") |>
  head(n = 1) |>
  pull(geometry, as_vector = TRUE)
#> <arrow_binary[1]>
#> [1] 01, 06, 00, 00, 00, 01, 00, 00, 00, 01, 03, 00, 00, 00, 01, 00, 00, 00, 1b, 00, 00, 00, 00, 00, 00, a0, 41, 5e, 54, c0, 00, 00, ...

library(geoarrow)
open_dataset("~/Desktop/nc.parquet") |>
  head(n = 1) |>
  pull(geometry, as_vector = TRUE)
#> <geoarrow_wkb[1]>
#> [1] MULTIPOLYGON (((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.27359, -81.63306 36.34069, -81.74107 36.39178, -81.69828 36.47178...

This issue might belong in the R arrow package instead, but I'm wondering if arrow should detect when a geometry column is present and adjust its behavior (the metadata is in there, so this information is known). For example, when calling collect(), should there be a warning that a geometry column is being collected and that geoarrow::st_collect() might be the better option (as in https://github.com/paleolimbot/geoarrow/issues/21)? Or a warning when opening a geoparquet file without geoarrow loaded?

library(tidyverse)
library(arrow)

nc = open_dataset("~/Desktop/nc.parquet") 
# We know there is a geometry from the metadata
nc$metadata[[1]]
#> [1] "{\"version\":\"0.3.0\",\"primary_column\":\"geometry\",\"columns\":{\"geometry\":{\"encoding\":\"WKB\",\"crs\":\"GEOGCS[\\\"NAD27\\\",DATUM[\\\"North_American_Datum_1927\\\",SPHEROID[\\\"Clarke 1866\\\",6378206.4,294.978698213898]],PRIMEM[\\\"Greenwich\\\",0],UNIT[\\\"degree\\\",0.0174532925199433,AUTHORITY[\\\"EPSG\\\",\\\"9122\\\"]],AXIS[\\\"Latitude\\\",NORTH],AXIS[\\\"Longitude\\\",EAST],AUTHORITY[\\\"EPSG\\\",\\\"4267\\\"]]\",\"bbox\":[-84.3239,33.882,-75.457,36.5896],\"geometry_type\":\"MultiPolygon\"}}}"
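
A minimal sketch of the kind of check described above, assuming the GeoParquet convention of storing the geometry metadata under a "geo" key in the file-level metadata. The helpers has_geo_metadata() and warn_if_geo() are hypothetical names used for illustration; they are not part of arrow or geoarrow.

```r
# Hypothetical helper: TRUE if the metadata list has a "geo" entry.
# Assumes GeoParquet stores its metadata under the key "geo".
has_geo_metadata <- function(metadata) {
  "geo" %in% names(metadata)
}

# Hypothetical helper: warn when geometry metadata is present but
# the geoarrow namespace has not been loaded.
warn_if_geo <- function(metadata,
                        geoarrow_loaded = isNamespaceLoaded("geoarrow")) {
  has_geo <- has_geo_metadata(metadata)
  if (has_geo && !geoarrow_loaded) {
    warning(
      "This file has geometry metadata; consider loading geoarrow first.",
      call. = FALSE
    )
  }
  invisible(has_geo)
}

# Possible usage, assuming arrow is installed and the file exists:
# nc <- arrow::open_dataset("~/Desktop/nc.parquet")
# warn_if_geo(nc$metadata)
```

The check only inspects key names, so it does not need to parse the JSON payload or know anything about the geometry encoding.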

paleolimbot commented 11 months ago

First, just a note that a rewrite is in progress and should be available in January! The new package currently lives here: https://github.com/geoarrow/geoarrow-c/tree/main/r/geoarrow but may get moved to a less confusing location (like geoarrow/geoarrow-r). I'm just in the process of working with the extension type registration ( https://github.com/geoarrow/geoarrow-c/pull/85 ) so this is well-timed!

The issue of automatic loading is a tricky one...the arrow package probably shouldn't load arbitrary packages (as in, if we somehow encoded "r_pkgs" in the metadata or something), and while it could special-case the geoarrow package once it is on CRAN, special-casing things can become unwieldy.

In any case, the first step is geoarrow on CRAN 🙂 ...I'm working on it!

mrworthington commented 10 months ago

The new package currently lives here: https://github.com/geoarrow/geoarrow-c/tree/main/r/geoarrow but may get moved to a less confusing location

Hi Dewey + Team! Trying to play with geoarrow for a project, but am not finding the new package you referenced. I clicked on the link above, but it just shows a "404 page not found" error. Hoping to use it in combination with open_dataset() on a shiny app I'm spinning up! For context, I've installed this current version of {geoarrow-r}, but am assuming this is not the one that you want people to be using.

kylebutts commented 10 months ago

@mrworthington This pull request suggests they were moved back to this repo 2 weeks ago: https://github.com/geoarrow/geoarrow-c/pull/89

paleolimbot commented 10 months ago

This is indeed the version that I'd like people to be using; however, it is missing the read_geoparquet_sf() function ( https://github.com/geoarrow/geoarrow-r/pull/30 ). If you need the previous version, I tagged it as 0.1.0.

Development did start out in geoarrow-c, but ultimately I found that it made more sense to keep it on its own (hence, geoarrow-r!).

jaredlander commented 9 months ago

This is indeed the version that I'd like people to be using; however, it is missing the read_geoparquet_sf() function ( #30 ). If you need the previous version, I tagged it as 0.1.0.

Going forward, am I correct that we won't need read_geoparquet_sf() and can just use read_parquet() instead? And if so, will the result automatically become an sf object? Currently, with version 0.1.0.9000, I have to run read_parquet('file.parquet') |> geoarrow:::st_as_sf.Dataset(), because if I don't use geoarrow:::st_as_sf.Dataset() I get the following error:

Error in st_geometry.sf(x) : 
  attr(obj, "sf_column") does not point to a geometry column.
Did you rename it, without setting st_geometry(obj) <- "newname"?

paleolimbot commented 9 months ago

If you are only reading/writing Parquet files in R (with geoarrow loaded) and/or Python (after import geoarrow.pyarrow), you can just use write_parquet() and read_parquet(). This is not GeoParquet...it's just regular Parquet with extension types. This means that something like GDAL won't be able to understand it (yet) and uploading it to a cloud data warehouse won't work. The upside of not using GeoParquet is that more arrow tools work out-of-the-box (e.g., multi-file datasets via write_dataset()/open_dataset() in R or Python).

If you need to read with GDAL or some other tool, I would recommend using geoarrow::read_geoparquet_sf() (or geoarrow::read_geoparquet()) and geoarrow::write_geoparquet() going forward; however, I still have to finish the implementation (#30).

if I don't use geoarrow:::st_as_sf.Dataset() I get the following error:

I think you might want read_parquet(f, as_data_frame = FALSE) + st_as_sf(). I think the problem is that sf doesn't know that a lazy geoarrow column is "geometry". Eventually it probably will but the details of that are complicated and for now you'll have to help it.
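
A rough sketch of that workaround as a round trip, assuming arrow, geoarrow, sf, and tibble are installed and that the installed geoarrow version provides an st_as_sf() method for Arrow tables. The temporary file and the nc example data from sf are illustrative choices, not taken from this thread.

```r
library(arrow)
library(geoarrow)  # loading it registers the GeoArrow extension types
library(sf)

# Example data shipped with sf; any sf object should do.
nc <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"))

# Write regular Parquet with extension types (this is not GeoParquet).
tmp <- tempfile(fileext = ".parquet")
write_parquet(tibble::as_tibble(nc), tmp)

# Keep the result as an Arrow Table instead of a data frame...
tbl <- read_parquet(tmp, as_data_frame = FALSE)

# ...and convert it to sf explicitly, since sf does not yet
# recognize the lazy geoarrow column as geometry on its own.
nc_sf <- st_as_sf(tbl)
class(nc_sf)
```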

jaredlander commented 9 months ago

Thanks for the info!

If you need to read with GDAL or some other tool, I would recommend using geoarrow::read_geoparquet_sf() (or geoarrow::read_geoparquet()) and geoarrow::write_geoparquet() going forward; however, I still have to finish the implementation (https://github.com/geoarrow/geoarrow-r/pull/30).

So it sounds like geoarrow::write_geoparquet() and friends are coming back? So I can install with renv::install('geoarrow/geoarrow-r@v0.1.0') which gets me 0.1.0 instead of renv::install('geoarrow/geoarrow-r') which gets me 0.1.0.9000?

I'm using this to write parquet files to map with geoarrow/deck.gl layers (as opposed to pmtiles). The README says

Pass -lco GEOMETRY_ENCODING=GEOARROW when converting to Arrow or Parquet files in order to store geometries in a GeoArrow-native geometry column.

Likewise, this post says

Notice the GEOMETRY_ENCODING=GEOARROW? This file isn't quite valid GeoParquet, at least as of version 1.0, because it stores geometries in the efficient Arrow-native encoding instead of as WKB geometries.

This is needed for now because parquet-wasm doesn't have a way to parse WKB geometries into Arrow-native geometries. (A @geoarrow/geoparquet-wasm library is likely to be published by the end of 2023 that will parse any GeoParquet file and load it to GeoArrow.)

So I'm guessing I need to use geoarrow::write_geoparquet()? Which I get using 0.1.0, correct?

I think you might want read_parquet(f, as_data_frame = FALSE) + st_as_sf(). I think the problem is that sf doesn't know that a lazy geoarrow column is "geometry". Eventually it probably will but the details of that are complicated and for now you'll have to help it.

Yep, that fixed it!

paleolimbot commented 9 months ago

So it sounds like geoarrow::write_geoparquet() and friends are coming back?

Yes! With proper conformance to the 1.0.0 spec. The 1.0.0 spec doesn't include GeoArrow as an encoding option - it's WKB only - and there's some debate over whether it should be there in the first place.

So I'm guessing I need to use geoarrow::write_geoparquet()? Which I get using 0.1.0, correct?

I actually have no idea. I think maybe write_parquet() will work, but you might have to explicitly tell it to use interleaved coordinates. Off the top of my head I forget exactly how to do that but I'll try to circle back with an example.