geoarrow / geoarrow-python

Python implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-python/
Apache License 2.0

Wire up interface for to/from geopandas #7

Closed paleolimbot closed 11 months ago

paleolimbot commented 11 months ago

Wires up an interface (and leaves future PRs to handle integration with to/from ragged array).

For pandas integration:

import geoarrow.pandas as _
import pandas as pd

df = pd.read_parquet("https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-basin_point.parquet")
df.geometry.geoarrow.to_geopandas()
#> 0     POINT (277022.694 4820886.610)
#> 1     POINT (315701.255 4855051.379)
#> 2     POINT (255728.660 4851022.108)
#> 3     POINT (245206.784 4895609.410)

df.geometry[0].to_shapely()
#> POINT (277022.6936181751 4820886.609673489)

For pyarrow integration:

import geoarrow.pyarrow as ga

array = ga.array(["POINT (30 10)"])
ga.to_geopandas(array)
#> 0    POINT (30.00000 10.00000)
#> dtype: geometry
array[0].to_shapely()
#> <POINT (30 10)>
codecov[bot] commented 11 months ago

Codecov Report

Merging #7 (7f75f9f) into main (acdb60c) will increase coverage by 94.37%. The diff coverage is 95.45%.

@@            Coverage Diff            @@
##           main       #7       +/-   ##
=========================================
+ Coverage      0   94.37%   +94.37%     
=========================================
  Files         0        9        +9     
  Lines         0     1262     +1262     
=========================================
+ Hits          0     1191     +1191     
- Misses        0       71       +71     
Files Changed Coverage Δ
geoarrow-pyarrow/src/geoarrow/pyarrow/__init__.py 85.00% <ø> (ø)
geoarrow-pyarrow/src/geoarrow/pyarrow/_scalar.py 68.88% <91.66%> (ø)
geoarrow-pandas/src/geoarrow/pandas/lib.py 93.50% <100.00%> (ø)
geoarrow-pyarrow/src/geoarrow/pyarrow/_compute.py 98.00% <100.00%> (ø)

... and 5 files with indirect coverage changes

kylebarron commented 11 months ago

It would be interesting to benchmark the conversion via WKB and via to/from_ragged_array. My impression is that the current shapely implementation for to/from ragged array is not the highest performance.

paleolimbot commented 11 months ago

For benchmarking, I get:

import numpy as np
import geoarrow.pyarrow as ga

coords = np.random.random(int(2e6)).reshape((int(1e6), 2))

array = ga.point().with_coord_type(ga.CoordType.INTERLEAVED).from_geobuffers(
    None, 
    coords.reshape((int(2e6)))
)

%timeit ga.to_geopandas(array)
#> 180 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

from shapely import from_ragged_array, GeometryType

%timeit from_ragged_array(GeometryType.POINT, coords)
#> 161 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
kylebarron commented 11 months ago

So from_ragged_array is ~10% faster than going through WKB?

paleolimbot commented 11 months ago

I think so!

jorisvandenbossche commented 11 months ago

It would be interesting to benchmark the conversion via WKB and via to/from_ragged_array. My impression is that the current shapely implementation for to/from ragged array is not the highest performance.

Yes, that's known, see the bullet point "performance" at https://github.com/shapely/shapely/pull/1559 (where at the time I also mentioned it has roughly the same performance as WKB conversion).

That's just because the initial implementation was the "easy" way, reusing existing functions in some python code (and thus looping over the objects several times). I do have a draft that pushes this into cython that does the actual conversion in a single loop (or with a second loop to count coordinates in case of to_ragged_array).
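The counting-then-filling idea described here can be sketched in pure Python/NumPy. Plain coordinate lists stand in for GEOS geometries, and the function name is illustrative, not shapely's actual API:

```python
import numpy as np

def to_ragged(geoms):
    """Two-pass sketch of a to_ragged_array-style conversion for
    multipoint-like inputs (each geom is a list of (x, y) pairs)."""
    # Pass 1: count coordinates so the buffers can be allocated once.
    n_coords = sum(len(g) for g in geoms)
    coords = np.empty((n_coords, 2), dtype=np.float64)
    offsets = np.empty(len(geoms) + 1, dtype=np.int64)

    # Pass 2: fill both buffers in a single loop over the geometries.
    offsets[0] = 0
    i = 0
    for j, g in enumerate(geoms):
        for xy in g:
            coords[i] = xy
            i += 1
        offsets[j + 1] = i
    return coords, offsets

coords, offsets = to_ragged([[(0, 0), (1, 1)], [(2, 2)]])
# offsets -> [0, 2, 3]
```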

kylebarron commented 11 months ago

I do have a draft that pushes this into cython that does the actual conversion in a single loop (or with a second loop to count coordinates in case of to_ragged_array).

I'd be interested in optimizing this! Should I take https://github.com/pygeos/pygeos/pull/93 as inspiration?

paleolimbot commented 11 months ago

That's just because the initial implementation was the "easy" way, reusing existing functions in some python code (and thus looping over the objects several times). I do have a draft that pushes this into cython that does the actual conversion in a single loop (or with a second loop to count coordinates in case of to_ragged_array).

This is going to come up in R, too. You all don't need to tackle it, but there's no reason there can't be a common C wrapper around it all 🙂

paleolimbot commented 11 months ago

I also know that you both know this, but before anybody spends a lot of time optimizing geopandas--geoarrow conversion, it is worth mentioning that the number one optimization anybody can do today is to avoid shapely entirely (i.e., via pyarrow's parquet.read_table() or pyogrio.raw.read_table()).

kylebarron commented 11 months ago

but there's no reason there can't be a common C wrapper around it all

yeah... I'd probably be able to update the cython parts of shapely, but I definitely couldn't write a C library to do this, and don't understand how the integration parts with GEOS would work.

optimizing geopandas--geoarrow conversion

yeah.. but in the short term, people have their data in GeoPandas and they're using it every day, so optimizing GeoPandas -> GeoArrow will make all this kind of work (through GeoArrow) easier to use

jorisvandenbossche commented 11 months ago

On the shapely side, we need to work with shapely.Geometry objects, and so that needs to be written in shapely itself (there is a C API, but it's not really/officially public at the moment, although you can use it to test things outside of shapely). It's true that the core algorithm (from buffers to GEOSGeometry objects, or the other way around) could be written in C and shared, though (but then rather something to vendor instead of a dependency?).

I'd be interested in optimizing this! Should I take pygeos/pygeos#93 as inspiration?

It's only on my longer term to do list, so feel free to take a look at it! (and I can promise swift reviews if you would do a PR ;)) That PR is indeed the inspiration, will check tomorrow if I have something more complete locally (that PR is only for MultiPolygons and only geos->geoarrow direction)

it is worth mentioning that the number one optimization anybody can do today is to avoid shapely entirely (i.e., via pyarrow's parquet.read_table() or pyogrio.raw.read_table()).

We should also enable this in geopandas itself at some point, by allowing storage of "raw" geoarrow data with delayed conversion to GEOS. But anyway, whenever you do an actual geospatial operation that needs GEOS, you need this conversion, so it's still very much worth optimizing IMO

kylebarron commented 11 months ago

That PR is indeed the inspiration, will check tomorrow if I have something more complete locally (that PR is only for MultiPolygons and only geos->geoarrow direction)

Probably warrants a bit more discussion of the right approach before trying to write a PR. It sounds pretty similar to what I do here, where my first pass computes the number of coords, rings, and geometries in an array of polygons, and the second pass allocates the arrays once and fills them. But crucially this assumes you already know the geometries are all polygons.

So in shapely, would the best approach be to first iterate and infer the output array type, and then make two more passes to count coords and fill array values? It seems possible to consolidate the first two passes into one, but the resulting code would probably be more complicated.
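That "infer the output array type" pass could look something like the sketch below. The type codes and the mixed/WKB fallback marker are assumptions for illustration, not shapely's actual API:

```python
from enum import Enum

class GeomType(Enum):
    """Illustrative geometry type codes for the inference pass."""
    MIXED = 0  # signals a fallback to WKB encoding
    POINT = 1
    LINESTRING = 2
    POLYGON = 3

def infer_common_type(type_codes):
    # Single scan over per-geometry type codes: if every geometry has
    # the same type, that is the output array type; otherwise fall
    # back to the mixed/WKB marker.
    types = set(type_codes)
    if len(types) == 1:
        return next(iter(types))
    return GeomType.MIXED
```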

paleolimbot commented 11 months ago

Maybe something like:

class GEOSToGeoArrowBuilder:

    def __init__(self, output_type):
        pass

    def reserve_coords(self, n):
        pass

    def reserve_offsets(self, n, level):
        pass

    def append(self, array_of_shapely):
        pass

    def append_unsafe(self, array_of_shapely):
        # Probably faster if you know that you have reserved properly
        pass

Calculating the output type is probably best done in advance (the caller may already know it). geoarrow.pyarrow.infer_type_common() can do this for WKB or WKT, which is the current route by which geoarrow.pyarrow.as_geoarrow() should correctly guess a type (or fall back on WKB if there is no common type).
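For concreteness, here is a toy working version of the point case of such a builder. The method names follow the pseudocode above; everything else (the NumPy backing buffer, the finish() method, the coordinate-tuple inputs standing in for shapely points) is an assumption for illustration:

```python
import numpy as np

class PointBuilder:
    """Toy stand-in for the proposed GEOSToGeoArrowBuilder, point case only."""

    def __init__(self):
        self._coords = np.empty((0, 2), dtype=np.float64)
        self._n = 0

    def reserve_coords(self, n):
        # Grow the backing buffer up front so appends don't reallocate.
        if self._n + n > len(self._coords):
            grown = np.empty((self._n + n, 2), dtype=np.float64)
            grown[: self._n] = self._coords[: self._n]
            self._coords = grown

    def append(self, xys):
        # xys: iterable of (x, y) pairs (stand-in for shapely points).
        for x, y in xys:
            self.reserve_coords(1)
            self._coords[self._n] = (x, y)
            self._n += 1

    def finish(self):
        # Not in the pseudocode above; returns the filled coord buffer.
        return self._coords[: self._n].copy()
```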