geoarrow / geoarrow-python

Python implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-python/
Apache License 2.0

Pandas integration does not symmetrically store and load with feather format #44

Open EternalDeiwos opened 6 months ago

EternalDeiwos commented 6 months ago

I am playing around with the geoarrow.pandas integration and found something odd: if I load a data frame containing a geometry column from a Feather file, it loads and displays the geometry correctly, but I am unable to do anything with it. Anything I try (e.g. df.geometry.geoarrow.*) produces the following error:

TypeError: Can't create geoarrow.array from Arrow array of type None

I created the file like this:

import geoarrow.pyarrow as ga
import geoarrow.pandas as _
import pandas as pd
import numpy as np

# np.random.rand takes the dimensions as separate arguments, not as a tuple
points = np.random.rand(1 << 20, 2)

df = pd.DataFrame({
    "geometry": ga.point().from_geobuffers(
        None,
        points[:, 0],
        points[:, 1]
    )
})

df.to_feather('points.feather')

and I load the file like this:

import geoarrow.pyarrow as ga
import geoarrow.pandas as _
import pandas as pd

df = pd.read_feather("points.feather")

# Example operations that produce the above error
df.astype({ 'geometry': 'geoarrow.wkt' })
x, y = df.geometry.geoarrow.point_coords()
# etc.

paleolimbot commented 6 months ago

Good catch! I haven't opened up the pandas integration project for a while and it may be that some of my assumptions when I wrote the initial version are no longer valid! Other than general time constraints, one of the reasons I haven't put much effort into this part of the repo is that GeoPandas is considering allowing a GeoArrow storage type along these lines, and if that's the case, I'd want geoarrow-pyarrow to just return GeoPandas objects.

(In the meantime I should definitely fix this, though!)

EternalDeiwos commented 6 months ago

Thanks. As I said, I am just playing with it, so no pressure from my side if this is going to change substantially in the future.

From my first impression, it is a lot easier to understand at a glance what geoarrow.pandas is doing under the hood than the equivalent GeoPandas. I hope that, wherever this lands, it will be just as easy to directly access the underlying buffers.