geoarrow / geoarrow-python

Python implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-python/
Apache License 2.0

Error initializing a geoarrow table from pyarrow.lib.ChunkedArray #33

Closed ingenieroariel closed 11 months ago

ingenieroariel commented 11 months ago

I am trying to load a CSV into a GeoArrow table manually using pyarrow, but I got an error:

import gzip
import geoarrow.pyarrow as ga
import pyarrow.csv as pv

with gzip.open("/Users/x/data/points_s2_level_4_gzip/397_buildings.csv.gz") as fp:
    table = pv.read_csv(fp)

points = ga.point().from_geobuffers(None, table["latitude"], y=table["longitude"])

[Image: Screenshot 2023-10-27 at 10 15 48 AM]

ingenieroariel commented 11 months ago

I also tried:

points = ga.point().from_geobuffers(None, table["latitude"].combine_chunks(), y=table["longitude"].combine_chunks())

and got:

TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 points = ga.point().from_geobuffers(None, table["latitude"].combine_chunks(), y=table["longitude"].combine_chunks())

File ~/tmp/lib/python3.11/site-packages/geoarrow/pyarrow/_type.py:289, in PointType.from_geobuffers(self, validity, x, y, z_or_m, m)
    280 def from_geobuffers(self, validity, x, y=None, z_or_m=None, m=None):
    281     buffers = [
    282         (0, "uint8", validity),
    283         (1, "double", x),
   (...)
    286         (4, "double", m),
    287     ]
--> 289     return self._from_geobuffers_internal(buffers)

File ~/tmp/lib/python3.11/site-packages/geoarrow/pyarrow/_type.py:94, in GeometryExtensionType._from_geobuffers_internal(self, args)
     92         continue
     93     else:
---> 94         builder.set_buffer_double(i, buf)
     96 carray = builder.finish()
     97 return pa.Array._import_from_c(carray._addr(), self)

File src/geoarrow/c/_lib.pyx:674, in geoarrow.c._lib.CBuilder.set_buffer_double()

TypeError: memoryview: a bytes-like object is required, not 'pyarrow.lib.DoubleArray'

paleolimbot commented 11 months ago

That's a great point! I don't think we have a shortcut for point creation from chunked arrays yet. The workaround is:

import pyarrow as pa
import geoarrow.pyarrow as ga

tbl = pa.table([pa.array([0.0, 1.0]), pa.array([1.0, 2.0])], names=["x", "y"])

struct_chunks = []
for x_chunk, y_chunk in zip(tbl["x"].chunks, tbl["y"].chunks):
    struct_chunk = pa.StructArray.from_arrays([x_chunk, y_chunk], names=["x", "y"])
    struct_chunks.append(struct_chunk)

points = ga.point().wrap_array(pa.chunked_array(struct_chunks))
points
#> <pyarrow.lib.ChunkedArray object at 0x1247f1a80>

points.type
#> PointType(geoarrow.point)

jorisvandenbossche commented 11 months ago

It's not only chunked arrays, but pyarrow arrays in general that don't work for from_geobuffers. Another workaround for now is to convert each column to a numpy array:

points = ga.point().from_geobuffers(None, table["latitude"].to_numpy(), y=table["longitude"].to_numpy())

jorisvandenbossche commented 11 months ago

BTW, note that you should switch around the order of latitude and longitude! (geoarrow always uses x/y or lon/lat order, regardless of the coordinate reference system)

ingenieroariel commented 11 months ago

The struct-chunk workaround from paleolimbot above worked for me, and it was super fast. This tech is magic.

paleolimbot commented 11 months ago

Reopening because we should really have this helper in geoarrow.pyarrow!