geoarrow / geoarrow-python

Python implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-python/
Apache License 2.0

Filter geometries based on type #46

Open RaczeQ opened 3 months ago

RaczeQ commented 3 months ago

Hi, I'm wondering if it would be possible to have a WkbType column and filter geometries based on a given type (Point, LineString, Polygon, etc.). There are some compute functions available, and there is even unique_geometry_types, but I'm not sure whether any of those could help with my use case.

kylebarron commented 3 months ago

If you can access the indices of each geometry type, then you can do something like in https://github.com/developmentseed/lonboard/issues/491 with pyarrow.Table.take instead of DataFrame.iloc

paleolimbot commented 3 months ago

You're definitely right that something like geoarrow.pyarrow.geometry_type(x) (returning something the same length as x) would be a very helpful compute function for a lot of reasons. We clearly have the ability to do this efficiently and generically (since we can already compute the unique geometry types); it's just not wired up yet. In the meantime, it's possible to do this using purely pyarrow compute:

import geoarrow.pyarrow as ga
import pyarrow as pa
import pyarrow.compute as pc

wkbs = ga.as_wkb(["POINT (0 1)", "LINESTRING Z (0 0 1, 1 1 2)", "MULTIPOINT (0 0, 1 1)"])

# Doesn't work with nulls
assert wkbs.null_count == 0

# Only works with little-endian WKB
endian_byte = pc.binary_slice(wkbs.storage, 0, 1)
endian = pa.Array.from_buffers(pa.int8(), len(endian_byte), [endian_byte.buffers()[0], endian_byte.buffers()[2]])
assert pc.all(pc.equal(endian, 1)).as_py()

wkb_type_bytes = pc.binary_slice(wkbs.storage, 1, 5)
geometry_type = pa.Array.from_buffers(pa.uint32(), len(wkb_type_bytes), [wkb_type_bytes.buffers()[0], wkb_type_bytes.buffers()[2]])

# If you're expecting EWKB, mask off the high (flag) bits first,
# otherwise the modulo below gives the wrong answer
mask = pa.scalar(0x00FFFFFF, pa.uint32())
geometry_type = pc.bit_wise_and(geometry_type, mask)

# If you're expecting ISO Z/M/ZM WKB (codes like 1002), reduce modulo 1000.
# pc.divide on integers is integer division, so this is geometry_type % 1000.
one_thousand = pa.scalar(1000, pa.uint32())
geometry_type = pc.subtract(geometry_type, pc.multiply(pc.divide(geometry_type, one_thousand), one_thousand))

geometry_type
#> <pyarrow.lib.UInt32Array object at 0x1135c5de0>
#> [
#>   1,
#>   2,
#>   4
#> ]
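
For comparison, here is a minimal pure-Python sketch of the same header parsing that also tolerates nulls and big-endian WKB. It uses only the standard library; `wkb_geometry_type` is a hypothetical helper name, not part of geoarrow, and it will be much slower than the vectorized version above.

```python
import struct


def wkb_geometry_type(wkb):
    """Return the base geometry type code (1 = Point, 2 = LineString, ...)
    from a WKB header, or None for a null or too-short value."""
    if wkb is None or len(wkb) < 5:
        return None
    # Byte 0 is the byte order flag: 1 = little-endian, 0 = big-endian
    fmt = "<I" if wkb[0] == 1 else ">I"
    (code,) = struct.unpack_from(fmt, wkb, 1)
    # Drop the EWKB flag bits first, then the ISO Z/M offsets (multiples of 1000)
    return (code & 0x00FFFFFF) % 1000


# Hand-built values for illustration only
point_le = struct.pack("<BIdd", 1, 1, 0.0, 1.0)    # little-endian POINT (0 1)
linestring_z_be = struct.pack(">BI", 0, 1002)      # big-endian ISO LINESTRING Z header
point_z_ewkb = struct.pack("<BI", 1, 0x80000001)   # EWKB POINT with the Z flag set

geoms = [point_le, linestring_z_be, None, point_z_ewkb]
codes = [wkb_geometry_type(g) for g in geoms]

# Keep only the points
points = [g for g, code in zip(geoms, codes) if code == 1]
```

With the example above, `wkbs.storage.to_pylist()` should give the `bytes` values to feed into this helper.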