geoarrow / geoarrow-python

Python implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-python/
Apache License 2.0

feat(geoarrow-pyarrow): Implement single-geometry encodings for geoparquet writer #41

Closed paleolimbot closed 4 months ago

paleolimbot commented 9 months ago
import geoarrow.pyarrow as ga
from geoarrow.pyarrow import io
from pyarrow import parquet

# Read file
points_tbl = io.read_pyogrio_table(
    "https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_junc.fgb.zip",
    columns=["geometry"],
)

# Encode as point instead of multipoint
points_tbl = points_tbl.set_column(
    0, "geometry", ga.make_point(*ga.point_coords(points_tbl["geometry"]))
)

# Generally faster to do things with actual values (i.e., skip WKB parsing)
def read_and_do_something_with_values(f):
    tbl = io.read_geoparquet_table(f)
    ga.box_agg(tbl["geometry"])

io.write_geoparquet_table(points_tbl, "test.parquet", geometry_encoding="WKB")
%timeit read_and_do_something_with_values("test.parquet")
#> 10.6 ms ± 87.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

io.write_geoparquet_table(
    points_tbl,
    "test.parquet",
    geometry_encoding=io.geoparquet_encoding_geoarrow()
)
%timeit read_and_do_something_with_values("test.parquet")
#> 3.06 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

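# Column statistics are built in: x and y are written as plain Parquet
# columns, so each row group carries their min/max values
io.write_geoparquet_table(
    points_tbl,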
    "test.parquet",
    geometry_encoding=io.geoparquet_encoding_geoarrow(),
    row_group_size=4 * 65535
)

pq_file = parquet.ParquetFile("test.parquet")
col_index = pq_file.schema_arrow.get_field_index("geometry")
boxes = []
for i in range(pq_file.num_row_groups):
    metadata = pq_file.metadata.row_group(i)
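    # Row-group metadata indexes Parquet leaf columns; x and y are adjacent leaves.
    # Reusing the Arrow field index works here because no nested column
    # precedes "geometry" in the schema.
    stats_x = metadata.column(col_index).statistics
    stats_y = metadata.column(col_index + 1).statistics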

    if stats_x is None or stats_y is None:
        boxes.append(None)
    else:
        boxes.append(
            {
                "xmin": stats_x.min,
                "xmax": stats_x.max,
                "ymin": stats_y.min,
                "ymax": stats_y.max,
            }
        )

boxes
#> [{'xmin': 229121.2503498519,
#>   'xmax': 758291.4287697164,
#>   'ymin': 4807526.337270797,
#>   'ymax': 5190760.16706287},
#>  {'xmin': 302024.3820995989,
#>   'xmax': 716864.4139822095,
#>   'ymin': 4919890.0309867,
#>   'ymax': 5234346.876916235}]
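
The per-row-group boxes above could be used to skip row groups that cannot intersect a query window. The following is only a minimal sketch of that idea, not part of this PR: `row_groups_intersecting` is a hypothetical helper, and the box values are rounded copies of the output above.

```python
def row_groups_intersecting(boxes, xmin, ymin, xmax, ymax):
    """Return indices of row groups whose bounding box may overlap the query.

    `boxes` is a list of dicts with xmin/xmax/ymin/ymax keys, or None where
    a row group has no statistics (hypothetical helper for illustration).
    """
    keep = []
    for i, box in enumerate(boxes):
        if box is None:
            # No statistics: the row group cannot be ruled out, so keep it.
            keep.append(i)
            continue
        disjoint = (
            box["xmax"] < xmin
            or box["xmin"] > xmax
            or box["ymax"] < ymin
            or box["ymin"] > ymax
        )
        if not disjoint:
            keep.append(i)
    return keep


# Boxes rounded from the output above
boxes = [
    {"xmin": 229121.25, "xmax": 758291.43, "ymin": 4807526.34, "ymax": 5190760.17},
    {"xmin": 302024.38, "xmax": 716864.41, "ymin": 4919890.03, "ymax": 5234346.88},
]

# A window north of the first row group's ymax only matches the second one
print(row_groups_intersecting(boxes, 400000, 5200000, 500000, 5230000))
#> [1]
```

The resulting indices could then be passed to `pq_file.read_row_groups(...)` so that only candidate row groups are read. Rows inside a kept row group still need an exact filter, since the boxes only bound each group.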
codecov[bot] commented 9 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.98%. Comparing base (0a95d5f) to head (0725b24).

:exclamation: Current head 0725b24 differs from the pull request's most recent head 5d9d519. Consider uploading reports for commit 5d9d519 to get more accurate results.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #41      +/-   ##
==========================================
+ Coverage   95.62%   95.98%   +0.36%
==========================================
  Files          10       10
  Lines        1462     1496      +34
==========================================
+ Hits         1398     1436      +38
+ Misses         64       60       -4
```
