OvertureMaps / overturemaps-py

overture-py
MIT License

Using DuckDB #14

Closed cheginit closed 4 months ago

cheginit commented 4 months ago

While updating a blog post I wrote a while back on subsetting Overture data using DuckDB, I stumbled upon this package and noticed that it uses pyarrow. I thought you might be interested in exploring DuckDB as an alternative approach, as it might have some benefits, especially for large requests. Here's the link to my short blog post containing the code.

jwass commented 4 months ago

Thanks @cheginit. We are big fans of duckdb and have some documentation about how to use it to get data out https://docs.overturemaps.org/getting-data/locally/. I opted to use pyarrow here though. I'll close this out.

By the way, I think the query in your post has a bug:

    SELECT
        data.*,
        ST_GeomFromWKB(data.geometry) as geometry,
    FROM data_view AS data
...

will have two geometry columns, since data.* already includes one. I ran it and verified that the output parquet file has both a geometry and a geometry_1 column. I think you can use DuckDB's EXCLUDE clause to omit geometry from the data.* part. But since geometry is already WKB, I don't think you need anything other than data.* there.

cheginit commented 4 months ago

Thanks for catching the bug and for the link. I didn't realize there was one!

May I ask why you opted for pyarrow?