marklit opened this issue 1 year ago
If it's worth anything, osm2pgsql, the OSM project's tool for importing PBFs into PostgreSQL, will run out of RAM on a 64 GB system with default settings: https://github.com/openstreetmap/osm2pgsql/issues/1954. I'm trying their --slim and --drop flags to see if they help.
@marklit Not sure if you're interested - but as part of the Daylight Map project (daylightmap.org) we publish parquet files including geometries of the underlying OSM data. See https://daylightmap.org/daylight-osm-distribution-parquet.html for more. See also https://aws.amazon.com/blogs/publicsector/querying-daylight-openstreetmap-distribution-amazon-athena/
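For anyone who would rather pull those Parquet files straight into DuckDB instead of Athena, a rough sketch (the S3 path below is a placeholder; the actual bucket and release layout are listed on the daylightmap.org page above, and the httpfs extension is assumed):

INSTALL httpfs;
LOAD httpfs;
-- Placeholder path: substitute the bucket/release listed on daylightmap.org
CREATE TABLE daylight_sample AS
SELECT *
FROM read_parquet('s3://<daylight-bucket>/<release>/*.parquet')
LIMIT 1000;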
Thanks @jwass. Funnily enough, I came across that dataset a few weeks ago when I was looking at the canopy one Meta published for California.
The following worked but took 5.5 days to complete.
$ osm2pgsql -d overture --slim --drop planet-2023-04-02-alpha.osm.pbf
@marklit when working in a low-memory environment, I've started using the open option USE_CUSTOM_INDEXING=NO. It might not be as fast, but it got the job done without crashing the database.
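For reference, a sketch of passing that open option through st_read (the file and layer names here are just placeholders for whatever you're importing):

-- USE_CUSTOM_INDEXING=NO is a GDAL OSM driver open option; slower,
-- but it traded speed for not running out of memory in low-memory runs
CREATE OR REPLACE TABLE osm_lines AS
SELECT *
FROM st_read(
    'planet-2023-04-02-alpha.osm.pbf',
    layer='lines',
    sequential_layer_scan=true,
    open_options=['USE_CUSTOM_INDEXING=NO']
);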
FWIW I'm going to look at better supporting this use case tomorrow.
Depending on how you plan to use OSM data, you may or may not find the progress in duckdb/duckdb_spatial#90 useful. It's super fast, but it doesn't get you "all the way" in the sense that it doesn't produce complete GEOMETRY objects. We're thinking about how to actually perform the geometry reconstruction in an efficient and easy-to-use way, but it's most likely going to require significantly more work.
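For anyone who wants to experiment with the raw reader from that PR, a rough sketch, assuming the st_readosm table function it adds (the exact output schema may differ between versions):

-- Raw decoded PBF records: nodes, ways and relations, no geometry construction
SELECT *
FROM st_readosm('planet-2023-04-02-alpha.osm.pbf')
LIMIT 10;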
The Overture canopy dataset has been taking weeks to process with PG. I'm going to try and find a DuckDB-only or DuckDB + Shapely way of getting this data into BQ. I'll report back with any findings I come across.
I tried loading a remote PBF file, but the following query failed:
CREATE OR REPLACE TABLE osm_points_monaco AS SELECT * FROM st_read('/vsicurl/https://download.geofabrik.de/europe/monaco-latest.osm.pbf', open_options=['INTERLEAVED_READING=YES'], layer='points', sequential_layer_scan=true);
with
IO Error: Could not open file: /vsicurl/https://download.geofabrik.de/europe/monaco-latest.osm.pbf (Could not parse configuration file for OSM import)
Do you have an idea on how to solve this @marklit / @Maxxen?
@tobilg check out osmconf.ini, described here: https://gdal.org/drivers/vector/osm.html. There is also a native OSM PBF reader if you can live with raw objects (you have to build the geometries yourself).
@ttomasz thanks for the fast reply! I was assuming that it's already included (somehow) in the build, as the above example didn't use it. Looks like I'd have to ship this file with my custom spatial extension build for AWS Lambda then :-/
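If the build exposes the GDAL OSM driver's CONFIG_FILE open option, one approach (a sketch, not tested on Lambda) is to ship osmconf.ini in the deployment package and point st_read at it explicitly; /var/task below is simply where Lambda unpacks the bundle:

CREATE OR REPLACE TABLE osm_points_monaco AS
SELECT *
FROM st_read(
    'monaco-latest.osm.pbf',
    layer='points',
    sequential_layer_scan=true,
    open_options=['CONFIG_FILE=/var/task/osmconf.ini']
);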
@tobilg I'm not sure how you'd get that call working with vsicurl. I've always downloaded the OSM data as its own step first and then called DuckDB after that. duckdb/duckdb#10609 discusses how some reading functionality is handled by other parts of DuckDB; I'm not sure if calling vsicurl would trigger code paths outside of the Spatial Extension.
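In practice that workflow looks something like the sketch below: fetch the PBF first, then point st_read at the local copy (Monaco is just used here as a small example file):

-- Downloaded beforehand, e.g. with:
--   curl -O https://download.geofabrik.de/europe/monaco-latest.osm.pbf
CREATE OR REPLACE TABLE osm_points_monaco AS
SELECT *
FROM st_read('monaco-latest.osm.pbf',
             layer='points',
             sequential_layer_scan=true);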
4 of the 5 layers will import without issue and produced ~84 GB of DuckDB-formatted data. The 5th layer, lines, causes an OOM. This was run on a machine with 64 GB of RAM. The Overture dataset is about 1.5x bigger than the OSM extracts Geofabrik produces, so I'm not sure if GDAL or anything else that reads OSM files has ever been tested thoroughly on a dataset of this size.
These are the 4 layers that imported without issue for reference.
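For anyone reproducing this, the per-layer imports looked along these lines; the layer names are the standard GDAL OSM driver layers, and which four succeeded follows from lines being the one that failed:

-- points, multilinestrings, multipolygons and other_relations completed;
-- the lines layer is the one that ran out of memory on 64 GB of RAM
CREATE OR REPLACE TABLE osm_points AS
SELECT * FROM st_read('planet-2023-04-02-alpha.osm.pbf',
                      layer='points',
                      sequential_layer_scan=true);

CREATE OR REPLACE TABLE osm_multipolygons AS
SELECT * FROM st_read('planet-2023-04-02-alpha.osm.pbf',
                      layer='multipolygons',
                      sequential_layer_scan=true);

-- ...and likewise for multilinestrings and other_relations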