duckdb / duckdb_spatial

MIT License
492 stars 41 forks source link

OOM issue importing an 84 GB OSM file into DuckDB #53

Open marklit opened 1 year ago

marklit commented 1 year ago

4 of the 5 layers will import without issue and they produced ~84 GB of DuckDB-formatted data. This 5th layer for lines causes an OOM. This was run on a machine with 64 GB of RAM.

The Overture dataset is about 1.5x bigger than the OSM extracts Geofabrik produce so I'm not sure if GDAL or anything else that reads OSM files has ever been tested thoroughly on a dataset of this size.

$ wget -c https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-04-02-alpha/planet-2023-04-02-alpha.osm.pbf

$ for LAYER in lines; do
    echo "CREATE OR REPLACE TABLE osm_$LAYER AS
              SELECT *
              FROM st_read('planet-2023-04-02-alpha.osm.pbf',
                           open_options=['INTERLEAVED_READING=YES'],
                           layer='$LAYER',
                           sequential_layer_scan=true);" \
        | ~/duckdb_spatial/build/release/duckdb \
                -unsigned \
                overture.duckdb
done
Killed

These are the 4 layers that imported without issue for reference.

$ ogrinfo -ro planet-2023-04-02-alpha.osm.pbf 
INFO: Open of `planet-2023-04-02-alpha.osm.pbf'
      using driver `OSM' successful.
1: points (Point)
2: lines (Line String)
3: multilinestrings (Multi Line String)
4: multipolygons (Multi Polygon)
5: other_relations (Geometry Collection)
┌────────────┐
│ num_points │
│   int64    │
├────────────┤
│  181527241 │
└────────────┘
    osm_id = 1
      name = Monte Piselli - San Giacomo
   barrier = 
   highway = 
       ref = 
   address = 
     is_in = 
     place = 
  man_made = mast
other_tags = "tower:type"=>"communication","tower:construction"=>"lattice","frequency"=>"105.5 MHz","description"=>"Radio Subasio","communication:radio"=>"fm","communication:microwave"=>"yes"
┌──────────────────────┐
│ num_multilinestrings │
│        int64         │
├──────────────────────┤
│                14594 │
└──────────────────────┘
    osm_id = 18
      name = Section Catalonia GNR02
      type = route
other_tags = "ref"=>"TET:EU:ES:GNR:02:Catalonia","operator"=>"www.transeurotrail.org","name:ca"=>"Secció Catalunya GNR02"
┌───────────────────┐
│ num_multipolygons │
│       int64       │
├───────────────────┤
│             26159 │
└───────────────────┘
     osm_id = 11
 osm_way_id = 
       name = Tween Pond
       type = multipolygon
    aeroway = 
    amenity = 
admin_level = 
    barrier = 
   boundary = 
   building = 
      craft = 
 geological = 
   historic = 
  land_area = 
    landuse = 
    leisure = 
   man_made = 
   military = 
    natural = water
     office = 
      place = 
       shop = 
      sport = 
    tourism = 
 other_tags = 
┌─────────────────────┐
│ num_other_relations │
│        int64        │
├─────────────────────┤
│                9847 │
└─────────────────────┘
    osm_id = 2202
      name = Osterbek
      type = waterway
other_tags = "wikipedia"=>"de:Osterbek","wikidata"=>"Q321993","waterway"=>"canal","ref:fgkz"=>"595674","destination"=>"Alster"
marklit commented 1 year ago

If it's worth anything, the OSM project for importing PBFs into PG will run out of RAM on a 64 GB system with default settings https://github.com/openstreetmap/osm2pgsql/issues/1954 I'm trying their --slim and --drop flags to see if they help.

jwass commented 1 year ago

@marklit Not sure if you're interested - but as part of the Daylight Map project (daylightmap.org) we publish parquet files including geometries of the underlying OSM data. See https://daylightmap.org/daylight-osm-distribution-parquet.html for more. See also https://aws.amazon.com/blogs/publicsector/querying-daylight-openstreetmap-distribution-amazon-athena/

marklit commented 1 year ago

Thanks @jwass funny enough, I came across that dataset a few weeks ago when I was looking at the canopy one Meta published for California.

marklit commented 1 year ago

The following worked but took 5.5 days to complete.

$ osm2pgsql -d overture --slim --drop planet-2023-04-02-alpha.osm.pbf
RaczeQ commented 1 year ago

@marklit when working in a low memory environment, I've started using open_option USE_CUSTOM_INDEXING=NO. It might not be as fast but got the job done without crashing the database.

Maxxen commented 1 year ago

FWIW I'm going to look at better supporting this use case tomorrow.

Maxxen commented 1 year ago

Depending on how you plan to use OSM data you may or may not find the progress in duckdb/duckdb_spatial#90 useful. It's super fast, but it doesn't get you "all the way" in the sense that it doesn't produce complete GEOMETRY objects. We're thinking of how to actually perform the geometry reconstruction in an efficient and easy to use way but its most likely going to requires significantly more work.

marklit commented 1 year ago

The Overture canopy dataset has been taking weeks to process with PG. I'm going to try and find a DuckDB-only or DuckDB + Shapely way of getting this data into BQ. I'll report back with any findings I come across.

tobilg commented 1 year ago

I tried loading a remote PBF file, but failed for the following query:

CREATE OR REPLACE TABLE osm_points_monaco AS SELECT * FROM st_read('/vsicurl/https://download.geofabrik.de/europe/monaco-latest.osm.pbf', open_options=['INTERLEAVED_READING=YES'], layer='points', sequential_layer_scan=true);

with

IO Error: Could not open file: /vsicurl/https://download.geofabrik.de/europe/monaco-latest.osm.pbf (Could not parse configuration file for OSM import)

Do you have an idea on how to solve this @marklit / @Maxxen?

ttomasz commented 1 year ago

@tobilg check out osmconf.ini described here https://gdal.org/drivers/vector/osm.html also there is native osm pbf reader if you can live with raw objects (have to build geometries yourself)

tobilg commented 1 year ago

@ttomasz thanks for the fast reply! I was assuming that it's already included (somehow) in the build, as the above example didn't use it. Look like I'd have to ship this file with my custom spatial extension build for AWS Lambda then :-/

marklit commented 9 months ago

@tobilg I'm not sure how you'd get that call working with viscurl. I've always downloaded the OSM data as its own step first and then called DuckDB after that. duckdb/duckdb#10609 discusses how some reading functionality is handled by other parts of DuckDB, I'm not sure if calling viscurl would trigger code paths outside of the Spatial Extension.