duckdb / duckdb_spatial

MIT License
443 stars 33 forks source link

Inconsistent row count between North America extract and planet file using `ST_ReadOsm()` #349

Open wcedmisten opened 2 months ago

wcedmisten commented 2 months ago

Hello!

I was playing around with ST_ReadOsm() and I noticed inconsistent results for the same query when run on the North America extract compared to the Planet file. I'm wondering if this might be a bug because I'm not sure how to explain the behavior.

The north america extract returns 245,000 rows but the planet only returns 90,000. I'd expect the planet to return far more than North America. Both files were downloaded today.

D SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/planet-240610.osm.pbf') WHERE tags['amenity'] = ['restaurant'];
100% ▕████████████████████████████████████████████████████████████▏ 
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        89751 │
└──────────────┘
D SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf') WHERE tags['amenity'] = ['restaurant'];
100% ▕████████████████████████████████████████████████████████████▏ 
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       244678 │
└──────────────┘

DuckDB version:

~/Downloads/duckdb --version
v1.0.0 1f98600c2c

OS version:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:    22.04
Codename:   jammy
wcedmisten commented 2 months ago

Here's a specific way that is missing from the DuckDB query:

D SELECT kind, id FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf') WHERE id = 18991954;
100% ▕████████████████████████████████████████████████████████████▏ 
┌──────────────────────────────────────────────┬──────────┐
│                     kind                     │    id    │
│ enum('node', 'way', 'relation', 'changeset') │  int64   │
├──────────────────────────────────────────────┼──────────┤
│ way                                          │ 18991954 │
└──────────────────────────────────────────────┴──────────┘
D SELECT kind, id FROM st_readOSM('/home/wcedmisten/Downloads/planet-240610.osm.pbf') WHERE id = 18991954;
100% ▕████████████████████████████████████████████████████████████▏ 
┌──────────────────────────────────────────────┬───────┐
│                     kind                     │  id   │
│ enum('node', 'way', 'relation', 'changeset') │ int64 │
├──────────────────────────────────────────────┴───────┤
│                        0 rows                        │
└──────────────────────────────────────────────────────┘

I verified that this way does exist via osmium, so I think this points to being a duckDB bug.

osmium getid --verbose-ids --overwrite --default-type=way ~/Downloads/planet-240610.osm.pbf 18991954 -o test.osm.bz2
[ 0:00] Started osmium getid
[ 0:00]   osmium version 1.14.0
[ 0:00]   libosmium version 2.18.0
[ 0:00] Command line options and default settings:
[ 0:00]   input options:
[ 0:00]     file name: /home/wcedmisten/Downloads/planet-240610.osm.pbf
[ 0:00]     file format: 
[ 0:00]   output options:
[ 0:00]     file name: test.osm.bz2
[ 0:00]     file format: 
[ 0:00]     generator: osmium/1.14.0
[ 0:00]     overwrite: yes
[ 0:00]     fsync: no
[ 0:00]   other options:
[ 0:00]     add referenced objects: no
[ 0:00]     work with history files: no
[ 0:00]     default object type: way
[ 0:00]     looking for these ids:
[ 0:00]       nodes:
[ 0:00]       ways: 18991954
[ 0:00]       relations:
[ 0:00] Opening input file...
[ 0:00] Opening output file...
[ 0:00] Copying matching objects to output file...
[======================================================================] 100% 
[ 2:32] Closing output file...
[ 2:32] Closing input file...
[ 2:32] Found all objects.
[ 2:32] Peak memory used: 1185 MBytes
[ 2:32] Done.
Maxxen commented 2 months ago

Hello! Thanks for raising this issue! This might very well be a bug... Where did you get the planet/NA extract from? GeoFabrik?

wcedmisten commented 2 months ago

Yeah the extract was through Geofabrik and the planet file was through the official torrent (from this page)

pilate commented 1 week ago

After noticing that planet queries were missing ~95% of the nodes, I found this issue, and decided to dig in a bit.

I believe what's happening is that only the first 'group' is being read per 'block'.

As soon as the granularity and offsets are read, the block_reader will advance beyond any remaining 'group' (2) nodes.

I was able to confirm that by removing the 'block_reader.next' calls in the ParseState::Block case, DuckDB returns the correct count of nodes/ways/relations:

Before:

│ node                                         │    470272000 │
│ way                                          │    124680000 │
│ relation                                     │        78600 │

After:

│ node                                         │   9325981030 │
│ way                                          │   1042862530 │
│ relation                                     │     12420430 │

This seems a little messy to fix, as the granularity/offsets are used during handling of 'group' messages.