duckdb / duckdb_spatial


Performance improvements for OSM imports? #54

Open marklit opened 1 year ago

marklit commented 1 year ago

The following OSM file is 618 MB.

$ wget https://download.geofabrik.de/australia-oceania/australia-latest.osm.pbf

Below are the record counts for each layer.

| Layer            | Record Count |
|------------------|--------------|
| Points           | 2,659,810    |
| Lines            | 3,201,975    |
| MultiLineStrings | 3,332        |
| MultiPolygons    | 23,308       |
| Other Relations  | 38,863       |
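
For reference, counts like these can be reproduced per layer through the extension itself; a sketch mirroring the st_read options used in the import further down:

SELECT count(*)
FROM st_read('australia-latest.osm.pbf',
             open_options=['INTERLEAVED_READING=YES'],
             layer='points',          -- repeat for each layer
             sequential_layer_scan=true);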

The following ran on osm2pgsql 1.8.1, PG 14.7 and PostGIS 3.3. It finished in 5m14.235s on my 2020 MBP running on an external SSD. This includes the time it spent analysing and setting up indices.

$ time osm2pgsql -d osm_aus australia-latest.osm.pbf
2023-04-18 08:38:12  Setting up table 'planet_osm_point'
2023-04-18 08:38:12  Setting up table 'planet_osm_line'
2023-04-18 08:38:12  Setting up table 'planet_osm_polygon'
2023-04-18 08:38:12  Setting up table 'planet_osm_roads'
2023-04-18 08:40:59  Reading input files done in 167s (2m 47s).                           
2023-04-18 08:40:59    Processed 89868343 nodes in 14s - 6419k/s
2023-04-18 08:40:59    Processed 6939120 ways in 114s (1m 54s) - 61k/s
2023-04-18 08:40:59    Processed 167603 relations in 39s - 4k/s
2023-04-18 08:41:02  Clustering table 'planet_osm_roads' by geometry...
2023-04-18 08:41:02  Clustering table 'planet_osm_line' by geometry...
2023-04-18 08:41:02  Clustering table 'planet_osm_point' by geometry...
2023-04-18 08:41:02  Clustering table 'planet_osm_polygon' by geometry...
2023-04-18 08:41:24  Creating geometry index on table 'planet_osm_roads'...
2023-04-18 08:41:28  Creating geometry index on table 'planet_osm_point'...
2023-04-18 08:41:31  Analyzing table 'planet_osm_roads'...
2023-04-18 08:42:06  Analyzing table 'planet_osm_point'...
2023-04-18 08:42:06  All postprocessing on table 'planet_osm_point' done in 64s (1m 4s).
2023-04-18 08:42:20  Creating geometry index on table 'planet_osm_polygon'...
2023-04-18 08:42:22  Creating geometry index on table 'planet_osm_line'...
2023-04-18 08:43:15  Analyzing table 'planet_osm_line'...
2023-04-18 08:43:18  All postprocessing on table 'planet_osm_line' done in 135s (2m 15s).
2023-04-18 08:43:23  Analyzing table 'planet_osm_polygon'...
2023-04-18 08:43:26  All postprocessing on table 'planet_osm_polygon' done in 143s (2m 23s).
2023-04-18 08:43:26  All postprocessing on table 'planet_osm_roads' done in 29s.
2023-04-18 08:43:26  osm2pgsql took 314s (5m 14s) overall.
osm_aus=# \d+
                                         List of relations
 Schema |        Name        | Type  | Owner | Persistence | Access method |  Size   | Description 
--------+--------------------+-------+-------+-------------+---------------+---------+-------------
 public | geography_columns  | view  | mark  | permanent   |               | 0 bytes | 
 public | geometry_columns   | view  | mark  | permanent   |               | 0 bytes | 
 public | planet_osm_line    | table | mark  | permanent   | heap          | 1335 MB | 
 public | planet_osm_point   | table | mark  | permanent   | heap          | 210 MB  | 
 public | planet_osm_polygon | table | mark  | permanent   | heap          | 1458 MB | 
 public | planet_osm_roads   | table | mark  | permanent   | heap          | 229 MB  | 
 public | spatial_ref_sys    | table | mark  | permanent   | heap          | 6936 kB | 

I can't see a specific 'other relations' table, but I can't rule out that data being used elsewhere. Here is an example 'other relations' record.

    osm_id = 2272
      name = Causeway Bridge (Victoria Park End)
      type = bridge
other_tags = "layer"=>"1","length"=>"225 m","wikidata"=>"Q54378072"
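
One way to check whether a relation like this ended up in the osm2pgsql output is to look it up by id; osm2pgsql's pgsql output stores relation-derived features with a negated osm_id, so something along these lines should work (a sketch; the relation may also simply be dropped by the default style):

-- Look for relation 2272 in the osm2pgsql tables; relation-derived rows
-- use the negated OSM id in the default pgsql output.
SELECT 'line' AS src, osm_id, name FROM planet_osm_line    WHERE osm_id = -2272
UNION ALL
SELECT 'poly',        osm_id, name FROM planet_osm_polygon WHERE osm_id = -2272;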

The following finished in a little over 9 minutes and produced a 1.6 GB DuckDB file on the same system, running the latest master branch compiled for release. The file is half the size of the PG database, but the import is almost 2x slower than the PG toolset.

LAYERS="points lines multilinestrings multipolygons other_relations"

for LAYER in $LAYERS; do
    time echo "CREATE OR REPLACE TABLE osm_$LAYER AS
              SELECT *
              FROM st_read('australia-latest.osm.pbf',
                           open_options=['INTERLEAVED_READING=YES'],
                           layer='$LAYER',
                           sequential_layer_scan=true);" \
        | /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
                -unsigned \
                aus2.duckdb
done

This is the import time breakdown by layer.

| Layer            | Duration  |
|------------------|-----------|
| points           | 0m31.640s |
| lines            | 2m59.783s |
| multilinestrings | 1m54.469s |
| multipolygons    | 1m58.296s |
| other_relations  | 1m38.887s |

Maxxen commented 1 year ago

Not unexpected, there are a couple of reasons why we're much slower than osm2pgsql.

I have a prototype of a native parallel OSM scanner that avoids unnecessary allocations (should fix #53), but it can't produce GEOMETRY data yet; it just outputs all the entities. The plan is to perform the geometry construction using a DuckDB JOIN, but we need to do some optimiser work to enable the hash partitioning to kick in.
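
For illustration, using the ST_ReadOSM raw entity scan the extension now ships, join-based geometry construction could look roughly like this. This is only a sketch: missing nodes are ignored, no polygon/ring assembly is attempted, and the column names (kind, id, refs, lat, lon) should be checked against the actual scanner output.

-- Split the raw entity scan into nodes and ways.
CREATE TABLE nodes AS
    SELECT id, lon, lat
    FROM ST_ReadOSM('australia-latest.osm.pbf')
    WHERE kind = 'node';

CREATE TABLE ways AS
    SELECT id, refs
    FROM ST_ReadOSM('australia-latest.osm.pbf')
    WHERE kind = 'way';

-- Unnest each way's node refs (keeping their position), join against the
-- node table, then re-aggregate the points in ref order into a linestring.
CREATE TABLE way_geoms AS
    SELECT w.id,
           ST_MakeLine(list(ST_Point(n.lon, n.lat) ORDER BY w.idx)) AS geom
    FROM (SELECT id,
                 UNNEST(refs) AS ref,
                 UNNEST(range(1, len(refs) + 1)) AS idx
          FROM ways) w
    JOIN nodes n ON n.id = w.ref
    GROUP BY w.id;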

marklit commented 8 months ago

You probably saw this already, but @RaczeQ built an OSM reader using PyArrow and it works across multiple cores, much faster than GDAL's OSM reader. He even found GDAL doing some funny stuff with invalid geometries, and he's apparently taken a better approach to handling them.

I'm using your Spatial extension in osm_split at the moment. I'm on the fence about whether I should re-architect it to use QuackOSM's PyArrow code, but ideally the Spatial extension would be upgraded to this multi-core-friendly approach to decoding OSM files.

RaczeQ commented 8 months ago

Hi @marklit, thank you for mentioning QuackOSM, but I think there is some misunderstanding 😄 QuackOSM mainly utilizes the Spatial extension: it reads the data using the ST_ReadOSM function with A LOT of logic written in SQL. So the multi-core performance comes from the internals of the DuckDB engine, not Arrow. It does use the pyarrow library, that's true, but mainly to merge files, remove duplicates, or save the output in GeoParquet format.
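
For context, ST_ReadOSM is the raw entity scan in the extension: one row per node/way/relation, no geometry construction. A minimal example:

SELECT kind, count(*) AS n
FROM ST_ReadOSM('australia-latest.osm.pbf')
GROUP BY kind;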

marklit commented 8 months ago

Thanks for clarifying that. I hadn't had time to read your codebase in detail.

marklit commented 6 months ago

I had a look at performance with the current Spatial extension and DuckDB 0.10.1.

I used the following PBF for testing.

$ wget https://download.geofabrik.de/asia/japan/kanto-latest.osm.pbf

When not run via perf, DuckDB read the 374 MB PBF at 82 MB/s and wrote to CSV at 30 MB/s. This is on a SATA-backed SSD that can normally do ~500-600 MB/s. It was run in an Ubuntu 22 VM with 4 cores from an i9-14900K and 8 GB of DDR5 RAM clocked at 5,600 MT/s.
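
The untraced run was essentially the same statement as the perf command below, just timed and written out as CSV; something along these lines (the output file name here is a placeholder):

$ time ~/duckdb -c 'COPY(SELECT * FROM ST_READ("kanto-latest.osm.pbf", open_options=["INTERLEAVED_READING=YES"])) TO "kanto.csv"'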

$ perf record \
    --call-graph dwarf \
    -- \
    ~/duckdb -c 'COPY(SELECT * FROM ST_READ("kanto-latest.osm.pbf", open_options=["INTERLEAVED_READING=YES"])) TO "/dev/null"'
$ sudo perf script -f \
    | ~/FlameGraph/stackcollapse-perf.pl \
    > out.perf-folded
$ ~/FlameGraph/flamegraph.pl \
    --width 2400 \
    out.perf-folded \
    > kanto.svg

[flame graph image: spatial_osm_perf, kanto]

These are the areas where the most time was spent and where I was hoping speed-ups could be found.

gdal_inflate 27%
    10% gdal_adler32_z
    90% gdal_inflate_fast

OGROSMLayer::GetNextFeature 22%
    OGROSMDataSource::ParseNextChunk ~100%
        ...
        ReadDenseNodes ~100%
            OGROSMDataSource::NotifyNodes
                OGROSMDataSource::IndexPointCustom
                    OGROSMDataSource::FlushCurrentSector ~33%
                    OGROSMDataSource::GetBucket 70%

The zlib code in GDAL looks as optimised as it'll get, and only certain operations within the compression scheme can take advantage of multiple cores.

PBF files using DEFLATE are the bottleneck here and I'm not sure if other compression schemes could be supported as well. This could be a quick win if they could be and Geofabrik were on board with the idea. OSM's PBF format supports LZMA, LZ4 and ZSTD as well: https://wiki.openstreetmap.org/wiki/PBF_Format

lseek is only called 40 times in total, so there isn't a random I/O issue with the reader jumping around the PBF file.

$ strace -wc ~/duckdb -c 'COPY(SELECT * FROM ST_READ("kanto-latest.osm.pbf", open_options=["INTERLEAVED_READING=YES"])) TO "/dev/null"'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 58.33    3.118965        3564       875        28 futex
 34.82    1.861906          34     54585           write
  1.84    0.098175           8     11055           read
  1.69    0.090318         157       573           close
  0.55    0.029462         117       250           mmap
  0.54    0.028736          50       571         1 openat
  0.48    0.025631         187       137           munmap
  0.44    0.023430         223       105           brk
  0.43    0.022846          40       568           newfstatat
  0.25    0.013128         115       114           getpid
  0.15    0.008118          36       221           prlimit64
  0.09    0.004900          44       111           sysinfo
  0.08    0.004305         107        40           lseek
  0.06    0.003393         147        23           rt_sigprocmask
  0.04    0.002315         210        11           clone3
  0.04    0.002297          22       101           mremap
  0.04    0.002027         155        13           mprotect
  0.02    0.001194         170         7         4 access
  0.02    0.001015        1014         1           execve
  0.01    0.000794         113         7         2 stat
  0.01    0.000683         113         6           getrusage
  0.01    0.000563         140         4           pread64
  0.01    0.000518         259         2         1 arch_prctl
  0.01    0.000443         110         4           sched_getaffinity
  0.01    0.000429         107         4           fstat
  0.00    0.000258         129         2           socket
  0.00    0.000253         126         2           rt_sigaction
  0.00    0.000245         122         2         1 readlink
  0.00    0.000227         113         2         2 connect
  0.00    0.000218         108         2           ioctl
  0.00    0.000170         169         1           getrandom
  0.00    0.000111         110         1           getuid
  0.00    0.000110         110         1           set_tid_address
  0.00    0.000110         109         1           set_robust_list
  0.00    0.000110         109         1           rseq
------ ----------- ----------- --------- --------- ------------------
100.00    5.347404          77     69403        39 total

The vast majority of PBF read calls are 20-50 KB, so there doesn't seem to be a small-buffer issue either. The 1,024-byte and 6,000-byte reads happen 120 times in total, which isn't much.

$ strace -e trace=read \
    ~/duckdb -c 'COPY(SELECT * 
                      FROM ST_READ("kanto-latest.osm.pbf",
                                   open_options=["INTERLEAVED_READING=YES"])) 
                 TO "/dev/null"' 2>&1 \
    | grep -o ' = [0-9]*' \
    | sort \
    | uniq -c \
    | sort -rnn
   2357  = 4
   2048  = 13
    310  = 0
    104  = 1024
    102  = 47
     16  = 6000
      6  = 832
      3  = 31291
      2  = 60
      2  = 58397
      2  = 57159
      2  = 56525
      2  = 53043
      2  = 52576
      2  = 52266
      2  = 50790
      2  = 50666
      2  = 49186
...

I welcome anyone's input on the above. If/when I come up with other avenues of investigation or performance improvement ideas, I'll leave them here.