marklit opened 1 year ago
Not unexpected; there are a couple of reasons why we're much slower than osm2pgsql.
`st_read` is single-threaded, and it has to be due to GDAL. I have a prototype of a native parallel OSM scanner that avoids unnecessary allocations (should fix #53), but it can't produce GEOMETRY data yet; it just outputs all the entities. The plan is to perform the geometry construction using a DuckDB JOIN, but we need to do some optimiser work to enable the hash partitioning to kick in.
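For illustration only, the join-based construction could look something like the sketch below. The `nodes` and `ways` tables, their columns, and the `LIST`-typed `refs` are hypothetical stand-ins for whatever the native scanner would emit; this is not the actual prototype.

```sql
-- Hypothetical sketch: rebuild way geometries with a DuckDB JOIN, assuming
-- the scanner emits nodes(id, lon, lat) and ways(id, refs LIST(BIGINT)).
WITH exploded AS (
    SELECT w.id AS way_id,
           UNNEST(w.refs) AS node_id,
           -- parallel UNNESTs zip together in DuckDB, giving each ref its position
           UNNEST(range(len(w.refs))) AS seq
    FROM ways w
)
SELECT e.way_id,
       -- collect the way's points in ref order and stitch them into a linestring
       ST_MakeLine(list(ST_Point(n.lon, n.lat) ORDER BY e.seq)) AS geom
FROM exploded e
JOIN nodes n ON n.id = e.node_id  -- the hash join the optimiser needs to partition
GROUP BY e.way_id;
```

The node-lookup join is where the hash partitioning mentioned above would have to kick in for the work to spread across threads.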
You probably saw this already, but @RaczeQ built an OSM reader using PyArrow and it works across multiple cores, much faster than GDAL's OSM reader. He even found GDAL doing some funny stuff with invalid geometry, and he's apparently taken a better approach to it.
I'm using your Spatial extension in osm_split at the moment. I'm on the fence about whether I should re-architect it to use QuackOSM's PyArrow code, but ideally the Spatial extension should be upgraded to this multi-core-friendly approach to decoding OSM files.
Hi @marklit, thank you for mentioning QuackOSM, but I think there is some misunderstanding 😄
QuackOSM mainly utilizes the Spatial extension and reads the data using the `ST_ReadOSM` function, with A LOT of logic written in SQL. So the multi-core performance comes from the internals of the DuckDB engine, not Arrow.
It does utilize the pyarrow library, that's true, but it's mainly used to merge files, remove duplicates, and save the output in GeoParquet format.
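For anyone skimming the thread, a minimal `ST_ReadOSM` call looks like the following; the heavy lifting in QuackOSM is the SQL layered on top of this, which is elided here.

```sql
-- Minimal usage sketch: ST_ReadOSM streams raw OSM entities (nodes, ways,
-- relations) without building geometries, so DuckDB can parallelise the scan.
SELECT kind, id, tags
FROM ST_ReadOSM('kanto-latest.osm.pbf')  -- replace with your PBF path
WHERE kind = 'node'
LIMIT 5;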
Thanks for clarifying that. I hadn't had time to read your codebase in detail.
I had a look at performance with the current Spatial extension and DuckDB 0.10.1.
I used the following PBF for testing.
```bash
$ wget https://download.geofabrik.de/asia/japan/kanto-latest.osm.pbf
```
When not run via perf, DuckDB read the 374 MB PBF at 82 MB/s and wrote CSV at 30 MB/s. This is on a SATA-backed SSD that can normally do ~500-600 MB/s. It was run in an Ubuntu 22 VM with 4 cores from an i9-14900K and 8 GB of DDR5 RAM clocked at 5,600 MT/s.
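The throughput figures are wall-clock based: time the export and divide the input/output sizes by the elapsed time, roughly like this (a reconstruction, not a verbatim record of the run):

```bash
# Rough throughput measurement: time the export, then divide the PBF and CSV
# sizes by the elapsed wall time.
time ~/duckdb -c 'COPY(SELECT * FROM ST_READ("kanto-latest.osm.pbf", open_options=["INTERLEAVED_READING=YES"])) TO "kanto.csv"'
ls -l kanto-latest.osm.pbf kanto.csv
```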
```bash
$ perf record \
    --call-graph dwarf \
    -- \
    ~/duckdb -c 'COPY(SELECT * FROM ST_READ("kanto-latest.osm.pbf", open_options=["INTERLEAVED_READING=YES"])) TO "/dev/null"'

$ sudo perf script -f \
    | ~/FlameGraph/stackcollapse-perf.pl \
    > out.perf-folded

$ ~/FlameGraph/flamegraph.pl \
    --width 2400 \
    out.perf-folded \
    > kanto.svg
```
These were the areas where the most time was spent and where I was hoping we could see speed-ups:
```
gdal_inflate                                  27%
    gdal_adler32_z                            10%
    gdal_inflate_fast                         90%

OGROSMLayer::GetNextFeature                   22%
    OGROSMDataSource::ParseNextChunk         ~100%
        ...
        ReadDenseNodes                       ~100%
            OGROSMDataSource::NotifyNodes
                OGROSMDataSource::IndexPointCustom
                    OGROSMDataSource::FlushCurrentSector  ~33%
                        OGROSMDataSource::GetBucket        70%
```
The zlib code in GDAL looks as optimised as it'll get. Only certain operations within the DEFLATE scheme can take advantage of multiple cores.
PBF's use of DEFLATE is the bottleneck here, and I'm not sure whether other compression schemes could be supported as well. This could be a quick win if they could be and Geofabrik were on board with the idea. OSM's PBF format also supports LZMA, LZ4 and ZSTD: https://wiki.openstreetmap.org/wiki/PBF_Format. A quick sanity check is sketched below.
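One way to gauge how much DEFLATE costs, assuming osmium-tool is available, would be to rewrite the PBF with uncompressed blocks (`pbf_compression=none` is a documented osmium output option) and re-time the scan:

```bash
# Rewrite the PBF without per-block DEFLATE; the output will be much larger.
osmium cat kanto-latest.osm.pbf \
    -o kanto-none.osm.pbf \
    -f pbf,pbf_compression=none

# Re-run the same scan against the uncompressed variant and compare wall time.
time ~/duckdb -c 'COPY(SELECT * FROM ST_READ("kanto-none.osm.pbf", open_options=["INTERLEAVED_READING=YES"])) TO "/dev/null"'
```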
`lseek` is only called 40 times in total, so there isn't a random-I/O issue with the reader jumping around the PBF file.
```
$ strace -wc ~/duckdb -c 'COPY(SELECT * FROM ST_READ("kanto-latest.osm.pbf", open_options=["INTERLEAVED_READING=YES"])) TO "/dev/null"'

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 58.33    3.118965        3564       875        28 futex
 34.82    1.861906          34     54585           write
  1.84    0.098175           8     11055           read
  1.69    0.090318         157       573           close
  0.55    0.029462         117       250           mmap
  0.54    0.028736          50       571         1 openat
  0.48    0.025631         187       137           munmap
  0.44    0.023430         223       105           brk
  0.43    0.022846          40       568           newfstatat
  0.25    0.013128         115       114           getpid
  0.15    0.008118          36       221           prlimit64
  0.09    0.004900          44       111           sysinfo
  0.08    0.004305         107        40           lseek
  0.06    0.003393         147        23           rt_sigprocmask
  0.04    0.002315         210        11           clone3
  0.04    0.002297          22       101           mremap
  0.04    0.002027         155        13           mprotect
  0.02    0.001194         170         7         4 access
  0.02    0.001015        1014         1           execve
  0.01    0.000794         113         7         2 stat
  0.01    0.000683         113         6           getrusage
  0.01    0.000563         140         4           pread64
  0.01    0.000518         259         2         1 arch_prctl
  0.01    0.000443         110         4           sched_getaffinity
  0.01    0.000429         107         4           fstat
  0.00    0.000258         129         2           socket
  0.00    0.000253         126         2           rt_sigaction
  0.00    0.000245         122         2         1 readlink
  0.00    0.000227         113         2         2 connect
  0.00    0.000218         108         2           ioctl
  0.00    0.000170         169         1           getrandom
  0.00    0.000111         110         1           getuid
  0.00    0.000110         110         1           set_tid_address
  0.00    0.000110         109         1           set_robust_list
  0.00    0.000110         109         1           rseq
------ ----------- ----------- --------- --------- ------------------
100.00    5.347404          77     69403        39 total
```
The vast majority of PBF read calls are 20-50 KB, so there doesn't seem to be a small-buffer issue either. The 1,024- and 6,000-byte reads happen 120 times in total, which isn't much.
```
$ strace -e trace=read \
    ~/duckdb -c 'COPY(SELECT *
                 FROM ST_READ("kanto-latest.osm.pbf",
                              open_options=["INTERLEAVED_READING=YES"]))
                 TO "/dev/null"' 2>&1 \
    | grep -o ' = [0-9]*' \
    | sort \
    | uniq -c \
    | sort -rn

   2357  = 4
   2048  = 13
    310  = 0
    104  = 1024
    102  = 47
     16  = 6000
      6  = 832
      3  = 31291
      2  = 60
      2  = 58397
      2  = 57159
      2  = 56525
      2  = 53043
      2  = 52576
      2  = 52266
      2  = 50790
      2  = 50666
      2  = 49186
...
```
I welcome anyone's input on the above. If / when I come up with other avenues of investigation or performance-improvement ideas, I'll leave them here.
The following OSM file is 618 MB.
Below are the record counts for each layer.
The following ran on osm2pgsql 1.8.1, PG 14.7 and PostGIS 3.3. It finished in 5m14.235s on my 2020 MBP running on an external SSD. This includes the time it spent analysing and setting up indices.
I can't see a specific 'other relations' table, but I can't rule out that data being used elsewhere. This is an example 'other relations' table.
The following finished in a little over 9 minutes, producing a 1.6 GB DuckDB file on the same system running the latest master branch compiled for release. The file is half the size of the PG DB, but this is almost 2x slower than the PG toolset.
This is an import time breakdown by layer.