Closed ghost closed 4 years ago
Ok I did some more experimentation and it does indeed seem to try to buffer everything into memory. I ran a small test script on an old 34G planet.osm.pbf and the program got up to using 7G of RAM after about a 1m30s with no results when I stopped the program. But with these changes to for_each
in the previous comment I start to get results after about 10 seconds and the memory usage stays low.
Thanks, this is a great PR! Using rayon's par_bridge
(which was introduced in 2018) was on my to-do list for a long time. I'm glad the planet file can now be parsed in parallel without swapping!
Also, I'm open to your proposed API that returns iterators. That would be way nicer for users. I made two attempts at creating an iterator of elements for ElementReader
but failed because I had to create a self-referential struct. The problem is that each Element
needs a reference to its PrimitiveBlock
which needs to be stored during iteration. But maybe I was missing something obvious?
Could you publish the merged version to crates? It should be totally API compatible, so a minor version bump should do.
0.1.16 is published!
Thanks!
Both
for_each
andpar_map_reduce
collect all the results into a vector before providing any results to the callback.https://github.com/b-r-u/osmpbf/blob/31a48ba07fb23eb6fe5ce86ca6e44cdd554b5ab0/src/reader.rs#L71
To my understanding, this means that all the data from the pbf file is buffered into memory, but these pbf files can get really big (planet.osm.pbf is ~40+GB). It looks like those blobs are doing file reads but I'm not sure if it's only for header info or it's also reading the payloads into memory too.
But in any case it seems better to change
into:
unless I'm missing something? I can put together a PR if you'd like with this change and a change to
par_map_reduce
which I need to dig through rayon docs a bit to figure out.Another API that might be good here is to return an iterator that does the decoding serially or in parallel. Maybe something like
reader.iter()
andreader.par_iter()
and then you can use whichever methods the std or rayon defines on iterators from there.