b-r-u / osmpbf

A Rust library for reading the OpenStreetMap PBF file format (*.osm.pbf).
Apache License 2.0
122 stars 19 forks source link

collect buffering blobs into memory #3

Closed ghost closed 4 years ago

ghost commented 4 years ago

Both for_each and par_map_reduce collect all the results into a vector before providing any results to the callback.

https://github.com/b-r-u/osmpbf/blob/31a48ba07fb23eb6fe5ce86ca6e44cdd554b5ab0/src/reader.rs#L71

To my understanding, this means that all the data from the pbf file is buffered into memory, but these pbf files can get really big (planet.osm.pbf is ~40+GB). It looks like those blobs are doing file reads but I'm not sure if it's only for header info or it's also reading the payloads into memory too.

But in any case it seems better to change

for blob in &blobs {
    match blob.decode() {

into:

for blob in self.blob_iter {
    match blob?.decode() {

unless I'm missing something? I can put together a PR if you'd like with this change and a change to par_map_reduce which I need to dig through rayon docs a bit to figure out.

Another API that might be good here is to return an iterator that does the decoding serially or in parallel. Maybe something like reader.iter() and reader.par_iter() and then you can use whichever methods the std or rayon defines on iterators from there.

ghost commented 4 years ago

Ok I did some more experimentation and it does indeed seem to try to buffer everything into memory. I ran a small test script on an old 34G planet.osm.pbf and the program got up to using 7G of RAM after about a 1m30s with no results when I stopped the program. But with these changes to for_each in the previous comment I start to get results after about 10 seconds and the memory usage stays low.

b-r-u commented 4 years ago

Thanks, this is a great PR! Using rayon's par_bridge (which was introduced in 2018) was on my to-do list for a long time. I'm glad the planet file can now be parsed in parallel without swapping!

Also, I'm open to your proposed API that returns iterators. That would be way nicer for users. I made two attempts at creating an iterator of elements for ElementReader but failed because I had to create a self-referential struct. The problem is that each Element needs a reference to its PrimitiveBlock which needs to be stored during iteration. But maybe I was missing something obvious?

ghost commented 4 years ago

Could you publish the merged version to crates? It should be totally API compatible, so a minor version bump should do.

b-r-u commented 4 years ago

0.1.16 is published!

ghost commented 4 years ago

Thanks!