boxdot / osmflat-rs

OpenStreetMap flatdata format and compiler
Apache License 2.0
37 stars 7 forks source link

Sorting vectors #66

Open hallahan opened 2 years ago

hallahan commented 2 years ago

Right now, it looks like we can only build a Vec by appending. It would be nice to be able to sort our vectors after they have been built. For example, I may want to compute a hilbert index for the nodes and sort them by that. That way, I can access the data with good locality.

Is there a straightforward way to sort?

boxdot commented 2 years ago

Usually, what we sort data in memory and then write it to disk with e.g. ExternalVector. We had some experimental work on sorting ExternalVector externally, which is useful when the data does not fit on disk, but unfortunately we never finished it.

hallahan commented 2 years ago

You would need a beefy machine to sort the OSM planet. There are 7_992_520_838 nodes. Just sorting a 64bit index would be ~60GB of RAM. It would be cool if that wasn't needed, because I can run your tool otherwise on my laptop.

What is the challenge behind making vectors backed by memmap mutable?

VeaaC commented 2 years ago

There are several problems with memory mapping mutable vectors: A) It violates Rust's safety guarantees: A mutable reference invalidates all other references, but nothing stops another applications from opening the same file. We already violate this slightly by read-only memory mapping (somebody could manually open it writeable). B) Just because it is mutably memory mapped doesn't mean you can sort it easily: Many sorting algorithms are not in-place, or optimized for external-memory sorting C) Some flatdata structs are not self-contained, they also read data from the "next" item in the vector (@range attribution does that). Those types of structs must not be re-ordered.

That's why our general plan was to rather extend ExternalVector with a dedicated sorting method that is optimized for external memory, and can only be called when finalizing the vector (e.g. super-scalar-sample-sort, see https://github.com/lorenzhs/ssssort ), and prevents being used in case the struct is not self-contained.

hallahan commented 2 years ago

The reason I ask is that I'm experimenting with creating tile indices using a Hilbert Curve. I am converting Node LatLons to their corresponding Hilbert number, then sorting Nodes by that. For ways, I do the same thing, with the Hilbert number derived from their Point on Surface.

osmflat-rs seems like a compelling fit for a back-end format, but an easy way of memmap sorting is needed.

hallahan commented 2 years ago

Here's a general idea behind sorting vectors. Mutable accesors could be built into the storage code in flatdata to allow this sort of thing and maybe provide some guardrails...

https://github.com/boxdot/osmflat-rs/pull/74

It would be super cool if you could annotate the schema to make the sorting populate throughout all references from the sorted vector.

I haven't yet tried sorting the nodes themselves yet.

VeaaC commented 2 years ago

Building sorting into flatdata itself is definitely a feature we are looking forward to. Automatically updating any references is most likely not going to happen, since such a thing usually requires having a large in-memory index that remaps references, and also reordering data references with @range.

The nodes in osmflat itself are not sortable directly, though (as they have @range(tags) tag_first_idx: u64 : 40;), so one would most likely create a temporary intermediate storage for sorting.

Regarding the usability of osmflat as a backend format: While it can word for simple use cases I think it is much better suited as a input format to build your backend format: Most backends will need some of:

In my opinion osmflat is rather suited as an input format to build your own backend format out of: It is much easier to consume than PBF, requires less RAM (due to references being already resolved). We were mostly designing it as a data-exchange format replacing PBF.

hallahan commented 2 years ago

Extending ExternalVector with a sort is a pretty interesting idea. Not sure if there is a parent issue on this.

Have you thought about radix sort?

https://docs.rs/radsort/latest/radsort/index.html

VeaaC commented 2 years ago

Radix sort is usually not the best choice if you need to sort data that does not fit into RAM: You would need one pass over the data for each 4/8/16 bits (whatever your word size is). E.g. assuming 16 bits words, and 32 byte structures you would need up to 16 passes of reading/writing the data. Some of the later passes might fit into memory, but it mostly depends on the data.

Compare that e.g. with (assuming that your data size is less than RAM ^ 1.5, e.g. 4 GB RAM -> less than 281 PB data ):

hallahan commented 2 years ago

Makes sense. Basically, the sort needs to be localized first to reduce paging.