boxdot / osmflat-rs

OpenStreetMap flatdata format and compiler
Apache License 2.0
36 stars 7 forks source link

Interest in addition of spatial index archive? #81

Open boydjohnson opened 2 months ago

boydjohnson commented 2 months ago

I thought to add a spatial index to osmflat-rs, and have it partially implemented here (https://github.com/boxdot/osmflat-rs/compare/master...boydjohnson:osmflat-rs:feature/geospatial-archive). It is implemented for Nodes and Ways, but not relations.

The index is this implementation (https://docs.rs/space-time/latest/space_time/xzorder/xz2_sfc/struct.XZ2SFC.html).

It allows for indexing Nodes, Ways, and Relations with the same curve, based on bounding box.

Aside from asking if you would consider a PR, I was wondering if there would be crates.io releases in the future for osmflat and osmflatc?

Best regards.

boxdot commented 2 months ago

We definitely would consider a PR. Also we can easily release a new version on crates.io.

VeaaC commented 2 months ago

Did you manage to also reorder nodes/way based on the index to increase data locality?

I would assume that an index for nodes is "for free" / implicit as long as their are sorted by (some) space filling curve (z-order, hilbert, etc).

Ways and relations require extra data (but there's less of them to go around).

boydjohnson commented 2 months ago

I wasn't able to reorder nodes/ways due to the lack of sort functionality on ExternalVector. I keep the ZOrderIndexEntries in an in-memory vector and sort by index at the end. I can do a US state on my laptop, but know users might want to run it on a planet file.

I asked ChatGPT to make an ascii diagram of how the ZOrderIndexEntry works.

+-----------------------------------------------------------+
|                   ZOrderIndexEntry                        |
|-----------------------------------------------------------|
| z_order (64 bits) | tag (8 bits) | idx (64 bits)          |
+-----------------------------------------------------------+
                           |
                           v
      +------------------------------------------------+
      |                GeoType (tag)                   |
      |------------------------------------------------|
      |  Node = 0    |    Way = 1    |  Relation = 2   |
      +------------------------------------------------+
            |                |                |
            v                v                v
  +----------------+  +----------------+  +----------------+
  |   Nodes Slice  |  |    Ways Slice  |  | Relations Slice|
  +----------------+  +----------------+  +----------------+
  | idx -> Node[ ] |  | idx -> Way[ ]  |  | idx -> Relation[ ] |
  +----------------+  +----------------+  +----------------+

Here's a code snippet of using osmflat to query an archive.

let curve = XZ2SFC::wgs84(Z_ORDER_RESOLUTION);

        let ranges = curve.ranges(x_min, y_min, x_max, y_max, None);

        let spatial = self.osm.spatial_index();
        let ids = self.osm.ids().unwrap();

        ranges
            .iter()
            .map(|i| (i.lower(), i.upper()))
            .flat_map(|v| {
                let element = spatial.z_order_index().partition_point(|e| e.z_order() < v.0);

                spatial.z_order_index()[element..]
                    .iter()
                    .take_while(move |e| e.z_order() <= v.1)
                    .filter_map(|entry| match entry.tag() {
                        osmflat::GeoType::Node => {
                            let item = &self.osm.nodes()[entry.value() as usize];
                            let id = &ids.nodes()[entry.value() as usize];
boydjohnson commented 2 months ago

So I had a thought. Your idea of ordering the nodes, ways, relations each in index order is a good one. It would allow for less data to be stored. To get all the nodes, ways, relations at one spatial index would require 3 binary searches.

The data that wouldn't have to be stored would be (8 bits for the GeoType, 64 bits for the index of the node/way/relation) ~100GB on a planet file.

Then all that needs to be stored would be an optional spatial_index on each Node, Way, and Relation, 64 bits.

To allow for reordering nodes, ways, relations I would use RocksDB during the compilation process, since RocksDB allows iterating all key values by a sorted key. This would allow for the data to never be held in memory, but still be sorted.

VeaaC commented 2 months ago

That sounds like a good idea. You might not even need those extra 64 bit for nodes, as the spatial index of a node is (most likely) just the morton code of its coordinates, and could be computed on-the-fly?

boydjohnson commented 2 months ago

Yeah, That makes sense. Since we have the latitude and longitude, we can calculate the index on the fly.