digidem / osm-p2p-db

Peer-to-peer database for OpenStreetMap data

Performance/optimization opportunities #13

Open. gmaclennan opened this issue 8 years ago

gmaclennan commented 8 years ago

We're probably too early to start thinking about optimizations, but these ideas/questions were on my mind, so I'm leaving them here for discussion.

  1. As far as I understand the code, the indexes we create are look-ups by id to the original value. What performance might be gained from creating complete 'materialized views' with a copy of the original data, at the expense of increased disk space and, perhaps, longer write times? I.e. looking up a value from the index would require only a single read, rather than two: one for the id(s), and then one for the original record.
  2. For the spatial index, what might we gain by creating a quadtile index as opposed to a kdb tree? My idea would be to create a quadtile index at a single zoom level (e.g. z16) and respond to bounding-box queries by approximating the bbox to the overlapping zoom 16 tiles. If we stored this as a materialized view rather than an index, bbox queries could be very fast, since a query could be a single batch read from leveldb bounded by the min-max of the tiles we are looking for (a sketch of this follows below). Would it matter if we responded to bounding box queries with a close approximation (e.g. rounded to z16 quadtiles) rather than the exact bbox boundaries? iD Editor, for example, only makes bbox requests to the OSM API for bounding boxes equivalent to z16 tiles.
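
A minimal sketch of how ideas 1 and 2 could combine, assuming the `level` module; the `tile!<quadkey>!<id>` key layout and the `put`/`queryBbox` helpers are hypothetical, not osm-p2p-db's actual schema. Each record is stored in full under its z16 quadkey (the materialized view), so a bbox query becomes a few contiguous range reads with no second lookup. Per-tile range reads are shown here; a single scan from the min to the max quadkey would also work, at the cost of reading some extra tiles:

```js
var level = require('level')
var db = level('./quadtile-view')

var Z = 16 // single fixed zoom level for the index

// Standard slippy-map tile math: lon/lat -> z16 tile coordinates.
function lonLatToTile (lon, lat, z) {
  var n = Math.pow(2, z)
  var x = Math.floor((lon + 180) / 360 * n)
  var latRad = lat * Math.PI / 180
  var y = Math.floor((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2 * n)
  return [x, y]
}

// Standard quadkey: one base-4 digit per zoom level.
function quadkey (x, y, z) {
  var key = ''
  for (var i = z; i > 0; i--) {
    var mask = 1 << (i - 1)
    key += ((x & mask) ? 1 : 0) + ((y & mask) ? 2 : 0)
  }
  return key
}

// Write: store a full copy of the record under its tile (the "materialized
// view"), trading disk space and write time for a single read at query time.
function put (node, cb) {
  var t = lonLatToTile(node.lon, node.lat, Z)
  db.put('tile!' + quadkey(t[0], t[1], Z) + '!' + node.id, JSON.stringify(node), cb)
}

// Query: approximate the bbox by the overlapping z16 tiles and do one
// range read per tile; values come back directly, no second lookup.
function queryBbox (minLon, minLat, maxLon, maxLat, cb) {
  var lo = lonLatToTile(minLon, maxLat, Z) // tile y grows southward
  var hi = lonLatToTile(maxLon, minLat, Z)
  var results = []
  var pending = 0
  for (var x = lo[0]; x <= hi[0]; x++) {
    for (var y = lo[1]; y <= hi[1]; y++) {
      pending++
      var prefix = 'tile!' + quadkey(x, y, Z) + '!'
      db.createReadStream({ gte: prefix, lt: prefix + '\xff' })
        .on('data', function (kv) { results.push(JSON.parse(kv.value)) })
        .on('end', function () { if (--pending === 0) cb(null, results) })
    }
  }
}
```
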
ghost commented 8 years ago

1.

One lookup instead of two is in the class of constant-time improvements and would probably result in some very marginal gains, if any. I would skip optimizing this kind of thing until we have datasets with measurable performance strain. Leveldb (and IndexedDB, which uses leveldb under the hood) is all in-process, so extra lookups are much faster than they would be with a database that uses a remote backend, like an external relational database. Also, the performance profile of leveldb can be quite counter-intuitive because of the aggressive caching it does. One benefit of having separate structures for each index is that we can modify the indexes more easily in the future.

I think one thing that could be done here to avoid the extra reference lookups is to use a multi-dimensional interval tree for ways and relations, while points would live in a kdb-tree. Queries on these two structures would run in parallel for any bounding box query.
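
For illustration, a sketch of that fan-out; the `query(bbox, cb)` interface on each index object is hypothetical, standing in for whatever the interval-tree and kdb-tree stores actually expose:

```js
// Fan a bounding-box query out to both spatial structures in parallel
// and merge the results: points from the kdb-tree, ways/relations
// (which span extents) from the interval tree.
function queryBbox (pointIndex, extentIndex, bbox, cb) {
  var results = []
  var pending = 2
  function done (err, rows) {
    if (pending <= 0) return // already failed
    if (err) { pending = 0; return cb(err) }
    results = results.concat(rows)
    if (--pending === 0) cb(null, results)
  }
  pointIndex.query(bbox, done)  // nodes: point-in-bbox lookups
  extentIndex.query(bbox, done) // ways/relations: interval-overlap lookups
}
```
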

2.

Quadtiles are simple, but they have balancing problems, and generalizing updates is very complicated. We don't need to deal with any of this because kdb trees have much better asymptotic guarantees, and we don't need to worry about how to build an adaptive bounding volume hierarchy.

I think a much simpler way to accomplish multiple zoom levels would be to skim off certain tags into a separate kdb index to be used at higher zoom levels. We can pass complicated ways and relations through simplify-geometry. Then, instead of generating static tiles, we can quickly generate an ad-hoc view of the data that is relevant to the zoom level.
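
A rough sketch of that tag-skimming step, assuming simplify-geometry's `simplify(points, tolerance)` signature; the overview-index object, the tag list, and the tolerance value are all illustrative:

```js
var simplify = require('simplify-geometry') // Ramer-Douglas-Peucker simplification

// Tags worth keeping in the coarse overview index; the exact list is
// an assumption for illustration, not a fixed policy.
var OVERVIEW_TAGS = ['highway', 'waterway', 'boundary', 'place']

function indexForOverview (way, coords, overviewIndex, cb) {
  var keep = OVERVIEW_TAGS.some(function (t) {
    return way.tags && way.tags[t]
  })
  if (!keep) return cb(null) // detail-only feature, skip the overview index
  // Simplify the geometry with a coarse tolerance before storing it, so
  // queries over large areas return small, cheap geometries.
  var simplified = simplify(coords, 0.001) // tolerance in degrees, illustrative
  overviewIndex.insert(way.id, simplified, cb) // hypothetical insert API
}
```
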

Tiles are a manifestation of a few different dynamics: the data structures commonly used to store geospatial data, such as quadtiles or r-trees, and also the typical read/write profile of a map server. Map services have many orders of magnitude more reads than writes, so it makes sense that they would want to cache all database queries into static files. However, if we have an editor where reads and writes are somewhere near the same order of magnitude, that tradeoff stops making as much sense. If we have an editor running on a mobile device, a whole new tradeoff emerges: minimizing CPU load and background processing to save on battery life. Building static tiles will probably involve lots of extra background processing and cache invalidation, and I don't think that will scale very well for low-powered devices.

With dynamic queries using the existing spatial indexes, we can generate the tile formats that different existing viewers expect without all the extra processing and complexity of toolchains designed around the high read/write ratios of hosted services. If those queries are too slow, we can put them in an LRU cache, which ought to give the same performance gains as tiles without all the additional moving parts.
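
For example, a minimal sketch of that caching layer, assuming the `lru-cache` module; `renderTile` is a hypothetical stand-in for whatever function runs the spatial-index query and encodes a tile:

```js
var LRU = require('lru-cache')
var cache = new LRU({ max: 500 }) // keep the 500 most recently used tiles

function getTile (z, x, y, cb) {
  var key = z + '/' + x + '/' + y
  var hit = cache.get(key)
  if (hit) return cb(null, hit) // served from memory, no query at all
  renderTile(z, x, y, function (err, tile) {
    if (err) return cb(err)
    cache.set(key, tile) // cache the rendered tile for subsequent reads
    cb(null, tile)
  })
}
```

On an edit, the affected tile keys can simply be dropped from the cache, which is a much smaller invalidation problem than rebuilding static tiles.
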

gmaclennan commented 8 years ago

As ever, I'm learning a lot from these answers to my many questions...

Re. simplification (also relevant to https://github.com/digidem/osm-p2p-server/issues/6): there are pre-simplification algorithms in d3.topojson and geojson-vt, although for rivers and "natural" lines I always find Visvalingam simplification to give better results; it is implemented in JS in Mapshaper.
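
For reference, a toy O(n²) version of Visvalingam's effective-area idea (Mapshaper's real implementation is heap-based and incremental; this only shows the principle):

```js
// Area of the triangle formed by three consecutive points, via the
// cross product. Small area = the middle point contributes little shape.
function triangleArea (a, b, c) {
  return Math.abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2
}

// Repeatedly drop the interior point whose triangle with its neighbours
// has the smallest area, until every remaining point's area exceeds the
// threshold. Endpoints are always preserved.
function visvalingam (points, minArea) {
  var pts = points.slice()
  while (pts.length > 2) {
    var minIdx = -1
    var min = Infinity
    for (var i = 1; i < pts.length - 1; i++) {
      var area = triangleArea(pts[i - 1], pts[i], pts[i + 1])
      if (area < min) { min = area; minIdx = i }
    }
    if (min >= minArea) break // every remaining point is significant enough
    pts.splice(minIdx, 1)
  }
  return pts
}
```
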