digidem / osm-p2p-db

Peer-to-peer database for OpenStreetMap data
BSD 2-Clause "Simplified" License

Investigate indexing performance #55

Open hackergrrl opened 7 years ago

hackergrrl commented 7 years ago

I'd really like to see faster indexing. Multiple people have brought up how slow it can be. Reminder to myself to do some profiling on the indexing pipeline and find the bottlenecks.

ralphtheninja commented 7 years ago

@noffle I sometimes assign issues to myself (I don't like being assigned stuff, but assigning myself is ok ;)) and then I can go to Issues and hit the Assigned tab: https://github.com/issues/assigned

hackergrrl commented 7 years ago

I forgot about that. Thanks @ralphtheninja! :heart:

hackergrrl commented 7 years ago

Indexing time is bottlenecked by insertions into kdb-tree-store. I tried plugging in memory-chunk-store, which didn't make a meaningful difference to performance.

It looks like reading out and parsing the point data from the chunk-store is where most of the time is going. (As true today as it was in 2015!)

hackergrrl commented 7 years ago

This is a pretty long response! I was thinking about it as a draft for some kind of technical post for the Dd blog.


Big ol' update!

A while ago I spent a few days understanding and trying to optimize kdb-tree-store, which was a major bottleneck in how osm-p2p-db turns its raw data into a fast, queryable geo-spatial index. Here's that PR: https://github.com/peermaps/kdb-tree-store/pull/5

Lo and behold: it's about 3-4x faster at insertions now, and the resulting gain to osm-p2p-db indexing is that it's about 1.7x faster (or, indexes take 57% of the time to generate). This is great: on my Lenovo ThinkPad with a spinning hard drive, indexing a big dataset went from 12 minutes down to 7.4 minutes! It's noticeably faster even on my newer Dell XPS 13, which went from 76 seconds down to 43 seconds.

Energized by these time savings and by the research I had done on other ways of storing geo-spatial data, I set out to experiment with geohashing and with one of @gmaclennan's ideas: a fixed-grid point store. I felt confident I could make an even faster geo point store!

First was geohash-point-store, which is very fast at insertions but can easily be an order of magnitude slower on queries, depending on the size of the query region. I also realized after writing it that there's a caveat: I'd potentially need to store multiple points at a single geohash address, which necessitates an extra db.get() retrieval per insert. That adds enough overhead that it wasn't significantly better than my kdb-tree-store improvements. Still, I had high hopes for that fixed-grid idea!
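
To make that caveat concrete, here's a minimal sketch of that read-modify-write insert pattern, written against a plain LevelDB instance (via the level module) and the ngeohash encoder. It's only an illustration of the overhead, not geohash-point-store's actual code or API:

var level = require('level')
var geohash = require('ngeohash')

var db = level('./geo-points')

// Several points can hash to the same geohash cell, so every insert has to
// read the existing bucket for that cell, append to it, and write it back.
function insertPoint (lat, lon, value, cb) {
  var key = geohash.encode(lat, lon, 9)    // 9-char geohash cell (a few meters across)
  db.get(key, function (err, json) {       // <-- the extra db.get() per insert
    if (err && !err.notFound) return cb(err)
    var bucket = json ? JSON.parse(json) : []
    bucket.push({ lat: lat, lon: lon, value: value })
    db.put(key, JSON.stringify(bucket), cb)
  })
}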

For referencing map data, OSM divides the world into fixed-size tile grids at zoom levels 0 through 19. At zoom level 0 the whole world is a single tile; at zoom level 1 it's 4 tiles; each successive level multiplies the number of tiles by 4.

With this, enter grid-point-store. It's a point store that you provide a zoom level to, and it stores points on that fixed grid at whatever zoom. It uses LevelDB underneath, which can do ridiculously fast range queries over sequential keys. Armed with this knowledge, the module converts numeric lat/lon coordinates to lexicographically sortable strings, like fG8v,8ewL, which encodes a single tile at whatever zoom level the point store was created with. The power here is that the key for the tile to its right, fG8v,8ewM (note that the trailing M is the only difference between the two), comes after the former key in a lexicographic sort. This means I could do a LevelDB range query from, say, fG8v,8ewL to fG8v,8eZZ and get every point in that row of tiles, from fG8v,8ewL out to the right-most edge of the world map, with LevelDB quickness! Satisfying a 2D bounding box query of a region then comes down to making one range query per horizontal strip of tiles, from the top of the bounding box to the bottom.
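
Here's a rough sketch of that scheme against a plain LevelDB instance via the level module. The tile-key encoding below (fixed-width hex, simple equirectangular mapping) is my own stand-in for illustration and not grid-point-store's actual encoding or API:

var level = require('level')
var db = level('./grid-points')

var ZOOM = 13
var TILES = Math.pow(2, ZOOM)   // number of tiles per axis at this zoom level

// Fixed-width hex, so lexicographic order matches numeric order.
function pad (n) {
  return ('00000000' + n.toString(16)).slice(-8)
}

function tileX (lon) { return Math.floor((lon + 180) / 360 * TILES) }
function tileY (lat) { return Math.floor((lat + 90) / 180 * TILES) }

// Key layout: "<row>,<column>!<point id>", so every point in one horizontal
// strip of tiles sits in a contiguous slice of LevelDB's key space.
function insertPoint (lat, lon, id, value, cb) {
  var key = pad(tileY(lat)) + ',' + pad(tileX(lon)) + '!' + id
  db.put(key, JSON.stringify({ lat: lat, lon: lon, value: value }), cb)
}

// Satisfy a bounding box query with one range read per row of tiles.
function queryBbox (minLat, minLon, maxLat, maxLon, done) {
  var results = []
  var y0 = tileY(minLat)
  var y1 = tileY(maxLat)
  var x0 = tileX(minLon)
  var x1 = tileX(maxLon)
  var pending = y1 - y0 + 1
  for (var y = y0; y <= y1; y++) {
    db.createReadStream({ gte: pad(y) + ',' + pad(x0), lte: pad(y) + ',' + pad(x1) + '\xff' })
      .on('data', function (kv) { results.push(JSON.parse(kv.value)) })
      .on('end', function () { if (--pending === 0) done(null, results) })
  }
}

The real module presumably handles details this sketch ignores (its own key encoding, duplicate ids, the antimeridian), but the core trick is the same: sortable tile keys plus LevelDB range reads.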

Net result: grid-point-store blows everything else away! geohash-point-store was faster than kdb-tree-store, but grid-point-store is almost 5 times faster than that! It also has a nicer speed curve as query size goes up compared to the geohashing approach.

Naturally I expected this to mean that osm-p2p-db indexing would become at least 5 times faster than before. Ha, not so! In fact, there was no speed increase with grid-point-store compared to my optimized version of kdb-tree-store. After doing some profiling with the very handy built-in node --prof test.js && node --prof-process isolate-XXXXX* > profile-data.txt, it became clear that there were other bottlenecks: namely, the hyperlog indexing machinery itself. Once I had made kdb-tree-store faster, the bottleneck moved elsewhere, so any further speed gains in the point store no longer affected the overall time spent.

This could be seen as a bit of a bummer, but I'm feeling really good about my plans to move osm-p2p-db to be backed by hyperdb, which has some really great performance characteristics, in the near-ish future. Once we do that, the indexing machinery's bottleneck should be gone, which will let us tap into that sweet, sweet speedy geo-indexing power lying in wait in grid-point-store.

All in all time really well spent: I learned a ton about geo math, kdb trees, and node profiling, which is going to serve me really well in the future.

Thanks for reading! :wave: :boat: :palm_tree:

ralphtheninja commented 7 years ago

@noffle It would be nice to have a writeup on profiling itself, apart from the other technical details.

hackergrrl commented 7 years ago

> profiling

@ralphtheninja For me, profiling was the easy part!

At first I spent some time fiddling with flame graphs. On Linux I installed perf and used this script to generate flame graph SVGs:

#!/usr/bin/env bash

# Record user-space CPU samples; --perf-basic-prof makes node emit JS symbol names for perf
perf record -e cycles:u -g -- node --perf-basic-prof "$1"
# Dump the recorded samples as text
perf script > perf-script.txt
# Convert the samples into an interactive flame graph SVG
cat perf-script.txt | flamegraph -t perf > perf-graph.svg
chromium perf-graph.svg

This makes nice little visual representations of where your CPU time is going. I found it kind of complicated and onerous to use, which led me to learn that node (via v8) has a super easy built-in profiling mechanism, which I use via this script:

#!/usr/bin/env bash

# Run the program under V8's built-in sampling profiler; this writes an isolate-* log file
node --prof "$1"
# Post-process the tick log into a human-readable summary
node --prof-process isolate* > profile.txt

This text document gives you a breakdown of where the profiler's ticks (CPU samples taken by v8's sampling profiler) are going. You'll see something like

 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
     36   80.0%  /nix/store/7jpyhg5kcqq84hi5xl8g2wkxis0dgs9a-nodejs-6.9.5/bin/node
     32   88.9%    /nix/store/7jpyhg5kcqq84hi5xl8g2wkxis0dgs9a-nodejs-6.9.5/bin/node
     10   31.3%      LazyCompile: ~runInThisContext bootstrap_node.js:403:28
     10  100.0%        LazyCompile: NativeModule.compile bootstrap_node.js:485:44
     10  100.0%          LazyCompile: ~NativeModule.require bootstrap_node.js:419:34
      3   30.0%            LazyCompile: ~startup bootstrap_node.js:12:19
      2   20.0%            Function: ~<anonymous> tty.js:1:11
      2   20.0%            Function: ~<anonymous> module.js:1:11
      1   10.0%            Function: ~<anonymous> util.js:1:11
      1   10.0%            Function: ~<anonymous> stream.js:1:11
      1   10.0%            Function: ~<anonymous> buffer.js:1:11
      3    9.4%      LazyCompile: ~startup bootstrap_node.js:12:19
      3  100.0%        Function: ~<anonymous> bootstrap_node.js:10:10
      1    3.1%      LazyCompile: ~setup_hrtime internal/process.js:77:22
      1  100.0%        LazyCompile: ~startup bootstrap_node.js:12:19
      1  100.0%          Function: ~<anonymous> bootstrap_node.js:10:10
      1    3.1%      LazyCompile: ~replace native string.js:132:23
      1  100.0%        LazyCompile: ~setupConfig internal/process.js:102:21
      1  100.0%          LazyCompile: ~startup bootstrap_node.js:12:19
      1  100.0%            Function: ~<anonymous> bootstrap_node.js:10:10

which goes on and on. You can pretty quickly identify bottlenecks. There is a top-level section for each native binary: node here, covering the JS side, but also one per shared library (e.g. pthread).

My usual fallback is nothing more than console.time statements around areas that I suspect are heavy and slow, measuring the change in time as I tweak things.
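
For completeness, that fallback is nothing fancier than this (doIndexing is just a hypothetical stand-in for whatever chunk of work is under suspicion):

console.time('indexing')
doIndexing(function (err) {     // hypothetical stand-in for the suspected slow path
  if (err) throw err
  console.timeEnd('indexing')   // prints something like "indexing: 4231ms"
})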

ralphtheninja commented 7 years ago

Supernice. Thanks :grinning:

> My usual fallback is nothing more than console.time statements around areas that I suspect are heavy and slow, measuring the change in time as I tweak things.

And I guess not putting console.time in hot code paths, to avoid messing up the measurement. Not sure how much effect this has in JS, but I usually had to think about this when measuring performance in C++ code.

hackergrrl commented 7 years ago

> And I guess not putting console.time in hot code paths, to avoid messing up the measurement.

Right on! I usually do something like

var accum = 0
var d
for (var i = 0; i < 10000; i++) {
  d = Date.now()
  hotFunc()                  // the suspected hot code path
  accum += Date.now() - d    // only accumulate here; log once, outside the loop
}
console.log('overall time', accum, 'ms')

so that writing to stdout isn't what dominates.