azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark
Apache License 2.0

Reduce temporal and spatial resolution of FacetedEditHistogram tiles #150

Closed (jpolchlo closed this 4 years ago)

jpolchlo commented 5 years ago

To make it feasible to generate edit histogram tiles for large data sources (e.g., the full OSM planet history), it is necessary to cut down the volume of data encoded in those tiles. As an initial exercise, I've implemented an imperfect binning strategy that groups days into bins that are wider in the past and grow narrower as time goes forward.

[image: approximate bin size as a function of days since the launch of OSM]

The above graph gives the rough size of the bins in use as a function of days since the launch of OSM (taken as August 9, 2004). Bins are roughly a month long initially, shrinking to about 2 days at the present day. This means that the ensuing ~5400 days are reduced to fewer than 1100 bins, about an 80% reduction.
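As a rough illustration of variable-width temporal binning, here is a minimal sketch. It assumes a bin width that ramps linearly from 30 days down to 2 days. The actual curve used in this issue is not given, so `bin_width`, `make_bin_edges`, and `bin_index` are hypothetical names, and the bin count this produces will not match the figures quoted above. Bins are built by walking forward in time, which keeps the day-to-bin mapping order-preserving by construction.

```python
import bisect

def bin_width(day, total_days=5400, first=30.0, last=2.0):
    """Assumed width (in days) of a bin starting at `day`: a linear ramp."""
    frac = min(max(day / total_days, 0.0), 1.0)
    return first + (last - first) * frac

def make_bin_edges(total_days=5400):
    """Walk forward from day 0 and record the start day of each bin."""
    edges = []
    start = 0
    while start < total_days:
        edges.append(start)
        # Advance by the (rounded) width at the current start day.
        start += max(1, round(bin_width(start, total_days)))
    return edges

def bin_index(edges, day):
    """Index of the bin containing `day` (the last edge that is <= day)."""
    return bisect.bisect_right(edges, day) - 1

edges = make_bin_edges()
print(len(edges), "bins for 5400 days")
```

Because each bin starts where the previous one ends, a later day can never land in an earlier bin, at the cost of bin boundaries that are not aligned to any fixed calendar grid.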

[Aside: the mapping is imperfect because the order of days is not necessarily preserved by the resulting bin order.
[image: input day versus assigned bin day] The abscissa is the input day, and the ordinate is the day of the bin it was placed in. Here, you can see that there is an inversion in the middle of the range. This is not calamitous for a coarse visualization such as this, but it is worth noting. (It also creates two bins where there should be just one.)]
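An inversion of this kind can be checked for mechanically. The sketch below is an assumption-laden illustration, not the mapping used in this issue: `snap_to_own_width` is a hypothetical scheme that snaps each day to a multiple of its own (shrinking) bin width, which is one way a non-monotone assignment can arise. `find_inversions` then reports every input day whose assigned bin day is earlier than that of the day before it.

```python
def snap_to_own_width(day, total_days=5400, first=30.0, last=2.0):
    """Hypothetical mapping: snap `day` down to a multiple of its own bin width."""
    width = max(1, round(first + (last - first) * day / total_days))
    return (day // width) * width

def find_inversions(bin_days):
    """Input days d whose bin day is earlier than the bin day of d - 1."""
    return [d for d in range(1, len(bin_days)) if bin_days[d] < bin_days[d - 1]]

mapping = [snap_to_own_width(d) for d in range(5400)]
inversions = find_inversions(mapping)
print(len(inversions), "inversions; first few at days", inversions[:5])
```

In this hypothetical scheme, inversions appear wherever the rounded width changes, because the snapping grid shifts at that point. An empty result from `find_inversions` would confirm that a mapping preserves day order.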

I've also reduced the spatial resolution by half in each spatial dimension.
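Halving the resolution in each dimension leaves a quarter as many cells per tile. A minimal sketch of one way to do this, assuming each cell holds an edit count and that 2x2 blocks are summed (the issue does not say whether cells were summed or subsampled; `halve_resolution` is a hypothetical helper):

```python
def halve_resolution(grid):
    """Sum each 2x2 block of `grid` into one cell; dimensions must be even."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid dimensions must be even")
    return [
        [
            grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

counts = [
    [1, 0, 2, 0],
    [0, 3, 0, 0],
    [0, 0, 4, 1],
    [5, 0, 0, 0],
]
print(halve_resolution(counts))
```

Summing rather than subsampling keeps the total number of edits per tile unchanged, which matters if the histogram values are later compared across zoom levels.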

The resulting task required more than 128 m3.xlarge nodes to run to completion. I succeeded with 192 nodes, though fewer may be sufficient. Clusters that were too small failed with OOM errors.

On the output side, the resulting tiles are ... not small. I've found some tiles over 6 MB in size in the result set. Clearly, a rethink of how to represent the data is required, especially for inputs with a long, complex edit history.

jpolchlo commented 4 years ago

This experiment shows that OSM is going to pose a problem, and that the simple strategy of rebinning isn't going to suffice. We'll need to come up with something better.