azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark
Apache License 2.0

Analytic VectorTiles from OSM History #28

Open echeipesh opened 6 years ago

echeipesh commented 6 years ago

Generate a layer of Analytic VectorTiles containing OSM geometries derived from OSM history, so that each feature carries accurate created and modified timestamp metadata.

Generating Features with Metadata:

In the OSM data model, only Nodes contain spatial data; Ways refer to them to define lines, boundaries, and polygons. Relations may refer to any Nodes, Ways, or other Relations. Depending on their tags, some Relations have geographic meaning (e.g. defining a multipolygon with its holes) while others have only semantic/labelling meaning.

When OSM is updated, new records with updated changeset/version information are introduced, but they retain their previous ID, and other records that refer to the changed record are not updated.

[Figure: OSM connected component]

For instance, it is common to change the location of one Node in a Way that is part of a Relation representing a multipolygon, without updating either the Way or the Relation. This change to a constituent point needs to be propagated as a change to the geographic feature.
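To make the propagation concrete, here is a minimal sketch in plain Python (not the Spark code this issue is about). The `references` map and the `affected_by` helper are hypothetical names made up for illustration; the map stands in for the member lists of Ways and Relations.

```python
from collections import defaultdict

# Hypothetical, simplified records: (type, id) -> the (type, id) members it references.
references = {
    ("way", 10): [("node", 1), ("node", 2), ("node", 3)],
    ("relation", 100): [("way", 10)],
    ("way", 11): [("node", 4), ("node", 5)],
}

def affected_by(changed, references):
    """Return every record that directly or transitively references `changed`."""
    # Invert the reference map: member -> records that refer to it.
    referrers = defaultdict(set)
    for parent, members in references.items():
        for member in members:
            referrers[member].add(parent)

    # Walk "upwards" from the changed record to everything that depends on it.
    affected, stack = set(), [changed]
    while stack:
        current = stack.pop()
        for parent in referrers[current]:
            if parent not in affected:
                affected.add(parent)
                stack.append(parent)
    return affected

# Moving node 2 changes way 10 and, through it, relation 100.
print(sorted(affected_by(("node", 2), references)))
```

Neither way 10 nor relation 100 gets a new version in OSM when node 2 moves, which is why the dependency has to be resolved at processing time.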

Proposed Approach

The proposed approach to dealing with this problem is to separate the task of grouping all related OSM records (records are related if one references another by its ID field) from the task of converting the grouped records into a geometric feature with some metadata.

The base assumption is that each group will form a directionally connected component that can fit in memory. The second stage is converting this connected component to a geographic feature with some metadata. Because the rules for generating features and their metadata are varied, and ultimately application dependent, keeping them separate from the shuffle logic has organizational benefits.

Here are some true things about these connected components:
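The grouping stage could look roughly like the sketch below. It is plain Python with union-find rather than the distributed join/shuffle this issue proposes, and it takes one possible reading of "connected component": records are grouped whenever any reference links them, ignoring direction. The function name and the sample data are assumptions for illustration only.

```python
def connected_components(references):
    """Group records into components, treating each reference as an undirected link."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for record, members in references.items():
        find(record)  # register records that have no members
        for member in members:
            union(record, member)

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return list(groups.values())

# Hypothetical sample: relation 100 ties ways 10 and 12 together; way 11 stands alone.
sample = {
    ("relation", 100): [("way", 10), ("way", 12)],
    ("way", 10): [("node", 1), ("node", 2), ("node", 3)],
    ("way", 12): [("node", 3), ("node", 6)],
    ("way", 11): [("node", 4), ("node", 5)],
}
components = connected_components(sample)
```

Each resulting set is what the second stage would receive in order to build a feature and its metadata. Under this reading, node 3 (shared by ways 10 and 12) pulls both ways into one group instead of being duplicated.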

Note: If the in-memory assumption does not hold, we will have to define conditions for how and when to duplicate records that are shared by more than one graph.

Notes

The current version of vectorpipe employs a similar strategy in its osm module but runs into performance problems. Performing the joins in a DataFrame context, rather than an RDD context, appears to address these concerns, likely because it removes the need for intermediate per-row object allocation during the joins and instead maintains an Array of primitive values.

As part of this task we should pursue the DataFrame approach and add the resulting work to the vectorpipe package.

Sub tasks:

jenningsanderson commented 6 years ago

Yes! This is a hard problem that needs to be addressed! :wave: I'm currently doing my dissertation around some of these problems, would love to chat more.

If helpful to your thinking about this, here is some ongoing work on incorporating historical data into the current osm-qa-tile schema that has proved itself: https://github.com/mapbox/osm-wayback#historical-feature-schema-for-tags

Also, for the geometry versioning (the hardest part of all this), I have a proof-of-concept around this idea. It is based on a separate geom_version versioning system for objects whose underlying nodes change geometry between the object's major version numbers. Each object then has a created_at and updated_at timestamp that power MapboxGL queries to render the geometry at that point in time. (Note, you'll need to slide the red marker on the timeline back to mid 2017 to populate the map on the right.) In this case, each unique (version, geom_version) pair of an object is its own feature in a vector tile.

echeipesh commented 6 years ago

Hey, thanks for reaching out, osm-wayback looks really cool, pretty close to what we're trying to do here. Is the plan to do this for all of OSM as well?

I'd also be curious to hear more about the geom_version scheme. Is it just the sum of all the versions of the participating records? The ambiguous part, I imagine, is a version log like this:

(node:0, node:0, way:0)
(node:1, node:0, way:0)
(node:2, node:0, way:0)
(node:2, node:0, way:1)
(node:2, node:1, way:1)

I'd love to chat more, you can find "us" and me on gitter here: https://gitter.im/geotrellis/geotrellis or we can set up a hangout at some point.

mojodna commented 6 years ago

@jenningsanderson some sample tiles from a prototype vector tile generation process (now incorporated into OSMesa in a similar form) are here:

https://mojodna-temp.s3.amazonaws.com/rhode-island-all/{z}/{z}/{y}.mvt (zooms 12-15 covering Rhode Island, but with identical data for each)

Here's a viewer (with styling inspired by Alan's Every Line Ever, Every Point Ever): https://bl.ocks.org/mojodna/c499a2352993321c1515b6e61de4fc6d

Minor versions of geometries are present in this schema (w/ minorVersion), triggered by each changeset in which the resulting geometry would have changed. updatedAt is the timestamp at which the geometry changed; validUntil is the timestamp at which it was superseded by a new geometry (if empty, it's currently valid).
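As a rough illustration of that schema (a sketch, not the OSMesa implementation), the function below walks a time-ordered list of geometry states for one way and assigns minorVersion, updatedAt, and validUntil. The input shape and the function name are hypothetical, and restarting minorVersion at 0 for each major version is an assumption.

```python
def assign_versions(states):
    """states: time-ordered (timestamp, version, geometry) tuples, one per
    changeset that touched the way or any of its nodes."""
    features = []
    for timestamp, version, geometry in states:
        previous = features[-1] if features else None
        same_major = previous is not None and previous["version"] == version
        if same_major and previous["geometry"] == geometry:
            continue  # this changeset did not alter the resulting geometry
        # Assumption: minorVersion restarts at 0 with every new major version.
        minor = previous["minorVersion"] + 1 if same_major else 0
        if previous is not None:
            previous["validUntil"] = timestamp  # superseded by the new geometry
        features.append({
            "version": version,
            "minorVersion": minor,
            "geometry": geometry,
            "updatedAt": timestamp,
            "validUntil": None,  # empty means currently valid
        })
    return features

# Hypothetical history of one way.
states = [
    ("2015-01-01", 1, [(0, 0), (1, 1)]),          # way created
    ("2015-06-01", 1, [(0, 0), (1, 2)]),          # a node moved, way untouched
    ("2015-07-01", 1, [(0, 0), (1, 2)]),          # node tags edited, same geometry
    ("2016-02-01", 2, [(0, 0), (1, 2), (2, 2)]),  # way edited: new major version
]
versions = assign_versions(states)
```

Here the node move yields (version 1, minorVersion 1), the tag-only edit yields nothing, and only the last feature keeps an empty validUntil.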

kamicut commented 6 years ago

@mojodna The example has stopped working because tangram is no longer served from the Mapzen URL. You can find it at nextzen: https://www.nextzen.org/tangram/0.14/tangram.min.js

jenningsanderson commented 6 years ago

@mojodna - this sounds incredible - I'd love to see what it looks like in the viewer, but the bl.ocks is broken because of the missing tangram library - can you update that when you get a chance? Thanks!

kamicut commented 6 years ago

@jenningsanderson I forked the block and updated the library here so that you can view it: https://bl.ocks.org/kamicut/a38118fdc8845e6660952726f24dc4e2

mojodna commented 6 years ago

@kamicut thanks for covering me while I was on vacation!

I've updated the original gist, so the link ^^ should work again in a bit.

mojodna commented 6 years ago

While talking with @jenningsanderson, @lossyrob, and @bhousel this morning, we discussed partitioning vector tile outputs by year (including data that overlaps the beginning or end of a year) and storing it for long-term use; this would shrink the size of individual tiles and allow us to only update tiles for the current year once the schema stops evolving.

@jenningsanderson also mentioned that the existing QA tiles are not buffered (and that that hasn't caused any problems).

kamicut commented 6 years ago

> and allow us to only update tiles for the current year once the schema stops evolving

Unsure what this means. Is updating a difficult task, or does partitioning allow for more use cases?

jenningsanderson commented 6 years ago

@kamicut, I see use cases for both: for analysis-specific tiles where rendering isn't important, keeping data all together makes sense (especially if doing a tile-reduce analysis); for rendering, however, tileset sizes can be dramatically reduced by partitioning by year... as can the time for creation. The idea is that a historical tileset will never change, so it really only ever needs to be generated once. Using these tiles to render the map 'at any point in time' then requires loading the proper layer / tileset for the requested year -- just another way to keep tilesets built for rendering lightweight.
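A small sketch of what the per-year partitioning might look like, assuming each feature carries the updatedAt / validUntil timestamps described earlier in the thread. The function name, the feature dictionaries, and the "id" field are hypothetical. A feature whose validity overlaps a year boundary is written to every year it touches.

```python
from collections import defaultdict
from datetime import datetime

def partition_by_year(features, now):
    """Map each calendar year to the ids of the features valid at some point in it."""
    partitions = defaultdict(list)
    for feature in features:
        # An empty validUntil means the feature is still valid today.
        end = feature["validUntil"] or now
        for year in range(feature["updatedAt"].year, end.year + 1):
            partitions[year].append(feature["id"])
    return dict(partitions)

# Hypothetical features: "a" was superseded by "b" in May 2017.
features = [
    {"id": "a", "updatedAt": datetime(2015, 3, 1), "validUntil": datetime(2017, 5, 1)},
    {"id": "b", "updatedAt": datetime(2017, 5, 1), "validUntil": None},
]
by_year = partition_by_year(features, now=datetime(2018, 6, 1))
```

In this sample the 2015 and 2016 partitions contain only "a" and would never need regenerating; only the partition for the current year keeps changing.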