Open echeipesh opened 6 years ago
Yes! This is a hard problem that needs to be addressed! :wave: I'm currently doing my dissertation around some of these problems, would love to chat more.
If helpful to your thinking about this, here is some ongoing work on incorporating historical data into the current osm-qa-tile schema that has proved itself: https://github.com/mapbox/osm-wayback#historical-feature-schema-for-tags
Also, for the geometry versioning (the hardest part of all this), I have a proof-of-concept around this idea. This is based on a separate `geom_version` versioning system for objects whose underlying nodes change geometries between their major version numbers. Each object then has a `created_at` and `updated_at` timestamp that power MapboxGL queries to render the geometry at that point in time. (Note: you'll need to slide the red marker on the timeline back to mid-2017 to populate the map on the right.) In this case, each unique (`version`, `geom_version`) of an object is its own feature in a vector tile.
Hey, thanks for reaching out! `osm-wayback` looks really cool, pretty close to what we're trying to do here. Is the plan to do this for all of OSM as well?
I'd also be curious to hear more about the `geom_version` scheme. Is it just the sum of all the versions of participating records? The ambiguous part, I imagine, is a version log like this:

```
(node:0, node:0, way:0)
(node:1, node:0, way:0)
(node:2, node:0, way:0)
(node:2, node:0, way:1)
(node:2, node:1, way:1)
```
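For illustration, here's a minimal sketch (Python; the helper is hypothetical, not part of any existing tool) of the "sum of participating versions" idea and why it can be ambiguous:

```python
def geom_version(versions):
    """Naive scheme: geom_version is the sum of all participating
    record versions (here, two nodes plus the way itself)."""
    return sum(versions)

# The version log from the discussion, as (node, node, way) triples.
log = [
    (0, 0, 0),
    (1, 0, 0),
    (2, 0, 0),
    (2, 0, 1),
    (2, 1, 1),
]

geom_versions = [geom_version(v) for v in log]
# The sums happen to be distinct for this log, but the scheme is
# ambiguous in general: (node:2, node:0, way:0) and (node:1, node:1,
# way:0) both sum to 2, so the sum alone cannot tell you which
# constituent record changed.
```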
I'd love to chat more. You can find "us" and me on gitter here: https://gitter.im/geotrellis/geotrellis or we can set up a hangout at some point.
@jenningsanderson some sample tiles from a prototype vector tile generation process (now incorporated into OSMesa in a similar form) are here:
https://mojodna-temp.s3.amazonaws.com/rhode-island-all/{z}/{x}/{y}.mvt (zooms 12-15 covering Rhode Island, but with identical data for each)
Here's a viewer (with styling inspired by Alan's Every Line Ever, Every Point Ever): https://bl.ocks.org/mojodna/c499a2352993321c1515b6e61de4fc6d
Minor versions of geometries are present in this schema (as `minorVersion`), triggered by each changeset in which the resulting geometry would have changed. `updatedAt` is the timestamp at which the geometry changed; `validUntil` is the timestamp at which it was superseded by a new geometry (if empty, it's currently valid).
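A small sketch (Python; the feature dicts and the helper are illustrative, not the actual tile schema code) of how `updatedAt`/`validUntil` select the geometry valid at a point in time:

```python
from datetime import datetime

def geometry_at(features, t):
    """Return the feature versions valid at time t, following the
    convention described above: a geometry is valid from updatedAt
    until validUntil; an empty (None) validUntil means it is still
    the current geometry."""
    out = []
    for f in features:
        start = f["updatedAt"]
        end = f.get("validUntil")  # None => currently valid
        if start <= t and (end is None or t < end):
            out.append(f)
    return out

# Two minor versions of the same way: the geometry changed mid-2017.
features = [
    {"id": 1, "minorVersion": 0,
     "updatedAt": datetime(2016, 3, 1), "validUntil": datetime(2017, 6, 1)},
    {"id": 1, "minorVersion": 1,
     "updatedAt": datetime(2017, 6, 1), "validUntil": None},
]

mid_2016 = geometry_at(features, datetime(2016, 8, 1))  # minorVersion 0
today = geometry_at(features, datetime(2020, 1, 1))     # minorVersion 1
```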
@mojodna The example has stopped working because tangram is no longer served from the Mapzen URL. You can find it at nextzen: https://www.nextzen.org/tangram/0.14/tangram.min.js
@mojodna - this sounds incredible - I'd love to see what it looks like in the viewer, but the bl.ocks is broken because of the missing Tangram library - can you update that when you get a chance? Thanks!
@jenningsanderson I forked the block and updated the library here so that you can view it: https://bl.ocks.org/kamicut/a38118fdc8845e6660952726f24dc4e2
@kamicut thanks for covering me while I was on vacation!
I've updated the original gist, so the link ^^ should work again in a bit.
While talking with @jenningsanderson, @lossyrob, and @bhousel this morning, we discussed partitioning vector tile outputs by year (including data that overlaps the beginning or end of a year) and storing it for long-term use; this would shrink the size of individual tiles and allow us to only update tiles for the current year once the schema stops evolving.
@jenningsanderson also mentioned that the existing QA tiles are not buffered (and that that hasn't caused any problems).
> and allow us to only update tiles for the current year once the schema stops evolving

Unsure what this means: is updating a difficult task, or does partitioning allow for more use cases?
@kamicut, I see use cases for both: for analysis-specific tiles where rendering isn't important, keeping data all together makes sense (especially if doing a tile-reduce analysis); for rendering, however, tileset sizes can be dramatically reduced by partitioning by year, as can the time needed to create them. The idea is that a historical tileset never changes, so it only ever needs to be generated once. Using these tiles to render the map "at any point in time" then requires loading the proper layer/tileset for the requested year -- just another way to keep tilesets built for rendering lightweight.
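The year-partition idea can be sketched as follows (Python; the helper name and the current-year default are assumptions for illustration, not part of any existing tool):

```python
def year_partitions(updated_at_year, valid_until_year, current_year=2018):
    """Assign a feature to every yearly partition its validity overlaps,
    including the years in which it begins or ends.  A None
    valid_until_year means the geometry is still valid, so it lands in
    every partition up to the current year."""
    end = valid_until_year if valid_until_year is not None else current_year
    return list(range(updated_at_year, end + 1))

# A geometry valid 2015-2017 lands in three yearly tilesets, all of
# which are historical and only ever need to be generated once...
parts = year_partitions(2015, 2017)       # [2015, 2016, 2017]
# ...while a still-valid geometry is the only kind that forces updates
# to the current year's tileset.
open_parts = year_partitions(2016, None)  # [2016, 2017, 2018]
```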
Generate a layer of analytic vector tiles with OSM geometries from OSM history, in order to produce them with accurate created/modified timestamp metadata.
Generating Features with Metadata:
In the OSM data model, only Nodes contain spatial data; they may be referred to by Ways to define lines, boundaries, and polygons. Relations may refer to any of the Nodes, Ways, or other Relations. Depending on tag information, some Relations have geographic meaning (e.g. defining a multipolygon with its holes), while others have only semantic/labelling meaning.

When updates to OSM happen, new records with updated changeset/version information are introduced, but they retain the previous ID, and other records referring to the changed record are not updated.

For instance, it is common to change the location of one Node in a Way that is part of a Relation representing a multipolygon, without updating either the Way or the Relation. This change to a constituent point needs to be propagated as a change to the geographic feature.
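A minimal sketch of this dependency lookup (Python; the function and the data structures are hypothetical, for illustration only):

```python
def affected_features(changed_node, way_nodes, relation_members):
    """Given a changed node id, find the ways and relations whose
    geometries must be re-derived, even though their own version
    numbers did not change.  way_nodes maps way id -> node ids;
    relation_members maps relation id -> member way ids."""
    ways = {w for w, nodes in way_nodes.items() if changed_node in nodes}
    relations = {r for r, members in relation_members.items()
                 if ways & set(members)}
    return ways, relations

way_nodes = {10: [1, 2, 3], 11: [4, 5]}
relation_members = {100: [10, 11]}  # e.g. a multipolygon relation

ways, rels = affected_features(2, way_nodes, relation_members)
# Moving node 2 dirties way 10 and relation 100, but not way 11.
```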
Proposed Approach
The proposed approach is to separate the task of grouping all related OSM records (records are related if one references another by its ID field) from the task of converting the grouped records into a geometric feature with some metadata.

The base assumption is that each group will form a directionally connected component that can fit in memory. The second stage converts this connected component into a geographic feature with some metadata. Because the rules for generating features and their metadata are varied, and ultimately application-dependent, keeping them separate from the shuffle logic has organizational benefits.
Here are some true things about these connected components:
Note: If the in-memory assumption does not hold, we will have to define conditions for how or when to duplicate records that are shared by more than one graph.
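The grouping stage can be sketched with a small union-find over reference edges (Python; ids like `"n1"`/`"w10"`/`"r100"` are hypothetical, and a real implementation would run distributed in Spark rather than in memory like this):

```python
def connected_components(edges):
    """Group records into connected components, where an edge (a, b)
    means record a references record b by its ID field."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in edges:
        union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# A relation referencing a way that references two nodes, plus an
# unrelated way/node pair, form two separate components.
edges = [("r100", "w10"), ("w10", "n1"), ("w10", "n2"), ("w11", "n3")]
comps = connected_components(edges)
```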
Notes
The current version of `vectorpipe` employs a similar strategy in its `osm` module but runs into performance problems. It appears that performing the joins in DataFrame, rather than RDD, context addresses these concerns, likely because it removes the need for intermediate, per-row object allocation when performing the joins and instead maintains an `Array` of primitive values. As part of this task we should pursue the DataFrame approach and add the resulting work to the `vectorpipe` package.

Sub-tasks:

- `Feature[G, VersionMetadata]`