azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark
Apache License 2.0

Tool to apply diffs to an ORC file to update it #48

Open mojodna opened 6 years ago

mojodna commented 6 years ago

As discussed in #25, we should create a tool that applies OsmChange diffs to an ORC file to bring it up to date. It would first check the replication sequence number from the ORC file. How that is stored is TBD: it should be in the OSM PBF metadata, but osm2orc doesn't currently copy it to the ORC user metadata. It would then fetch all OsmChange diffs since that sequence number (ideally minutely, to capture incremental changes) from planet.osm.org (or an S3 mirror, also TBD) and apply them to the ORC file as additional rows. Finally, it would write the result back out to S3, probably partitioned and probably unsorted, since that makes the task faster and the output is not intended to be downloaded or used for purposes that assume sorting.
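For illustration, the "which diffs do we need to fetch" step might look roughly like the sketch below. It is written in Python only for brevity and is not OSMesa code. It assumes the public minutely replication layout, where a sequence number is zero-padded to nine digits and split into three path segments, and where `state.txt` carries a `sequenceNumber=` line. The helper names (`parse_sequence_number`, `diff_url`, `pending_diff_urls`) and the `BASE_URL` constant are hypothetical, and how the ORC file's own sequence number gets read is still the TBD mentioned above, so it is simply passed in.

```python
import re

# Assumption: the public minutely replication endpoint. An S3 mirror (TBD)
# would use a different base URL with the same directory layout.
BASE_URL = "https://planet.openstreetmap.org/replication/minute"


def parse_sequence_number(state_text: str) -> int:
    """Extract sequenceNumber from the contents of a replication state.txt."""
    match = re.search(r"^sequenceNumber=(\d+)\s*$", state_text, re.MULTILINE)
    if match is None:
        raise ValueError("no sequenceNumber found in state file")
    return int(match.group(1))


def diff_url(sequence: int, base_url: str = BASE_URL) -> str:
    """Build the OsmChange URL for one replication sequence number."""
    padded = f"{sequence:09d}"  # e.g. 3345678 -> "003345678"
    return f"{base_url}/{padded[0:3]}/{padded[3:6]}/{padded[6:9]}.osc.gz"


def pending_diff_urls(orc_sequence: int, upstream_sequence: int,
                      base_url: str = BASE_URL) -> list:
    """URLs of every diff newer than the ORC file, oldest first.

    orc_sequence is whatever sequence number the ORC file was built from
    (hypothetical input; where it is stored is not settled in this issue).
    """
    return [diff_url(s, base_url)
            for s in range(orc_sequence + 1, upstream_sequence + 1)]
```

A caller would fetch the upstream `state.txt`, pass its text to `parse_sequence_number`, and hand the resulting range of URLs to whatever downloads and parses the diffs. Applying them as additional rows and writing back to S3 is deliberately left out here.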

lossyrob commented 6 years ago

This functionality could be used for

Either way this would be servicing ephemeral batch jobs, where the code is most likely not doing more than one update to the DataFrame.

mojodna commented 5 years ago

I made some preliminary progress on this by adapting the streaming replication sources so that they could be used as non-streaming sources. No branch yet.

mojodna commented 5 years ago

This should be a standalone tool. #105 includes the necessary plumbing, and #119 is needed to make it work reliably.