Tool to apply diffs to an ORC file to update it

mojodna commented 6 years ago

As discussed in #25, we should create a tool that reads the replication sequence number from the ORC file, fetches all OsmChange diffs published since that sequence from planet.osm.org (or an S3 mirror, also TBD), applies them to the ORC file as additional rows, and writes the result back out to S3. How to read the sequence number is TBD: it should be in the OSM PBF metadata, but osm2orc doesn't currently copy it to the ORC user metadata. Ideally the diffs would be minutely, to capture incremental changes. The output would probably be partitioned and probably unsorted, since that makes the task faster and the output is not intended to be downloaded or used for purposes that assume sorting.
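
A minimal sketch of the fetch-and-convert steps described above, not the eventual tool. It assumes the standard minutely replication layout (a `state.txt` file containing `sequenceNumber=...`, and diffs stored at `NNN/NNN/NNN.osc.gz`). The base URL, the function names, and the row columns are all hypothetical choices for illustration; the real ORC schema and S3 mirror location are not settled in this thread.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List

# Assumed base URL for minutely diffs; an S3 mirror would substitute its own.
BASE_URL = "https://planet.openstreetmap.org/replication/minute"


def parse_state(text: str) -> int:
    """Extract the sequence number from the contents of a state.txt file."""
    for line in text.splitlines():
        if line.startswith("sequenceNumber="):
            return int(line.split("=", 1)[1])
    raise ValueError("no sequenceNumber found")


def diff_path(sequence: int) -> str:
    """Map a sequence number to its relative path: 1234567 -> 001/234/567."""
    if not 0 <= sequence <= 999_999_999:
        raise ValueError("sequence number out of range")
    padded = f"{sequence:09d}"
    return "/".join((padded[0:3], padded[3:6], padded[6:9]))


def pending_diff_urls(
    last_applied: int, current: int, base_url: str = BASE_URL
) -> Iterator[str]:
    """Yield URLs of every diff after last_applied, up to and including current."""
    for sequence in range(last_applied + 1, current + 1):
        yield f"{base_url}/{diff_path(sequence)}.osc.gz"


def osmchange_rows(xml_text: str) -> List[Dict[str, object]]:
    """Flatten an OsmChange document into one row per element version.

    The columns are a hypothetical subset; deletions become rows with
    visible=False rather than removing anything, so rows are only appended.
    """
    rows: List[Dict[str, object]] = []
    root = ET.fromstring(xml_text)
    for action in root:  # <create>, <modify>, <delete>
        for element in action:  # <node>, <way>, <relation>
            rows.append(
                {
                    "type": element.tag,
                    "id": int(element.get("id")),
                    "version": int(element.get("version")),
                    "visible": action.tag != "delete",
                    "tags": {t.get("k"): t.get("v") for t in element.findall("tag")},
                }
            )
    return rows
```

Given the sequence number stored with the ORC file and the one in the current `state.txt`, `pending_diff_urls(stored, current)` lists what needs fetching; each downloaded and decompressed diff can then go through `osmchange_rows` to produce the rows to append.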

lossyrob commented 6 years ago

This functionality could be used for:

- Updating a DataFrame to the latest OSM data for running analytics against the most up-to-date data possible
- Updating an ORC file in a way that doesn't have to wait for a database dump

Either way, this would serve ephemeral batch jobs, where the code most likely applies no more than one update to the DataFrame.

mojodna commented 5 years ago

I made some preliminary progress on this by adapting the streaming replication sources so that they could be used as non-streaming sources. No branch yet.

mojodna commented 5 years ago

This should be a standalone tool. #105 includes the necessary plumbing to make it work and #119 is necessary to make it work reliably.

azavea / osmesa

Tool to apply diffs to an ORC file to update it #48