Open mojodna opened 6 years ago
This functionality could be used for
Either way this would be servicing ephemeral batch jobs, where the code is most likely not doing more than one update to the DataFrame.
I made some preliminary progress on this by adapting the streaming replication sources so that they could be used as non-streaming sources. No branch yet.
This should be a standalone tool. #105 includes the necessary plumbing to make it work and #119 is necessary to make it work reliably.
As discussed in #25, we should create a tool that checks the replication sequence number from the ORC file (which is a TBD; it should be in the OSM PBF metadata but isn't currently copied to the ORC user metadata by `osm2orc`), fetches all OsmChange diffs (ideally minutely to capture incremental changes) from `planet.osm.org` (or an S3 mirror, also TBD), and applies them to the ORC file as additional rows, writing the result back out to S3 (probably partitioned, probably unsorted since that makes the task faster and the output is not intended to be downloaded / used for purposes that assume sorting).
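To make the "fetch all diffs since the last sequence number" step concrete, here is a rough sketch in Python of how the list of minutely diff URLs could be derived. It assumes the usual replication layout, where the sequence number is zero-padded to nine digits and split into three path segments. The names `diff_url` and `pending_diff_urls` are hypothetical, and both sequence numbers are passed in as plain integers: where the last-applied number lives in the ORC metadata is still TBD, and the current number would come from the replication `state.txt` (or a mirror's equivalent).

```python
from typing import Iterator

# Assumed base URL for minutely diffs; an S3 mirror would substitute its own.
BASE_URL = "https://planet.osm.org/replication/minute"


def diff_url(sequence: int, base_url: str = BASE_URL) -> str:
    """Build the OsmChange diff URL for one replication sequence number."""
    if not 0 <= sequence <= 999_999_999:
        raise ValueError(f"sequence number out of range: {sequence}")
    # Zero-pad to nine digits, then split into AAA/BBB/CCC path segments.
    padded = f"{sequence:09d}"
    return f"{base_url}/{padded[0:3]}/{padded[3:6]}/{padded[6:9]}.osc.gz"


def pending_diff_urls(
    last_applied: int, current: int, base_url: str = BASE_URL
) -> Iterator[str]:
    """Yield URLs for every diff after `last_applied`, up to and including `current`."""
    for sequence in range(last_applied + 1, current + 1):
        yield diff_url(sequence, base_url)


if __name__ == "__main__":
    # Hypothetical values: the ORC file is at 998, upstream is at 1000.
    for url in pending_diff_urls(998, 1000):
        print(url)
```

Each URL would then be downloaded, parsed as OsmChange, and appended to the ORC data as new rows; that part depends on the plumbing in #105 and is not sketched here.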