Overview

In vectorpipe.sources.ChangesetSource, there is a set of functions for locating the changeset in a replication stream that corresponds to a given date. These functions can fail for a variety of reasons, leading to unexplained None.get errors like

20/12/28 20:59:31 ERROR Client: Application diagnostics message: User class threw exception: java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
    at vectorpipe.sources.ChangesetSource$.estimateSequenceNumber(ChangesetSource.scala:120)
    at vectorpipe.sources.ChangesetSource$.findSequenceFor(ChangesetSource.scala:125)
...

These errors are transient and therefore hard to test. The fixes in this contribution attempt to define better fallback behavior. For instance, when the ###.state.txt file is not available for some sequence number (this is true for many sequences in OSM proper), we defer to the associated ###.osm.gz and attempt to estimate the last_run for that sequence from the changesets it contains.
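The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SequenceFallbackSketch`, `StateLookup`, `ChangesetLookup`, and `estimateTimestamp` are hypothetical names, and taking the latest changeset timestamp as the estimate is an assumption. The point is that every step returns an `Option`, so a missing file yields `None` rather than a `None.get` exception.

```scala
import java.time.Instant

object SequenceFallbackSketch {
  // Hypothetical stand-ins for the two remote lookups (not vectorpipe APIs).
  // StateLookup: timestamp parsed from ###.state.txt, if that file exists.
  type StateLookup = Long => Option[Instant]
  // ChangesetLookup: timestamps of the changesets found in ###.osm.gz.
  type ChangesetLookup = Long => Seq[Instant]

  // Prefer the state file; otherwise estimate from the changesets themselves.
  // Assumption: the latest changeset timestamp approximates last_run.
  def estimateTimestamp(
    sequence: Long,
    fromState: StateLookup,
    fromChangesets: ChangesetLookup
  ): Option[Instant] =
    fromState(sequence).orElse {
      // reduceOption returns None for an empty sequence instead of throwing
      fromChangesets(sequence).reduceOption((a, b) => if (a.isAfter(b)) a else b)
    }
}
```

Because the lookups are passed in as functions, a caller can decide what to do when both sources are missing (for example, probing a neighboring sequence) instead of failing inside the estimate.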

These approximations may lead to substantially worse estimates, and therefore longer runtimes, but processes will no longer fail suddenly. If there were any indication of why functions such as getCurrentSequence fail, we might be able to implement more targeted fixes. This PR offers naïve solutions that will hopefully lower the failure rate.

Testing Instructions

I am open to suggestions for how to test this without setting up a mock replication stream, which may be time-consuming.

Checklist

[X] Add entry to CHANGELOG.md

Closes #143

geotrellis / vectorpipe

Improve robustness of changeset lookup #146
