geotrellis / vectorpipe

Convert Vector data to VectorTiles with GeoTrellis.
https://geotrellis.github.io/vectorpipe/

Improve robustness of changeset lookup #146

Closed jpolchlo closed 3 years ago

jpolchlo commented 3 years ago

Overview

In vectorpipe.sources.ChangesetSource, there is a set of functions for locating the changeset in a replication stream that corresponds to a given date. These functions can fail for a variety of reasons, leading to unexplained None.get errors such as:

20/12/28 20:59:31 ERROR Client: Application diagnostics message: User class threw exception: java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
    at vectorpipe.sources.ChangesetSource$.estimateSequenceNumber(ChangesetSource.scala:120)
    at vectorpipe.sources.ChangesetSource$.findSequenceFor(ChangesetSource.scala:125)
...
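
For readers unfamiliar with this failure mode, the following is a minimal sketch of how a bare None.get arises and how the same lookup could report what was missing. It is not taken from ChangesetSource; fetchState, unsafe, and descriptive are hypothetical names used only for illustration.

```scala
object NoneGetSketch {
  // Hypothetical stand-in for a remote lookup that sometimes yields nothing.
  // Here, odd sequence numbers simulate a missing resource.
  def fetchState(sequence: Int): Option[String] =
    if (sequence % 2 == 0) Some(s"state-$sequence") else None

  // Calling .get on None throws java.util.NoSuchElementException("None.get"),
  // which says nothing about which sequence was being looked up.
  def unsafe(sequence: Int): String =
    fetchState(sequence).get

  // The same lookup, but the failure case names the missing sequence.
  def descriptive(sequence: Int): Either[String, String] =
    fetchState(sequence).toRight(s"no state available for sequence $sequence")
}
```

Returning an Either (or an Option with an explicit fallback) keeps the failure visible to the caller instead of surfacing it as an exception deep in the stack.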

These errors are transient and therefore hard to test. The fixes in this contribution attempt to define better fallback behavior. For instance, when the ###.state.txt file is not available for some sequence number (this is true for many sequences in OSM proper), we defer to the associated ###.osm.gz and attempt to estimate the last_run for that sequence from the changesets it contains.
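
The fallback described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the Sequence case class and estimateLastRun are hypothetical, the fetching and parsing of the remote files is left out, and using the latest changeset timestamp as the estimate is an assumption about what "estimate from the contained changesets" means.

```scala
import java.time.Instant

object FallbackSketch {
  // Hypothetical model of one replication sequence. In practice these values
  // would come from fetching and parsing ###.state.txt and ###.osm.gz.
  final case class Sequence(
    number: Int,
    stateTimestamp: Option[Instant], // last_run from ###.state.txt, if the file exists
    changesetTimes: Seq[Instant]     // timestamps of changesets found in ###.osm.gz
  )

  // Prefer the state file. If it is missing, approximate last_run with the
  // latest changeset timestamp (an assumption). Return None only when neither
  // source yields a value, so the caller decides how to proceed.
  def estimateLastRun(seq: Sequence): Option[Instant] =
    seq.stateTimestamp.orElse(
      if (seq.changesetTimes.isEmpty) None
      else Some(seq.changesetTimes.maxBy(_.toEpochMilli))
    )
}
```

Chaining the sources with orElse means the less precise estimate is consulted only when the state file is unavailable, and no branch calls .get on an Option.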

These approximations may lead to substantially worse estimates, and therefore longer runtimes, but processes will not fail suddenly. If there were any indication of why functions such as getCurrentSequence fail, we might be able to implement more targeted fixes. This PR offers naïve solutions that will hopefully lower the failure rate.

Testing Instructions

I am open to suggestions for how to test this without setting up a mock replication stream, which may be time-consuming to achieve.

Checklist

Closes #143