azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark

Further investigate changeset discrepancies in the augmented diff stream #186

Open CloudNiner opened 4 years ago

CloudNiner commented 4 years ago

In #165 we fixed the total_edit counts for the changesets table via the ChangesetStatsForeachWriter. During testing, we found additional discrepancies between values in the changes db jsonb field for changesets computed via augmented diff versus imported from an OSM export. These should match, and OSM export is treated as the source of truth.

This epic is meant to encapsulate smaller tasks that are responsible for testing the augmented diff stream and the overpass diff publisher to figure out where the discrepancies arise.

The changeset 509539 in production (generated by aug diff):

counts          | {"other_modified": 5920, "roads_modified": 35, "waterways_modified": 8, "coastlines_modified": 3}

vs staging (generated via OSM extract):

counts          | {"other_modified": 5492, "roads_modified": 41, "waterways_modified": 8, "coastlines_modified": 4}

These are different but should be the same. At the time of writing, we assume that the staging value generated via OSM extract is the correct one.
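To make the gap concrete, here is a minimal sketch that diffs the two `counts` values quoted above. The function name `diff_counts` is made up for illustration and is not part of OSMesa; the numbers are copied from the production and staging rows.

```python
import json


def diff_counts(aug_diff_counts, extract_counts):
    """Return {key: aug_diff - extract} for every key whose values differ."""
    deltas = {}
    for key in sorted(set(aug_diff_counts) | set(extract_counts)):
        # A key missing on one side is treated as a count of zero.
        delta = aug_diff_counts.get(key, 0) - extract_counts.get(key, 0)
        if delta != 0:
            deltas[key] = delta
    return deltas


# Values for changeset 509539, copied from the rows above.
production = json.loads(
    '{"other_modified": 5920, "roads_modified": 35, '
    '"waterways_modified": 8, "coastlines_modified": 3}'
)
staging = json.loads(
    '{"other_modified": 5492, "roads_modified": 41, '
    '"waterways_modified": 8, "coastlines_modified": 4}'
)

print(diff_counts(production, staging))
# {'coastlines_modified': -1, 'other_modified': 428, 'roads_modified': -6}
```

Note that the deltas go in both directions: the augmented diff side is higher for `other_modified` but lower for roads and coastlines.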

mojodna commented 4 years ago

Since augmented diffs are published minutely, it’s possible to see multiple edits to the same element within a changeset (especially when using JOSM, since it does intermediate saving) that get folded together when processing a history file. That could explain why the number is higher.
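A rough illustration of that effect, not OSMesa's actual counting logic: the record shape (`sequence`, `changeset`, `type`, `id`) and both function names are assumptions made up for this sketch. Counting once per augmented diff sees the same element twice when an intermediate save lands in a later minute, while folding by element within the changeset sees it once.

```python
from collections import Counter


def count_per_diff(edits):
    """Count every appearance of an element, once per augmented diff it shows up in."""
    return Counter(edit["changeset"] for edit in edits)


def count_folded(edits):
    """Count each element at most once per changeset, ignoring which diff it came from."""
    seen = {(edit["changeset"], edit["type"], edit["id"]) for edit in edits}
    return Counter(changeset for changeset, _type, _id in seen)


# Hypothetical edits: way 1 is saved twice in changeset 100, in two different minutes.
edits = [
    {"sequence": 1, "changeset": 100, "type": "way", "id": 1},
    {"sequence": 2, "changeset": 100, "type": "way", "id": 1},
    {"sequence": 2, "changeset": 100, "type": "node", "id": 7},
]

print(count_per_diff(edits)[100])  # 3
print(count_folded(edits)[100])    # 2
```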

CloudNiner commented 4 years ago

👋

> get folded together when processing a history file

At one point you also mentioned "the longer an environment has been processing replication streams, the more it will appear to measure greater distances and larger counts. this is because it has seen more edits split across augmented diffs". You're referring here to the same behavior that you're referring to in your comment above, right?

CloudNiner commented 4 years ago

Some additional progress on this in the meantime. We're noticing that when Overpass has large changesets to publish, overpass-diff-publisher will visit an augmented diff sequence id before Overpass has finished processing it. This leads to empty or incomplete sequences that are never revisited (even if Overpass eventually describes those changes at that sequence id), and thus to changes that are never counted in OSMesa.

mojodna commented 4 years ago

> You're referring here to the same behavior that you're referring to in your comment above, right?

Yes, but this sounds worse.

This also sounds like known Overpass API behavior with a side of chaos: Overpass augmented diffs may (will) change over time (data received in the future may modify what's returned for individual sequences), as you're seeing. It's not 1:1 with OSM minutely diffs.

I was unaware of problems where it would return data even if it wasn't finished processing a diff. That's bad. An idea for a potential workaround: check whether it thinks it's at/past a sequence before requesting the augmented diff for that sequence (assuming that augmented_diff_status updates after the diff has been processed).
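A minimal sketch of that workaround, under the same assumption stated above (that `augmented_diff_status` only advances once a diff is fully processed). The function names are hypothetical, and the URLs assume the public overpass-api.de instance; a private Overpass deployment would use its own base URL. The status and diff fetchers are passed in so the gating logic can be exercised without network access.

```python
from urllib.request import urlopen

# Assumption: the public Overpass instance; swap in your own deployment's base URL.
OVERPASS = "https://overpass-api.de/api"


def http_status():
    """Fetch the sequence number Overpass reports it has reached."""
    with urlopen(f"{OVERPASS}/augmented_diff_status") as resp:
        return resp.read().decode()


def http_diff(sequence):
    """Fetch the augmented diff for one sequence id."""
    with urlopen(f"{OVERPASS}/augmented_diff?id={sequence}") as resp:
        return resp.read().decode()


def fetch_when_ready(sequence, get_status=http_status, get_diff=http_diff):
    """Return the diff for `sequence`, or None if Overpass has not reached it yet."""
    current = int(get_status())
    if current < sequence:
        # Not processed yet: the caller should retry this sequence later
        # instead of advancing past it.
        return None
    return get_diff(sequence)


# Offline usage with stubbed fetchers (no network calls):
print(fetch_when_ready(5, lambda: "4", lambda s: f"diff-{s}"))  # None
print(fetch_when_ready(5, lambda: "5", lambda s: f"diff-{s}"))  # diff-5
```

The important part is the `None` branch: the publisher holds its position and retries rather than recording the sequence as done. This does not address Overpass revising a sequence after it has already been published.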