digidem / osm-p2p-db

Peer-to-peer database for OpenStreetMap data
BSD 2-Clause "Simplified" License
237 stars 25 forks source link

edge cases with forks #29

Open hackergrrl opened 8 years ago

hackergrrl commented 8 years ago

Here are some edge cases regarding ways and forks of them and their nodes that we ought to consider & test for, pulled from @gmaclennan in IRC:

1.

If a point is forked and one fork is inside the query and the other outside, does it just return the one inside? It probably should and probably does.

2.

Here's a reason for some ghost points: when there is a forked way, the query lookup I believe will return the points associated with both forks of the way. But the points arenM- forked, so we return them all, but we choose only one of the forks of the way to return.

3.

Also: if a way is forked and one fork is a delete, does the indexer correctly delete the reverse lookup? We should ensure we have a test for that.

gmaclennan commented 7 years ago

Also: if a way is forked and one fork is a delete, does the indexer correctly delete the reverse lookup? We should ensure we have a test for that.

I think we may be hitting this. I am seeing features appear in Mapeo/iD which do not actually exist in the database when you try to query them. I have a feeling this is because of the following scenario:

  1. Way A is created with points B, C, D
  2. User X edits A by changing an attribute --> A'
  3. User Y deletes A (not A')
  4. X and Y replicate data. I think the lookup code in queryStream may return the deleted way when any of points B, C, D are in the bbox.

Can we write a test for this?

hackergrrl commented 7 years ago

I wrote a test for this in osm-p2p-server (already had the scaffolding there), but what I see is the modified way being returned: https://github.com/digidem/osm-p2p-server/commit/265d2ab7ba251c1366482513e813a3fc1ef1245c

hackergrrl commented 7 years ago

@gmaclennan I remember you describing another case involving overlapping bounding boxes, or similar. Could you describe that?

gmaclennan commented 7 years ago

The bugs in Mapeo definitely seem related to queryStream lookups (from the api/0.6/map endpoint). iD makes queries to this end point for tiles at zoom 16 (regardless of the UI zoom of the map). This end point returns nodes outside the bounding box if they are referenced by ways that have nodes inside the bounding box, is accordance with the spec

What I am guess is happening is that this lookup code (essentially, hyperlog-index) has a bug. My guess would be to look at a forked way like the one described above and do a bounding box query that only includes one of its nodes. We should carefully think through what should actually happen in this scenario, both when we are requesting forks and when we are not. What happens when one of the "forks" is a deleted way?

hackergrrl commented 7 years ago

After much digging, I believe I have an explanation. Consider a way with the following DAG structure in osm-p2p-db:

83b1221b4be413a30478fbb0d5be06395bd0c55d6a760ef4cf564531b10eccc2
  <- b9d60d6eb459b2e0fc693226c847b51995923c46c8fde6a20c3b94ff8579f929
    <- 6117e6d75ddb56985a53f3e116d3b4c0da8f2a231bd861c97d98071f02146759
      <- 86679afd770983ff67c529948ad5a053819ae1199a996d122e5927ba5d8aab10
        <- 495ee12c58537aa44fef7bb872ccf6cfebe8e23b3b490817b2e0d98bd11fccba
          <- e8c20112a059945fe8a3dec28972979b9b488cf62ab43c49556053e9fec4d450
            <- a7accc4532bb53283c7e582eee9558c418591b4c05eb8892ec3f5c233b2c55bf
              D dded60535063f8202e8ab7d549a78ce3ae787e39940e02846c818e2aba679f50
    <- ce2a6bac3aff1587415add7e0028a7da1d7a35832416bcd25306121fbaed0471

Where 83b1221.. is the root, and the others are children, grandchildren, etc. In this tree, many edits happened to the root way, eventually culminating in a deletion of the way (and all of its nodes). However, elsewhere on another machine, another edit took place from an earlier version of the way (ce2a6b..).

This results in the osm-p2p-db being in the following state:

  1. osm.get(osm_id_of_the_way) => ce2a6bac3aff1587415..
  2. osm.get(a_node_in_that_way) => null

What's happening is, after these forks exist in the same osm-p2p-db, the database considers the non-deleted way as real, but still thinks all of its nodes are deleted.

gmaclennan commented 7 years ago

Yes, good find. When an element is deleted via the API we check whether it is referred to by any existing elements: https://github.com/digidem/osm-p2p-server/blob/master/lib/filter_deletes.js

Of course, as you point out, it could be referred to by forked elements on other peers.

I am not entirely sure about the best way to deal with this. I think if a way exists that refers to deleted nodes, we should return those nodes in the way.

I think osm.get() should have a way of distinguishing between a node that does not exist and a node that has been deleted. The OSM API uses an attribute visible=false for elements that have been deleted (http://wiki.openstreetmap.org/wiki/Elements#Common_attributes). We should perhaps use a similar convention, e.g. if an element exists but has been deleted osm.get() should return the element but with the property visible: false.

I think this is needed anyway for the filtering of forks to work correctly e.g. two forks of a point exist, and in Mapeo you would see the most recently edited. If you delete that fork, the de-forking code will return the non-deleted fork on the next request, so to the user it looks like the delete action failed.

hackergrrl commented 7 years ago

I think the core problem is that ways refer to its nodes by their OsmID rather than their OsmVersion.

OsmID is a vague notion of "the latest version", which means if we don't select the latest version of every element we return (in the non-forking case), we'll always get edge cases with inconsistencies, even with the approaches you've mentioned.

If we had ways and relations refer to specific versions of OSM documents, we would have greater consistency. We could then choose ANY version of any way, and see what nodes it has, regardless of whether they were deleted in some other reality.

I liken our current model to git, except we have commits that refer to filenames instead of specific file version hashes: you couldn't possibly expect reasonable results.

Following that analogy, maybe that's how we ought to model our data: where each changeset is essentially an atomic commit, and you can only choose which specific changeset you want to treat as your HEAD. This means we inherent all of the consistency and soundness of git's model[1].

[1] But perhaps also the merge conflict pains as well.

gmaclennan commented 7 years ago

Yes, the issue of ways and relations referring to nodes/members only by OsmID is a big problem, not just for osm-p2p but for anybody dealing with historic OSM data. I know the Mapbox data team has hit this in their need to review data changes. The workaround they use I think is to use the timestamps to reconstruct which version of nodes/members were referenced by a particular way/relation. This is obviously fragile and costly, especially in a p2p system.

I think we can add versions to the way/relations in a way that remains compatible with existing clients:

  1. Internally we should store both id and version of nodes within a way and members within a relation.
  2. We should prefer version when doing a lookup in the index, but fallback to id.
  3. When a non-p2p-aware client submits a change, use the version of the nodes / members from the changeset if present in the changeset, if not set the version by selecting the most recent fork using the same algorithm that would have been used to present the most recent fork to the client.

Some issues we would hit:

  1. iD editor does not include any unchanged nodes in a changeset if you change the tags on a way.
  2. If you move a node in a way, the way itself is not included in the changeset.

The version number of a way would need to change every time a node was changed. iD might need a patch to ensure it pulls down the updated way after only a node was updated.

For relations, this would make it hard to avoid forks: if we store version numbers on relations, any update to a member would need to update the version of the relation. This would mean we would not be able to use relations for long rivers to avoid forks - any edit to any segment of the river would create a new version of the relation.

gmaclennan commented 7 years ago

I think treating changesets as commits introduces new problems. This would mean we would have forks at a higher level (changesets) which would be more difficult to resolve.

gmaclennan commented 7 years ago

See also: https://github.com/digidem/osm-p2p-db/issues/21