Open hackergrrl opened 8 years ago
Also: if a way is forked and one fork is a delete, does the indexer correctly delete the reverse lookup? We should ensure we have a test for that.
I think we may be hitting this. I am seeing features appear in Mapeo/iD which do not actually exist in the database when you try to query them. I have a feeling this is because of the following scenario:
Can we write a test for this?
I wrote a test for this in osm-p2p-server (already had the scaffolding there), but what I see is the modified way being returned: https://github.com/digidem/osm-p2p-server/commit/265d2ab7ba251c1366482513e813a3fc1ef1245c
@gmaclennan I remember you describing another case involving overlapping bounding boxes, or similar. Could you describe that?
The bugs in Mapeo definitely seem related to queryStream lookups (from the api/0.6/map
endpoint). iD makes queries to this end point for tiles at zoom 16 (regardless of the UI zoom of the map). This end point returns nodes outside the bounding box if they are referenced by ways that have nodes inside the bounding box, is accordance with the spec
What I am guess is happening is that this lookup code (essentially, hyperlog-index) has a bug. My guess would be to look at a forked way like the one described above and do a bounding box query that only includes one of its nodes. We should carefully think through what should actually happen in this scenario, both when we are requesting forks and when we are not. What happens when one of the "forks" is a deleted way?
After much digging, I believe I have an explanation. Consider a way with the following DAG structure in osm-p2p-db:
83b1221b4be413a30478fbb0d5be06395bd0c55d6a760ef4cf564531b10eccc2
<- b9d60d6eb459b2e0fc693226c847b51995923c46c8fde6a20c3b94ff8579f929
<- 6117e6d75ddb56985a53f3e116d3b4c0da8f2a231bd861c97d98071f02146759
<- 86679afd770983ff67c529948ad5a053819ae1199a996d122e5927ba5d8aab10
<- 495ee12c58537aa44fef7bb872ccf6cfebe8e23b3b490817b2e0d98bd11fccba
<- e8c20112a059945fe8a3dec28972979b9b488cf62ab43c49556053e9fec4d450
<- a7accc4532bb53283c7e582eee9558c418591b4c05eb8892ec3f5c233b2c55bf
D dded60535063f8202e8ab7d549a78ce3ae787e39940e02846c818e2aba679f50
<- ce2a6bac3aff1587415add7e0028a7da1d7a35832416bcd25306121fbaed0471
Where 83b1221..
is the root, and the others are children, grandchildren, etc. In this tree, many edits happened to the root way, eventually culminating in a deletion of the way (and all of its nodes). However, elsewhere on another machine, another edit took place from an earlier version of the way (ce2a6b..
).
This results in the osm-p2p-db being in the following state:
osm.get(osm_id_of_the_way) => ce2a6bac3aff1587415..
osm.get(a_node_in_that_way) => null
What's happening is, after these forks exist in the same osm-p2p-db, the database considers the non-deleted way as real, but still thinks all of its nodes are deleted.
Yes, good find. When an element is deleted via the API we check whether it is referred to by any existing elements: https://github.com/digidem/osm-p2p-server/blob/master/lib/filter_deletes.js
Of course, as you point out, it could be referred to by forked elements on other peers.
I am not entirely sure about the best way to deal with this. I think if a way exists that refers to deleted nodes, we should return those nodes in the way.
I think osm.get()
should have a way of distinguishing between a node that does not exist and a node that has been deleted. The OSM API uses an attribute visible=false
for elements that have been deleted (http://wiki.openstreetmap.org/wiki/Elements#Common_attributes). We should perhaps use a similar convention, e.g. if an element exists but has been deleted osm.get()
should return the element but with the property visible: false
.
I think this is needed anyway for the filtering of forks to work correctly e.g. two forks of a point exist, and in Mapeo you would see the most recently edited. If you delete that fork, the de-forking code will return the non-deleted fork on the next request, so to the user it looks like the delete action failed.
I think the core problem is that ways refer to its nodes by their OsmID rather than their OsmVersion.
OsmID is a vague notion of "the latest version", which means if we don't select the latest version of every element we return (in the non-forking case), we'll always get edge cases with inconsistencies, even with the approaches you've mentioned.
If we had ways and relations refer to specific versions of OSM documents, we would have greater consistency. We could then choose ANY version of any way, and see what nodes it has, regardless of whether they were deleted in some other reality.
I liken our current model to git, except we have commits that refer to filenames instead of specific file version hashes: you couldn't possibly expect reasonable results.
Following that analogy, maybe that's how we ought to model our data: where each changeset is essentially an atomic commit, and you can only choose which specific changeset you want to treat as your HEAD. This means we inherent all of the consistency and soundness of git's model[1].
[1] But perhaps also the merge conflict pains as well.
Yes, the issue of ways and relations referring to nodes/members only by OsmID is a big problem, not just for osm-p2p but for anybody dealing with historic OSM data. I know the Mapbox data team has hit this in their need to review data changes. The workaround they use I think is to use the timestamps to reconstruct which version of nodes/members were referenced by a particular way/relation. This is obviously fragile and costly, especially in a p2p system.
I think we can add versions to the way/relations in a way that remains compatible with existing clients:
Some issues we would hit:
The version number of a way would need to change every time a node was changed. iD might need a patch to ensure it pulls down the updated way after only a node was updated.
For relations, this would make it hard to avoid forks: if we store version numbers on relations, any update to a member would need to update the version of the relation. This would mean we would not be able to use relations for long rivers to avoid forks - any edit to any segment of the river would create a new version of the relation.
I think treating changesets as commits introduces new problems. This would mean we would have forks at a higher level (changesets) which would be more difficult to resolve.
Here are some edge cases regarding ways and forks of them and their nodes that we ought to consider & test for, pulled from @gmaclennan in IRC:
1.
2.
3.