How to handle deletions of documents referenced by others in a distributed system?

gmaclennan commented 7 years ago

The OSM API does a check when you attempt to delete a node/way/relation to ensure that it is not referenced by any other documents, and if it is referenced it returns a 412 error: http://wiki.openstreetmap.org/wiki/API_v0.6#Delete:_DELETE_.2Fapi.2F0.6.2F.5Bnode.7Cway.7Crelation.5D.2F.23id

The changeset upload API has an option if-unused that silently ignores deletions of elements used by others: http://wiki.openstreetmap.org/wiki/API_v0.6#Diff_upload:_POST_.2Fapi.2F0.6.2Fchangeset.2F.23id.2Fupload

In a p2p distributed system a document could be unused on the current peer, but in-use on another peer.

E.g. Clients α and β have a way A with nodes B,C,D. Client α deletes the way, and deletes the nodes B,C,D since they are not used by any document in the db on α. Client β modifies a tag on the way, but does not modify any of the nodes B,C,D. After replication way A is forked, but nodes B,C,D have a single head which is a deletion. The undeleted fork of A now points to deleted nodes.

gmaclennan commented 7 years ago

One way (only way?) to solve this is by storing versions on document references e.g.

{
  "type": "way",
  "id": 1,
  "nodes": [{
    "id": 10,
    "version": "abcde"
  }, {
    "id": 11,
    "version": "fghijk"
  }, {
    "id": 12,
    "version": "lmnopq"
  }],
  "tags": {
    "highway": "tertiary",
    "name": "Main Street"
  }
}

We could only store versions if we like, since in osm-p2p-db versions are unique. This would mean that deletions could always happen without an "if-unused" check, since the deletion doc would be a new version, and therefore not used by any other document.

A side-effect of this given the case of the forked/deleted way about is that the non-deleted fork would be referencing nodes which are deleted at HEAD, how would they be surfaced in the kdb query and querystream()?

gmaclennan commented 7 years ago

Our current proposed way of surfacing deletions keeps deleted nodes in the kdb index for the locations of all the deleted node's parents. For deleted nodes returned from the kdb index we would need to find any ways that reference the id, rather than reference the version, and then follow that chain to get the nodes referenced by those ways based on the version of that reference. This might require an additional lookup if we wanted to be strict about what appears in the bbox e.g.

in the case above, the node B is not forked, and not modified by the deletion or modification of the way. If bbox query included where B was before it was deleted, the kdb index should return the deleted record. We would then need to lookup in the join index to see which ways reference B (the id B not the version of the deleted record). We would then get the two forks of the way A - one deleted and one not - and we would need to lookup the nodes referred to by these ways by their version (not their id). There is an edge-case though where one of the forks of the way may contain nodes which are all outside of the bbox. We would need to do a bbox check on each of these lookups if we wanted to keep the API strict.

hackergrrl commented 7 years ago

A side-effect of this given the case of the forked/deleted way about is that the non-deleted fork would be referencing nodes which are deleted at HEAD, how would they be surfaced in the kdb query and querystream()?

I wonder if the following is a good predicate for deciding if a node is deleted or not:

"A node is considered deleted iff there is a deletion tombstone for it that is a HEAD, and there are only deleted ways on HEAD that refer to it."

Thinking on how this would affect kdb indexing:

if the indexer sees a non-deleted node, it indexes it
if the indexer sees a deleted node, it looks at all ways on HEAD that refer to it. if 1+ are non-deleted, the node is indexed

This would mean that downstream deforking would need to understand this definition of deleted, and apply it on nodes as well to decide which nodes are deleted or not, based on the way fork it chooses.

digidem / osm-p2p-db

How to handle deletions of documents referenced by others in a distributed system? #48