earthstar-project / earthstar

Storage for private, distributed, offline-first applications.
https://earthstar-project.org
GNU Lesser General Public License v3.0

Deleting more metadata from docs -- a half-baked idea #75

Open cinnamon-bun opened 3 years ago

cinnamon-bun commented 3 years ago

Note: I'm not recommending this because it would increase the code complexity significantly. It needs more thought.

What's the problem you want solved?

Deleting a doc is done by just overwriting a doc with empty content. This leaves behind metadata:

{
  "author": "@suzy.bjzee56v2hd6mv5r5ar3xqg3x3oyugf7fejpxnvgquxcubov4rntq",
  "content": "",   <----- now an empty string
  "contentHash": "bt3u7gxpvbrsztsm4ndq3ffwlrtnwgtrctlq4352onab2oys56vhq",
  "format": "es.4",
  "path": "/wiki/shared/Flowers",
  "signature": "bjljalsg2mulkut56anrteaejvrrtnjlrwfvswiqsi2psero22qqw7am34z3u3xcw7nx6mha42isfuzae5xda3armky5clrqrewrhgca",
  "timestamp": 1597026338596000,
  "workspace": "+gardening.friends",
}

The worst metadata is author, timestamp, and path. Taken together those might reveal a lot about your activity on Earthstar.

Is there a solution you'd like to recommend?

Easiest answer

First of all, whenever possible, app developers should use meaningless paths such as UUIDs instead of, for example, using a wiki page title as a path.
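A minimal sketch of what that could look like in an app, assuming a Node-style runtime where `randomUUID` from `node:crypto` is available. The helper name `opaqueWikiPath` and the `/wiki/shared/` prefix are just illustrations, not part of Earthstar:

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: build a path that reveals nothing about the page title.
// The human-readable title would live inside the document content instead,
// so it disappears when the content is overwritten with an empty string.
function opaqueWikiPath(): string {
  return `/wiki/shared/${randomUUID()}`;
}

const path = opaqueWikiPath();
console.log(path); // e.g. /wiki/shared/<random uuid>
```

The app then needs its own index from titles to paths, which is the price of keeping titles out of the metadata.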

Fancy answer

The trouble is that we pretty much use the combination of (path, author, timestamp) as a primary key to identify documents, so it's hard to redact that information. And we need the author in order to have a signature.

We could, however, hash the path in tombstones to obscure it.

Add a new document format called a Tombstone which looks like this:

{
  "author": "@suzy.bjzee56v2hd6mv5r5ar3xqg3x3oyugf7fejpxnvgquxcubov4rntq",
  "format": "es.4",
  "pathHash": "oefinhalowei8helf231dh",
  "signature": "bjljalsg2mulkut56anrteaejvrrtnjlrwfvswiqsi2psero22qqw7am34z3u3xcw7nx6mha42isfuzae5xda3armky5clrqrewrhgca",
  "timestamp": 1597026338596000,
  "workspace": "+gardening.friends",
}

We're storing the hash of the path instead of the path itself. This document uses the regular rules to overwrite the original based on highest timestamp, but is matched with the original taking the path hashing into account.

The rules are also changed so that a tombstone can't overwrite other authors' documents. That frees the timestamp to be set to anything between the original document's timestamp and now, without the risk of messing up a more recent document from another author.
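Here's a rough sketch of how that matching rule might be checked. It is only an illustration: the `Doc` and `Tombstone` types, `hashPath`, and `tombstoneDeletes` are made-up names, signatures are left out, and SHA-256 in hex is an assumption, since nothing above says which hash or encoding to use.

```typescript
import { createHash } from "node:crypto";

interface Doc {
  author: string;
  path: string;
  timestamp: number; // microseconds, as in the examples above
  workspace: string;
}

interface Tombstone {
  author: string;
  pathHash: string;
  timestamp: number;
  workspace: string;
}

// Assumption: SHA-256 in hex. The proposal does not pick a hash or encoding.
function hashPath(path: string): string {
  return createHash("sha256").update(path, "utf8").digest("hex");
}

// True if the tombstone should overwrite the given document.
function tombstoneDeletes(tomb: Tombstone, doc: Doc): boolean {
  return (
    tomb.workspace === doc.workspace &&
    tomb.author === doc.author && // never touches other authors' documents
    tomb.pathHash === hashPath(doc.path) && // matched by hash, not by path
    tomb.timestamp > doc.timestamp // the usual "highest timestamp wins" rule
  );
}

const doc: Doc = {
  author: "@suzy.bjzee56v2hd6mv5r5ar3xqg3x3oyugf7fejpxnvgquxcubov4rntq",
  path: "/wiki/shared/Flowers",
  timestamp: 1597026338596000,
  workspace: "+gardening.friends",
};
const tomb: Tombstone = {
  author: doc.author,
  pathHash: hashPath(doc.path),
  timestamp: doc.timestamp + 1,
  workspace: doc.workspace,
};

console.log(tombstoneDeletes(tomb, doc)); // true
console.log(tombstoneDeletes({ ...tomb, author: "@other" }, doc)); // false
```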

Can we get rid of timestamps?

I don't think so...

If you get 3 tombstones for the same path and author, but different timestamps, you only want to keep the latest one or you'll accumulate junk. And if a new document is written on top of the tombstone, you want the new document to win.

For those 2 reasons, tombstones need timestamps that are compatible with document timestamps -- tombstones have to be strictly ordered among themselves, and strictly ordered relative to regular documents (for same author, same path). So I think we need to use regular timestamps on tombstones.

Can we get rid of the author?

I don't think so. I think we need it to verify the document signature.

Benefits

Deleted documents no longer have meaningful paths.

Downsides

Deleted documents still have an author and timestamp. At least the timestamp has a wider possible range it could be randomly set to when deleting (any time between the original document and now).
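To make the "wider possible range" concrete, here's a small sketch of picking a tombstone timestamp at random. The function name is hypothetical, and `Math.random()` is used only for brevity; it is not a cryptographically secure source, so a real implementation would probably want something stronger.

```typescript
// Hypothetical helper: pick a tombstone timestamp strictly after the original
// document's timestamp and no later than `now` (all values in microseconds).
function randomTombstoneTimestamp(originalTimestamp: number, now: number): number {
  if (now <= originalTimestamp) {
    throw new Error("now must be later than the original timestamp");
  }
  const span = now - originalTimestamp; // number of candidate values
  // Math.random() is in [0, 1), so the offset lands in [1, span].
  const offset = 1 + Math.floor(Math.random() * span);
  return originalTimestamp + offset;
}

const original = 1597026338596000;
const now = original + 60 * 60 * 1000 * 1000; // one hour later, in microseconds
const ts = randomTombstoneTimestamp(original, now);
console.log(ts > original && ts <= now); // true
```

The tombstone still has to be newer than the document it replaces, which is why the lower bound is exclusive.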

If you can guess a path, you can hash your guessed path and look for a tombstone, letting you know if the document ever existed. We may need to salt the hashes, but I'm not sure with what. All this logic needs to work on pubs that have no special information or secret keys, so what can we salt it with?
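The guessing attack is cheap to carry out. A sketch, again assuming unsalted SHA-256 in hex (an assumption, not something the proposal specifies) and using made-up names `hashPath` and `findDeletedPaths`:

```typescript
import { createHash } from "node:crypto";

// Assumption: unsalted SHA-256 in hex.
function hashPath(path: string): string {
  return createHash("sha256").update(path, "utf8").digest("hex");
}

// Returns the guessed paths whose hash appears among the tombstone hashes.
function findDeletedPaths(guesses: string[], tombstoneHashes: Set<string>): string[] {
  return guesses.filter((guess) => tombstoneHashes.has(hashPath(guess)));
}

// What anyone who can read the workspace would see: only the hashes.
const tombstoneHashes = new Set([hashPath("/wiki/shared/Flowers")]);

const hits = findDeletedPaths(
  ["/wiki/shared/Trees", "/wiki/shared/Flowers"],
  tombstoneHashes,
);
console.log(hits); // [ '/wiki/shared/Flowers' ]
```

Anyone with a list of likely page titles can run this against every tombstone they receive, which is why an unsalted hash only protects paths that are hard to guess.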

"Sync queries" (selective sync) will no longer work since they can't search for tombstones by path. Syncing will have to always send every tombstone as well as the regular documents that match the sync query.

Code complexity

There would now be 2 types of documents with different keys. Makes SQL storage more complicated.

Tombstones wouldn't show up in regular queries, but sometimes we need them to (for syncing especially), so there would have to be a new query option, includeTombstones. It would include ALL tombstones; you couldn't filter them by path.
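A sketch of how such a query option might behave over an in-memory list. Everything here is hypothetical: the `kind` field, the `QueryOpts` shape, and the `pathPrefix` filter are stand-ins for illustration, not Earthstar's actual query API.

```typescript
interface Doc {
  kind: "doc";
  author: string;
  path: string;
  timestamp: number;
}

interface Tombstone {
  kind: "tombstone";
  author: string;
  pathHash: string;
  timestamp: number;
}

type Entry = Doc | Tombstone;

interface QueryOpts {
  pathPrefix?: string; // only applies to regular documents
  includeTombstones?: boolean; // the new option described above
}

function query(entries: Entry[], opts: QueryOpts = {}): Entry[] {
  return entries.filter((entry) => {
    if (entry.kind === "tombstone") {
      // Tombstones carry no path, so they are all-or-nothing.
      return opts.includeTombstones === true;
    }
    return opts.pathPrefix === undefined || entry.path.startsWith(opts.pathPrefix);
  });
}

const entries: Entry[] = [
  { kind: "doc", author: "@suzy", path: "/wiki/shared/Trees", timestamp: 1 },
  { kind: "doc", author: "@suzy", path: "/chat/1", timestamp: 2 },
  { kind: "tombstone", author: "@suzy", pathHash: "oefinhalowei8helf231dh", timestamp: 3 },
];

console.log(query(entries, { pathPrefix: "/wiki/" }).length); // 1
console.log(query(entries, { pathPrefix: "/wiki/", includeTombstones: true }).length); // 2
```

A sync query would run with `includeTombstones: true`, so the tombstone is sent even though nobody can tell whether it belongs under `/wiki/`.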

The logic for ingesting a document would get a special case to handle incoming tombstones -- they should delete the targeted document in a slightly different way than a regular document overwrite.
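Roughly, the ingest special case might look like the sketch below. It is a toy in-memory store for a single workspace with signatures left out. The class, its method names, and the choice to key both maps by author plus path hash are all assumptions for illustration, as is SHA-256 in hex.

```typescript
import { createHash } from "node:crypto";

interface Doc {
  author: string;
  path: string;
  content: string;
  timestamp: number;
}

interface Tombstone {
  author: string;
  pathHash: string;
  timestamp: number;
}

// Assumption: SHA-256 in hex; the proposal does not pick a hash.
function hashPath(path: string): string {
  return createHash("sha256").update(path, "utf8").digest("hex");
}

class ToyStore {
  // Both maps are keyed by `${author}|${pathHash}` so that a document and
  // its tombstone compete for the same slot.
  docs = new Map<string, Doc>();
  tombstones = new Map<string, Tombstone>();

  private newestTimestamp(key: string): number {
    const docTs = this.docs.get(key)?.timestamp ?? -1;
    const tombTs = this.tombstones.get(key)?.timestamp ?? -1;
    return Math.max(docTs, tombTs);
  }

  ingestDoc(doc: Doc): boolean {
    const key = `${doc.author}|${hashPath(doc.path)}`;
    if (doc.timestamp <= this.newestTimestamp(key)) return false; // stale
    this.tombstones.delete(key); // a newer document wins over the tombstone
    this.docs.set(key, doc);
    return true;
  }

  ingestTombstone(tomb: Tombstone): boolean {
    const key = `${tomb.author}|${tomb.pathHash}`;
    if (tomb.timestamp <= this.newestTimestamp(key)) return false; // stale
    this.docs.delete(key); // special case: drop the target, path and all
    this.tombstones.set(key, tomb); // only the latest tombstone is kept
    return true;
  }
}

const store = new ToyStore();
const author = "@suzy.bjzee56v2hd6mv5r5ar3xqg3x3oyugf7fejpxnvgquxcubov4rntq";
const path = "/wiki/shared/Flowers";

store.ingestDoc({ author, path, content: "Roses", timestamp: 100 });
store.ingestTombstone({ author, pathHash: hashPath(path), timestamp: 200 });
console.log(store.docs.size, store.tombstones.size); // 0 1

console.log(store.ingestTombstone({ author, pathHash: hashPath(path), timestamp: 150 })); // false

store.ingestDoc({ author, path, content: "Tulips", timestamp: 300 });
console.log(store.docs.size, store.tombstones.size); // 1 0
```

Because the key includes the author, a tombstone from one author never removes another author's document at the same path.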

I think regular document lookups and queries could ignore tombstones so those wouldn't need any extra DB lookups.

garbados commented 2 years ago

> If you can guess a path, you can hash your guessed path and look for a tombstone

Would it make sense to encrypt the hash using the author's keys? That would make it unguessable but still verifiable.