earthstar-project / earthstar

Storage for private, distributed, offline-first applications.
https://earthstar-project.org
GNU Lesser General Public License v3.0

How to prevent infinite spread of data across peers? #34

Closed cinnamon-bun closed 2 years ago

cinnamon-bun commented 4 years ago

Background

In SSB, data is limited to spread N hops across the network of peers. This is tracked using the social graph (following).

In Earthstar we can't count the number of hops because we don't have a social graph (no following mechanism, yet). Instead we have two classes of peer: users and pubs. Pubs are unattended peers that are not closely associated with a single user.

Pubs are passive buckets for users to put or get data from. The only way for data to spread from pub to pub is via a user who syncs with both pubs.

Users can also sync directly with each other.

So data zigzags between users and pubs as it spreads:

    pub      pub     pub
   /   \    /   \   /
user    user --- user

Problem

A workspace's data could spread widely across pubs and users, far beyond the people who are actually using it.

There are a few ways a workspace could get onto a new peer, and no ways for workspaces to get forgotten by a peer.

How to limit the spread

Any 2 peers should be allowed to sync a workspace if they both know its address. This problem is all about discovering and adding new workspaces.

The sync protocol could:

Users should:

Pubs could:

Do we need harder rules to limit the spread? E.g. a workspace's data could somehow include an allowlist of pubs that are allowed to host it, and we hope that all peers will respect that list and not spread it further?

sgwilym commented 4 years ago

and we hope that all peers will respect that list and not spread it further?

Would I be right in thinking that without resorting to encryption, all that can be done is nudge client authors and pub operators towards these principles with defaults and docs and hope it works out?

Also, maybe this was an idea implicit in pubs and workspaces being embedded within a single address, but: there's a huge hidden UX benefit to embedding preferred pubs within a workspace: it defers needing to learn about syncing and pubs until the user wants. With good silent syncing behaviour, you could make an app work close to the expectations of a traditional web app.

And yeah, if that was the point, then bravo 😄

cinnamon-bun commented 4 years ago

@sgwilym

Would I be right in thinking that without resorting to encryption, all that can be done is nudge client authors and pub operators towards these principles with defaults and docs and hope it works out?

I think that's right. I couldn't think of a way to enforce this limited spread using encryption, so we have to rely on humans doing the right thing. :/

Encryption of data will limit the damage caused by data spreading too far. Right now apps can encrypt the document content however they like but the paths are still exposed. I have ideas about more integrated encryption.


pubs and workspaces being embedded within a single address

Yes! It's too much work to give someone a workspace address AND pub address(es) when you invite them.

Merging one pub with one workspace makes it easy to copy-paste and share:

https://mypub.com/+gardening.xxxxxx

But I really want to have a couple of pubs there. It's important for redundancy / reliability, and to prevent one pub from getting a monopoly over a workspace.

How can we include multiple pubs in this string? We need an "invite format". It should also include a workspace's shared private key for invite-only workspaces (which are not implemented yet).

RangerMauve commented 4 years ago

What about specifying the pubs as args in the querystring portion of the URL?
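
A minimal sketch of what the querystring idea could look like, assuming a hypothetical invite format in which the first pub is the URL's origin, the workspace address is the path, and extra pubs are repeated `pub` parameters. The parameter name, the `Invite` type and `parseInvite` are illustrative assumptions, not an agreed Earthstar format.

```typescript
// Hypothetical invite shape: one workspace plus every pub that hosts it.
interface Invite {
  workspace: string; // e.g. "+gardening.xxxxxx"
  pubs: string[]; // the URL's own origin first, then any extra pubs
}

// Parse an invite URL of the assumed form
//   https://<first pub>/<workspace>?pub=<extra pub>&pub=<extra pub>
function parseInvite(invite: string): Invite {
  const url = new URL(invite);
  // The path looks like "/+gardening.xxxxxx"; drop the leading slash.
  const workspace = url.pathname.slice(1);
  if (!workspace.startsWith("+")) {
    throw new Error("invite does not contain a workspace address");
  }
  // Repeated "pub" parameters carry the additional pubs (assumed name).
  const extraPubs = url.searchParams.getAll("pub");
  return { workspace, pubs: [url.origin, ...extraPubs] };
}

const invite = parseInvite(
  "https://mypub.com/+gardening.xxxxxx?pub=https://pub2.example&pub=https://pub3.example",
);
console.log(invite.workspace, invite.pubs);
```

A single copy-pasteable string then still works with one pub, and more pubs can be appended for redundancy without changing the base format.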

cinnamon-bun commented 4 years ago

I started a new issue for Invite strings.

RangerMauve commented 4 years ago

Questions: Are peers going to be downloading data for workspaces they don't explicitly know about / are they going to be uploading things they don't explicitly know about to pubs?

cinnamon-bun commented 4 years ago

Neither peers nor pubs should hold data for workspaces they don't explicitly know about. That's the goal, anyway. No accidental workspace hosting.

Peers can choose what they upload. By default it will be "everything from the workspaces I have", but you could narrow it down by setting a sync filter.

For example, maybe you only want to upload documents you authored yourself, or only from people you trust, or not blocked people, etc.
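
As a rough illustration of the sync filter idea, the filters described above can be modelled as predicates over documents. The `Doc` and `SyncFilter` types and the helper names are hypothetical and are not Earthstar's actual API.

```typescript
// Hypothetical document shape; real Earthstar documents have more fields.
interface Doc {
  path: string;
  author: string;
  content: string;
}

// A sync filter decides, per document, whether it may be uploaded.
type SyncFilter = (doc: Doc) => boolean;

// Default behaviour: everything from the workspaces I have.
const everything: SyncFilter = () => true;

// Only documents I authored myself.
const onlyAuthoredBy = (me: string): SyncFilter => (doc) => doc.author === me;

// Only documents from people I trust.
const onlyTrusted = (trusted: Set<string>): SyncFilter => (doc) =>
  trusted.has(doc.author);

// Everything except documents from blocked people.
const notBlocked = (blocked: Set<string>): SyncFilter => (doc) =>
  !blocked.has(doc.author);

// Select the documents that would be uploaded under a given filter.
function docsToUpload(docs: Doc[], filter: SyncFilter = everything): Doc[] {
  return docs.filter(filter);
}

const docs: Doc[] = [
  { path: "/notes/1", author: "@alice", content: "tomatoes" },
  { path: "/notes/2", author: "@bob", content: "basil" },
  { path: "/notes/3", author: "@mallory", content: "spam" },
];

const mineOnly = docsToUpload(docs, onlyAuthoredBy("@alice"));
const unblocked = docsToUpload(docs, notBlocked(new Set(["@mallory"])));
console.log(mineOnly.length, unblocked.length);
```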

RangerMauve commented 4 years ago

I really like "Make it impossible to enumerate the workspaces held by the other side" as a starting point. That's what hypercore-protocol does and it leads to some nice guarantees of privacy.

I could see it making data less resilient though if fewer people are downloading it.
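
One way to picture non-enumerable workspaces is a salted-hash comparison: each side sends only hashes of its workspace addresses, salted per session, and a peer can recognise only the hashes of addresses it already knows. This is a sketch under assumptions, not a description of what hypercore-protocol or Earthstar actually does. The function names are hypothetical, and a peer could still test guesses for short or predictable addresses.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hash a workspace address together with a per-session salt, so hashes
// from one session cannot be replayed or correlated in another.
function saltedHash(salt: string, workspace: string): string {
  return createHash("sha256").update(salt).update(workspace).digest("hex");
}

// What a peer sends over the wire: hashes only, never the addresses.
function advertise(salt: string, mine: string[]): string[] {
  return mine.map((workspace) => saltedHash(salt, workspace));
}

// A peer learns only which of its OWN workspaces the other side also holds.
function commonWorkspaces(
  salt: string,
  mine: string[],
  theirHashes: string[],
): string[] {
  const theirs = new Set(theirHashes);
  return mine.filter((workspace) => theirs.has(saltedHash(salt, workspace)));
}

// Assumed to be agreed on by both peers at the start of the session.
const salt = randomBytes(16).toString("hex");

const alice = ["+gardening.xxxxxx", "+sailing.yyyyyy"];
const bob = ["+gardening.xxxxxx", "+cooking.zzzzzz"];

// Alice sees Bob's hashes and recovers only the workspace they share.
const shared = commonWorkspaces(salt, alice, advertise(salt, bob));
console.log(shared);
```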

One thing this makes me think about is having a separation between pubs and regular peers and what sort of structures that could lead to. Personally, I'd be happy to not need pubs in most situations and have them as a last resort. Do you have thoughts on the matter? I could also see how having lots of pubs and saving data between pubs could be useful.

I like the idea you brought up of pubs hosting data from only users they trust. I'm not sure how friends would work in this case, and whether there could be some sort of allowlist for replication that works for both people and pubs in the same way. Like, maybe I want to replicate with everyone, or only with an allowlist of people. And maybe that same mechanism could exist on a pub?
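
A small sketch of how one replication policy could serve both people and pubs, assuming peers are identified by opaque strings. The `ReplicationPolicy` type and `mayReplicateWith` are hypothetical names used for illustration only.

```typescript
// The same policy type can be configured on a user's client or on a pub.
type ReplicationPolicy =
  | { mode: "everyone" }
  | { mode: "allowlist"; allowed: ReadonlySet<string> };

// Decide whether to replicate with a given peer (a person or a pub).
function mayReplicateWith(policy: ReplicationPolicy, peerId: string): boolean {
  if (policy.mode === "everyone") {
    return true;
  }
  return policy.allowed.has(peerId);
}

// A user who syncs with anyone, and a pub that only serves known peers.
const openUser: ReplicationPolicy = { mode: "everyone" };
const friendsOnlyPub: ReplicationPolicy = {
  mode: "allowlist",
  allowed: new Set(["@alice", "@bob", "https://mypub.com"]),
};

console.log(mayReplicateWith(openUser, "@stranger"));
console.log(mayReplicateWith(friendsOnlyPub, "@stranger"));
```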

cinnamon-bun commented 4 years ago

It depends on the use-case we're aiming for. Here are 3 and how they fit my personal goals for Earthstar:

They have different connection needs:

My principle is "keep the infrastructure close to home", e.g. you or your friends control the pubs that you use. Avoid pubs run by strangers.

So


Re: replicating with an allowlist of people. Yes, I'm hoping people will add lots of options to limit syncing in various ways. That's pretty easy to do since Earthstar is so flexible about what documents you sync.

sgwilym commented 2 years ago

I think this has been addressed by peers only being able to sync shares they already know about at the time of initiating sync.