earthstar-project / earthstar-graphql

Query, sync and set data to earthstar workspaces using GraphQL.
GNU Affero General Public License v3.0

Add docs for syncing in README #3

Closed cinnamon-bun closed 4 years ago

cinnamon-bun commented 4 years ago

The docs don't describe how to sync, specify pub URLs, etc.

(Hm, will an earthstar-graphql instance be able to sync with another earthstar-graphql instance?)

sgwilym commented 4 years ago

Oh right, I'll update the README to say that syncing is triggered through the GraphQL API itself. Once you know that, the playground has docs for syncing with all the details.

As for earthstar-graphql instances being able to sync with each other... I've not been sure about what to do here! If that were the case, an instance would be a complete pub with a GraphQL endpoint, right?

I'm a little hesitant about positioning this as something that should be deployed online and publicly accessible (I imagined it being embedded / running locally as the first choice), because it would mean I'd have to focus on hardening the API against certain attacks (e.g. it'd be really easy to make a malicious query complex enough to take the instance down as things stand).

Maybe because of the focus on pubs being personal and non-discoverable, it wouldn't be too much work to make this 'good enough', though?

cinnamon-bun commented 4 years ago

Hmm, yeah, do you mean GraphQL queries can be complex or earthstar queries can be complex? A worst-case earthstar query might have to scan every document but it shouldn't be O(n^2) or anything. But maybe it's hard to make GraphQL safe against DDoS attacks?

More detail about syncing

Earthstar peers can be servers or clients or both. Servers respond to queries, clients request queries.

Syncing is a conversation between a client (who drives the conversation and tries to keep the sync efficient) and a server (who just answers questions). This is asymmetrical to make it fit better in an HTTP paradigm (vs. a duplex stream paradigm like SSB and hypercore).

Servers

Clients

For two servers to sync with each other, one of them has to act as a client -- it needs some extra code to drive the sync conversation. E.g. you'd somehow ask GraphQL server A to start a long-running background process that talks as a client to GraphQL server B.

Easy, slow sync

Syncing is really basic right now. This happens in earthstar's sync.ts:

// client pull
Client: GET all your documents
Server: [doc1, doc2, doc3]

// client push
Client: POST hey, here's all my documents: [doc1, doc2, doc3, doc4]
Server: ok
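
As a rough illustration of the exchange above, here is a minimal in-memory sketch. It is not the code in sync.ts: the `Doc` shape, the `Peer` class and `naiveSync` are hypothetical names, and the "newest version per author per path wins" rule is an assumption made for this toy model.

```typescript
// Toy document shape (assumption): real Earthstar documents carry more fields.
interface Doc {
  path: string;
  author: string;
  content: string;
  timestamp: number;
}

// In-memory stand-in for either side of the conversation.
class Peer {
  private docs = new Map<string, Doc>();

  // Store a document unless an equally new or newer version from the same
  // author at the same path is already held. Returns true if it was stored.
  ingest(doc: Doc): boolean {
    const key = `${doc.path}|${doc.author}`;
    const existing = this.docs.get(key);
    if (existing !== undefined && existing.timestamp >= doc.timestamp) {
      return false;
    }
    this.docs.set(key, doc);
    return true;
  }

  allDocs(): Doc[] {
    return Array.from(this.docs.values());
  }
}

// The client drives both halves; the server only answers.
function naiveSync(client: Peer, server: Peer): { pulled: number; pushed: number } {
  let pulled = 0;
  let pushed = 0;
  // client pull: "GET all your documents"
  for (const doc of server.allDocs()) {
    if (client.ingest(doc)) pulled++;
  }
  // client push: "POST here's all my documents"
  for (const doc of client.allDocs()) {
    if (server.ingest(doc)) pushed++;
  }
  return { pulled, pushed };
}
```

Every run ships every document in both directions, even when nothing has changed, which is why this is the slow variant.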

Efficient sync (not implemented yet)

This also adds the concept of "replication queries", where each side can express what data it wants to have. Maybe a peer only wants wiki documents, or recent documents.

// client pull
Client: GET hashes of all your documents that match my replication query `{pathPrefix: "/chess"}`
Server: [hash1, hash2, hash3]

Client: GET I don't have [hash2, hash3] yet, give me those.
Server: [doc2, doc3]

// client push
Client: GET What do you want?
Server: My replication query is `{pathPrefix: "/wiki"}`

Client: POST I have [hash1, hash2, hash3], what do you need?
Server: I need [hash1, hash2]

Client: POST [doc1, doc2]
Server: ok thanks
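
Since this isn't implemented yet, the following is only a sketch of how the hash exchange could look with in-memory peers. `HashPeer`, `transfer` and `efficientSync` are hypothetical names, and `hashOf` is a stand-in for a real content hash.

```typescript
interface Doc {
  path: string;
  content: string;
}

interface ReplicationQuery {
  pathPrefix: string;
}

// Stand-in for a real content hash (assumption, for illustration only).
function hashOf(doc: Doc): string {
  return `${doc.path}#${doc.content}`;
}

class HashPeer {
  readonly wants: ReplicationQuery;
  private docs = new Map<string, Doc>(); // keyed by hash

  constructor(wants: ReplicationQuery) {
    this.wants = wants;
  }

  add(doc: Doc): void {
    this.docs.set(hashOf(doc), doc);
  }

  // Hashes of held documents that match the other side's replication query.
  hashesMatching(query: ReplicationQuery): string[] {
    const hashes: string[] = [];
    this.docs.forEach((doc, hash) => {
      if (doc.path.startsWith(query.pathPrefix)) hashes.push(hash);
    });
    return hashes;
  }

  // Of the offered hashes, the ones this peer does not hold yet.
  missing(offered: string[]): string[] {
    return offered.filter((hash) => !this.docs.has(hash));
  }

  docsFor(hashes: string[]): Doc[] {
    const found: Doc[] = [];
    for (const hash of hashes) {
      const doc = this.docs.get(hash);
      if (doc !== undefined) found.push(doc);
    }
    return found;
  }
}

// Move only the documents `to` wants and lacks; returns how many were sent.
function transfer(from: HashPeer, to: HashPeer): number {
  const offered = from.hashesMatching(to.wants); // hashes matching the receiver's query
  const needed = to.missing(offered);            // "I don't have these yet"
  const docs = from.docsFor(needed);
  docs.forEach((doc) => to.add(doc));
  return docs.length;
}

function efficientSync(client: HashPeer, server: HashPeer): { pulled: number; pushed: number } {
  const pulled = transfer(server, client); // client pull
  const pushed = transfer(client, server); // client push
  return { pulled, pushed };
}
```

In this model only hashes cross the wire until a side confirms it lacks a document, and documents outside the receiver's replication query are never offered at all.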

sgwilym commented 4 years ago

@cinnamon-bun Regarding complex queries, you can write something like this:

{
  workspaces {
    documents {
      authors {
        workspaces {
          documents {
            # you get the idea
          }
        }
      }
    }
  }
}

And unless you have some depth limiting or complexity analysis, the schema will dutifully resolve every item for each level of this query.
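
To make "depth limiting" concrete, here is a deliberately crude, dependency-free sketch that measures brace nesting in the raw query text. `maxSelectionDepth` and `isTooDeep` are hypothetical helpers, not part of earthstar-graphql. A real server would more likely walk the parsed query AST, since this version does not follow fragment spreads and also counts braces of input-object arguments.

```typescript
// Returns the deepest level of `{ ... }` nesting in a GraphQL query string,
// ignoring braces that appear inside string literals or # comments.
function maxSelectionDepth(query: string): number {
  let depth = 0;
  let max = 0;
  let inString = false;
  let inComment = false;
  for (let i = 0; i < query.length; i++) {
    const ch = query.charAt(i);
    if (inComment) {
      if (ch === "\n") inComment = false; // comments run to end of line
      continue;
    }
    if (inString) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === "#") {
      inComment = true;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
      if (depth > max) max = depth;
    } else if (ch === "}") {
      depth--;
    }
  }
  return max;
}

// Reject a query before executing it if it nests deeper than `limit`.
function isTooDeep(query: string, limit: number): boolean {
  return maxSelectionDepth(query) > limit;
}
```

With a check like this in front of the resolver, the nested query above could be refused up front instead of being resolved level by level.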

sgwilym commented 4 years ago

@cinnamon-bun It's true that earthstar-graphql lets you query for Earthstar data and returns a response, but I think because it doesn't do that within the context of a sync operation, this package could be considered a client within the Earthstar ecosystem?

Even though one of the main exports for this is an HTTP server, the intention for this package is that it's deployed locally and acts as a client's Earthstar 'engine': get me a list of my workspaces, the latest documents, set some data, kick off a sync, etc.

You could easily deploy the HTTP server online, but that seems to take away a lot of the benefits you get from earthstar: clients would need an internet connection to get data for their UIs, it's a single point of failure for many clients, and someone malicious could easily bring it down. And because this server only understands GraphQL queries, it wouldn't make a good pub as clients are expecting this conversation to happen in a certain way.

What do you think?

cinnamon-bun commented 4 years ago

@sgwilym

Yeah, it's a lot of work to harden something for being exposed to the internet!

I think there are two ideas in play here:

  1. A localhost thing like earthstar-graphql could take on either or both of the "client" and "server" roles in the sync algorithm. "Client" would probably be the obvious choice, but it could also play "server" if it wanted to talk to other things on localhost or the local network.
  2. The ability to play the "server" role doesn't have to mean it's a "pub" in the sense of "a server that's safe to expose to the internet".

Anyway 👍 on designing this project for localhost if that's what you want to do. It sounds like HTTP would be a better choice than GraphQL for internet-hardened "pubs".

sgwilym commented 4 years ago

@cinnamon-bun Well, I think I’m changing my mind on this. 😀 My next plan for earthstar-graphql was to add filters to many of the fields, e.g.

{
    workspace(address: "+gardening.123") {
        documents(author: "toot.abc123", pathPrefix: "/diaries", after: 138238742987) {
            ... on ES3Document {
                # selections
            }
        }
    }
}

While I originally planned this for client convenience, I now see that it’s a great fit for the kind of ‘efficient sync’ you describe above: a client can send a specific query to earthstar-graphql and get everything it needs to ingest a bunch of documents into an IStorage.

I can now see a path to having something like a syncGraphQL(storage: IStorage, graphqlUrl: string) export in this package.
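
As a sketch of the first step such an export might take, the helper below turns a replication-style filter into the filtered query shown earlier. Everything here is hypothetical: `DocumentFilter` and `buildDocumentsQuery` are made-up names, the filter arguments are the planned ones rather than an existing API, and the field names passed as `selections` are left to the caller because the schema's fields aren't listed here.

```typescript
// Hypothetical filter shape, mirroring the planned arguments in the query above.
interface DocumentFilter {
  author?: string;
  pathPrefix?: string;
  after?: number;
}

// Builds the text of a filtered documents query for one workspace.
function buildDocumentsQuery(
  workspaceAddress: string,
  filter: DocumentFilter,
  selections: string[]
): string {
  const args: string[] = [];
  // JSON.stringify produces a double-quoted, escaped literal, which is also
  // a valid GraphQL string literal for ordinary strings.
  if (filter.author !== undefined) {
    args.push(`author: ${JSON.stringify(filter.author)}`);
  }
  if (filter.pathPrefix !== undefined) {
    args.push(`pathPrefix: ${JSON.stringify(filter.pathPrefix)}`);
  }
  if (filter.after !== undefined) {
    args.push(`after: ${filter.after}`);
  }
  // GraphQL does not allow empty parentheses, so omit them without filters.
  const argList = args.length > 0 ? `(${args.join(", ")})` : "";
  const fields = selections.map((field) => `        ${field}`).join("\n");
  return [
    "{",
    `  workspace(address: ${JSON.stringify(workspaceAddress)}) {`,
    `    documents${argList} {`,
    "      ... on ES3Document {",
    fields,
    "      }",
    "    }",
    "  }",
    "}",
  ].join("\n");
}
```

A caller would POST the resulting string to the GraphQL endpoint and ingest the returned documents into its storage. Passing the filter values as GraphQL variables rather than inlining them would be the more robust choice once the argument types are settled.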

(I’ve also been doing my homework on making the server better at handling deep queries, and feel better about this too... basically I spoke too soon!)